2.2 Wrangling in the tidyverse

(Here’s a great reference, or “cheat sheet”, that is very useful to refer to once you have learnt the tidyverse functions and want to refer back to it, although it’s a little outdated: https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf. Do note that pivot_longer(), covered below, is an updated version of gather() from the pdf.)

In this section we’ll briefly discuss using functions from the tidyverse package, which is a very useful package for handling data. It is a very rich package (with an accompanying book), with too much content to cover here, so we’ll only introduce some very basic functions.

If you would like to follow along, use the following code to generate the toy data that we’ll use in this example.

df_wide1 = data.frame(
  Name = c("Angela", "Brian", "Cathy", "Daniel", "Elaine", "Frank", "Grace", "Ben", "Chloe", "James", "Natalie", "Matthew", "Samantha", "David", "Vivian", "Joseph", "Lisa", "Mark", "Jane", "Jason"),
  Gender = rep(c("Female", "Male"), 10),
  Height = c(116, 110, 121, 117, 111, 114, 127, 116, 121, 116, 126, 113, 122, 113, 116, 115, 123, 122, 115, 118)
)

set.seed(1)
df_wide = data.frame(
  Name = df_wide1$Name,
  Gender = df_wide1$Gender,
  Height_Pri1 = df_wide1$Height,
  Height_Pri2 = df_wide1$Height + ceiling(rnorm(20, 5, 2))
)
df_wide$Height_Pri3 = df_wide$Height_Pri2 + ceiling(rnorm(20, 5, 2))

The %>% operator

First, we’ll introduce the %>% operator, which I refer to as the pipe operator1. The %>% operator ‘pipes’ the argument before it, into the function after it, as the first argument (by default).

  1. A %>% function is equivalent to calling function(A). The variable before %>% is by default the first argument into function(), and you can list other arguments too.

  2. A %>% function(B, C) is equivalent to calling function(A,B,C).

The nice thing about the %>% operator, is that you can chain many operations together, while maintaining readability of the code. For example,

  1. A %>% step1(B,C) %>% step2(D) %>% step3() means: Take A, apply Step1 (with other arguments B, C), and then apply Step2 with additional argument D, and then2 apply Step3. Overall, this is much easier to read as the code reads sequentially from left to right, as the code “happens”. Contrast this with: step3(step2(step1(A,B,C),D)).

NOTE: if you like, you can pipe arguments into the function at a specified position, using .. For example, B %>% function(A, ., C) is equivalent to calling function(A, B, C).

Wide-to-long: pivot_longer()

The function to transform wide data to long data is pivot_longer() (Documentation). Its first argument, like a lot of the tidyverse functions, is the data frame that we want to manipulate, and is handled by the %>% operator. Recall that the wide-form data looks like the following, with the columns we want to modify in bold:

Name Gender Height_Pri1 Height_Pri2 Height_Pri3
Angela Female 116 120 127
Brian Male 110 116 123
Cathy Female 121 125 131

The second set of arguments to pivot_longer() are the columns we want to modify. In this case, they are “Height_Pri1”, “Height_Pri2”, and “Height_Pri3”. Note that these need to be concatenated into a single vector, so use c(...) to concatenate them.

The last two arguments that we need, which are named arguments, are names_to and values_to, which give the names of the ‘key’ column and the ‘value’ column in the output long-form data frame. Let’s try the following, where I pipe df_wide into a pivot_longer() command, and save the output to a new data frame called df_long_1

df_long_1 = df_wide %>% 
  pivot_longer(c("Height_Pri1", "Height_Pri2", "Height_Pri3"), 
               names_to = "Variable", values_to = "Height") 

Let’s take a look at what this produces:

# output several selected rows
pander(df_long_1[c(1:6),], emphasize.strong.cols=3:4)
Name Gender Variable Height
Angela Female Height_Pri1 116
Angela Female Height_Pri2 120
Angela Female Height_Pri3 127
Brian Male Height_Pri1 110
Brian Male Height_Pri2 116
Brian Male Height_Pri3 123

This looks close! Notice how the names_to and values_to arguments became the names of the columns? Note also how “Name” and “Gender” variables get copied automatically?

mutate()

The last step we want to do to clean this up is to rename the Variable column and the variables in that column. For example, we want to change Height_Pri1 to something more readable, like maybe the number 1. But instead of renaming, I want to introduce the function mutate() which creates new variables.

  • mutate()’s first argument is the data frame, which again is handled by %>%
  • mutate()’s subsequent arguments follow the format new_variable = operation(). You can also stack many such operations to create many variables at the same time. For example, mutate(newVar1 = operation(), newVar2 = operation(), newVar3 = ...)

Let’s use the factor() operation to create a new factor using the values in the Variable column. We specify the levels of the factor as the original values of the Variable column, and then we use labels to rename what these values will be called in the new variable.

Specifically, let’s use mutate() to make a new variable called School_Level, which will just have values of 1, 2, 3 to refer to Primary 1, 2 or 3:

df_long_2 = df_long_1 %>% 
  mutate(School_Level = factor(Variable, 
                               levels=c("Height_Pri1", "Height_Pri2", "Height_Pri3"),
                               labels=c(1, 2, 3))) 
This produces:
Name Gender Variable Height School_Level
Angela Female Height_Pri1 116 1
Angela Female Height_Pri2 120 2
Angela Female Height_Pri3 127 3
Brian Male Height_Pri1 110 1
Brian Male Height_Pri2 116 2
Brian Male Height_Pri3 123 3

Which is great, we now have School_Level, which is a lot more readable than Variable. Finally, we can use select() to choose the names of the columns that we want to keep. Let’s just keep School_Level instead of Variable, since they have the same information (We can also use this function to re-order the columns; note the order in the following function call).

df_long_3 = df_long_2 %>% select(Name, Gender, School_Level, Height)

Which finally produces what we want:

Name Gender School_Level Height
Angela Female 1 116
Angela Female 2 120
Angela Female 3 127
Brian Male 1 110
Brian Male 2 116
Brian Male 3 123

Thus, putting it all together, we can perform all the previous steps using:

df_long = df_wide %>% 
  pivot_longer(c("Height_Pri1", "Height_Pri2", "Height_Pri3"), 
               names_to = "Variable", values_to = "Height") %>% 
  mutate(School_Level = factor(Variable, 
                               levels=c("Height_Pri1", "Height_Pri2", "Height_Pri3"),
                               labels=c(1, 2, 3))) %>% 
  select(Name, Gender, School_Level, Height)

## equivalent to:
#
# df_long_1 = df_wide %>% pivot_longer(c("Height_Pri1", "Height_Pri2", "Height_Pri3"), names_to = "Variable", values_to = "Height") 
# 
# df_long_2 = df_long_1 %>% 
#   mutate(School_Level = factor(Variable, 
#                                levels=c("Height_Pri1", "Height_Pri2", "Height_Pri3"),
#                                labels=c(1, 2, 3))) 
# 
# df_long_3 = df_long_2 %>% select(Name, Gender, School_Level, Height)
#

Long-to-wide: pivot_wider()

Finally, let’s try to go backwards, from a long-form data frame to a wide-form data frame. Let’s assume we start with df_long made from the previous section, and we want to spread it back to a wide format. The relevant function is pivot_wider(). (Documentation)

  • The first argument is the data frame, and is handled by %>%.
  • The second argument is the id columns; these are the variables that identify the observation. In this case, it is the Name and Gender columns, since they stick with each observation.
  • Then, we have names_from and values_from, which specify the names of the columns that we want to take the values from to spread them out.

Let’s try:

df_wide_test1 = df_long %>% pivot_wider(
  id_cols = c("Name", "Gender"),
  names_from = "School_Level",
  values_from = "Height")

Which gives us:

Name Gender 1 2 3
Angela Female 116 120 127
Brian Male 110 116 123
Cathy Female 121 125 131

Great! We got back a wide-form data frame. But notice how the columns are now labelled 1, 2, 3, and this makes it hard to understand what they mean? We’ll leave it as an exercise to the reader to try to convert these back into something more understandable. (There are several possible solutions!)


  1. As it behaves similar to other pipe operators in Python and Unix, the creator pronounces %>% as “and then”: Source: https://community.rstudio.com/t/how-is-pronounced/1783/12↩︎

  2. That’s why some people call %>% “and then”↩︎