2.2 Wrangling in the tidyverse
(Here’s a great reference, or “cheat sheet”, that is very useful to refer to once you have learnt the tidyverse
functions and want to refer back to it, although it’s a little outdated:
https://github.com/rstudio/cheatsheets/raw/master/data-transformation.pdf. Do note that pivot_longer()
, covered below, is an updated version of gather()
from the pdf.)
In this section we’ll briefly discuss using functions from the tidyverse
package, which is a very useful package for handling data. It is a very rich package (with an accompanying book), with too much content to cover here, so we’ll only introduce some very basic functions.
If you would like to follow along, use the following code to generate the toy data that we’ll use in this example.
df_wide1 = data.frame(
Name = c("Angela", "Brian", "Cathy", "Daniel", "Elaine", "Frank", "Grace", "Ben", "Chloe", "James", "Natalie", "Matthew", "Samantha", "David", "Vivian", "Joseph", "Lisa", "Mark", "Jane", "Jason"),
Gender = rep(c("Female", "Male"), 10),
Height = c(116, 110, 121, 117, 111, 114, 127, 116, 121, 116, 126, 113, 122, 113, 116, 115, 123, 122, 115, 118)
)
set.seed(1)
df_wide = data.frame(
Name = df_wide1$Name,
Gender = df_wide1$Gender,
Height_Pri1 = df_wide1$Height,
Height_Pri2 = df_wide1$Height + ceiling(rnorm(20, 5, 2))
)
df_wide$Height_Pri3 = df_wide$Height_Pri2 + ceiling(rnorm(20, 5, 2))
The %>%
operator
First, we’ll introduce the %>%
operator, which I refer to as the pipe operator1. The %>%
operator ‘pipes’ the argument before it, into the function after it, as the first argument (by default).
A %>% function
is equivalent to callingfunction(A)
. The variable before%>%
is by default the first argument intofunction()
, and you can list other arguments too.A %>% function(B, C)
is equivalent to callingfunction(A,B,C)
.
The nice thing about the %>%
operator, is that you can chain many operations together, while maintaining readability of the code. For example,
A %>% step1(B,C) %>% step2(D) %>% step3()
means: TakeA
, applyStep1
(with other argumentsB
,C
), and then applyStep2
with additional argumentD
, and then2 applyStep3
. Overall, this is much easier to read as the code reads sequentially from left to right, as the code “happens”. Contrast this with:step3(step2(step1(A,B,C),D))
.
NOTE: if you like, you can pipe arguments into the function at a specified position, using .
. For example, B %>% function(A, ., C)
is equivalent to calling function(A, B, C)
.
Wide-to-long: pivot_longer()
The function to transform wide data to long data is pivot_longer()
(Documentation). Its first argument, like a lot of the tidyverse functions, is the data frame that we want to manipulate, and is handled by the %>%
operator. Recall that the wide-form data looks like the following, with the columns we want to modify in bold:
Name | Gender | Height_Pri1 | Height_Pri2 | Height_Pri3 |
---|---|---|---|---|
Angela | Female | 116 | 120 | 127 |
Brian | Male | 110 | 116 | 123 |
Cathy | Female | 121 | 125 | 131 |
The second set of arguments to pivot_longer()
are the columns we want to modify. In this case, they are “Height_Pri1”, “Height_Pri2”, and “Height_Pri3”. Note that these need to be concatenated into a single vector, so use c(...)
to concatenate them.
The last two arguments that we need, which are named arguments, are names_to
and values_to
, which give the names of the ‘key’ column and the ‘value’ column in the output long-form data frame. Let’s try the following, where I pipe df_wide
into a pivot_longer()
command, and save the output to a new data frame called df_long_1
df_long_1 = df_wide %>%
pivot_longer(c("Height_Pri1", "Height_Pri2", "Height_Pri3"),
names_to = "Variable", values_to = "Height")
Let’s take a look at what this produces:
Name | Gender | Variable | Height |
---|---|---|---|
Angela | Female | Height_Pri1 | 116 |
Angela | Female | Height_Pri2 | 120 |
Angela | Female | Height_Pri3 | 127 |
Brian | Male | Height_Pri1 | 110 |
Brian | Male | Height_Pri2 | 116 |
Brian | Male | Height_Pri3 | 123 |
This looks close! Notice how the names_to
and values_to
arguments became the names of the columns? Note also how “Name” and “Gender” variables get copied automatically?
mutate()
The last step we want to do to clean this up is to rename the Variable
column and the variables in that column. For example, we want to change Height_Pri1
to something more readable, like maybe the number 1
. But instead of renaming, I want to introduce the function mutate()
which creates new variables.
mutate()
’s first argument is the data frame, which again is handled by%>%
mutate()
’s subsequent arguments follow the formatnew_variable = operation()
. You can also stack many such operations to create many variables at the same time. For example,mutate(newVar1 = operation(), newVar2 = operation(), newVar3 = ...)
Let’s use the factor()
operation to create a new factor using the values in the Variable
column. We specify the levels
of the factor as the original values of the Variable
column, and then we use labels
to rename what these values will be called in the new variable.
Specifically, let’s use mutate()
to make a new variable called School_Level
, which will just have values of 1, 2, 3 to refer to Primary 1, 2 or 3:
df_long_2 = df_long_1 %>%
mutate(School_Level = factor(Variable,
levels=c("Height_Pri1", "Height_Pri2", "Height_Pri3"),
labels=c(1, 2, 3)))
Name | Gender | Variable | Height | School_Level |
---|---|---|---|---|
Angela | Female | Height_Pri1 | 116 | 1 |
Angela | Female | Height_Pri2 | 120 | 2 |
Angela | Female | Height_Pri3 | 127 | 3 |
Brian | Male | Height_Pri1 | 110 | 1 |
Brian | Male | Height_Pri2 | 116 | 2 |
Brian | Male | Height_Pri3 | 123 | 3 |
Which is great, we now have School_Level
, which is a lot more readable than Variable
. Finally, we can use select()
to choose the names of the columns that we want to keep. Let’s just keep School_Level
instead of Variable
, since they have the same information (We can also use this function to re-order the columns; note the order in the following function call).
Which finally produces what we want:
Name | Gender | School_Level | Height |
---|---|---|---|
Angela | Female | 1 | 116 |
Angela | Female | 2 | 120 |
Angela | Female | 3 | 127 |
Brian | Male | 1 | 110 |
Brian | Male | 2 | 116 |
Brian | Male | 3 | 123 |
Thus, putting it all together, we can perform all the previous steps using:
df_long = df_wide %>%
pivot_longer(c("Height_Pri1", "Height_Pri2", "Height_Pri3"),
names_to = "Variable", values_to = "Height") %>%
mutate(School_Level = factor(Variable,
levels=c("Height_Pri1", "Height_Pri2", "Height_Pri3"),
labels=c(1, 2, 3))) %>%
select(Name, Gender, School_Level, Height)
## equivalent to:
#
# df_long_1 = df_wide %>% pivot_longer(c("Height_Pri1", "Height_Pri2", "Height_Pri3"), names_to = "Variable", values_to = "Height")
#
# df_long_2 = df_long_1 %>%
# mutate(School_Level = factor(Variable,
# levels=c("Height_Pri1", "Height_Pri2", "Height_Pri3"),
# labels=c(1, 2, 3)))
#
# df_long_3 = df_long_2 %>% select(Name, Gender, School_Level, Height)
#
Long-to-wide: pivot_wider()
Finally, let’s try to go backwards, from a long-form data frame to a wide-form data frame. Let’s assume we start with df_long
made from the previous section, and we want to spread it back to a wide format. The relevant function is pivot_wider()
. (Documentation)
- The first argument is the data frame, and is handled by
%>%
. - The second argument is the id columns; these are the variables that identify the observation. In this case, it is the
Name
andGender
columns, since they stick with each observation. - Then, we have
names_from
andvalues_from
, which specify the names of the columns that we want to take the values from to spread them out.
Let’s try:
df_wide_test1 = df_long %>% pivot_wider(
id_cols = c("Name", "Gender"),
names_from = "School_Level",
values_from = "Height")
Which gives us:
Name | Gender | 1 | 2 | 3 |
---|---|---|---|---|
Angela | Female | 116 | 120 | 127 |
Brian | Male | 110 | 116 | 123 |
Cathy | Female | 121 | 125 | 131 |
Great! We got back a wide-form data frame. But notice how the columns are now labelled 1
, 2
, 3
, and this makes it hard to understand what they mean? We’ll leave it as an exercise to the reader to try to convert these back into something more understandable. (There are several possible solutions!)
As it behaves similar to other pipe operators in Python and Unix, the creator pronounces
%>%
as “and then”: Source: https://community.rstudio.com/t/how-is-pronounced/1783/12↩︎That’s why some people call
%>%
“and then”↩︎