2.1 Basics of Data Wrangling
Data wrangling, or data munging, is the process of “wrangling” data from one format to another.
Let us imagine that you are interested in the heights of every student when they enter school for the first time at Primary 1. You spend time recording the heights and gender of each student. One very obvious format to store the data is to record your results in a three-column table, with name, gender, and height (in centimeters) being the three columns.
Name | Gender | Height |
---|---|---|
Angela | Female | 116 |
Brian | Male | 110 |
Cathy | Female | 121 |
This is called wide-form
data, where each observation is a single row, and each variable is in one column. This is a very common way of representing data, and is very useful for plotting variables. For example, when we have data in this format, we can easily call a function to represent the heights of Male vs. Female students using a boxplot, like below (see Data Visualization chapter for more on the boxplot and other visualizations).
Now, imagine that the following two years you get a chance to repeat the study with the same set of students. You collect their heights again at Primary 2, and at Primary 3. You decide to add two new columns, as follows:
Name | Gender | Height_Pri1 | Height_Pri2 | Height_Pri3 |
---|---|---|---|---|
Angela | Female | 116 | 120 | 127 |
Brian | Male | 110 | 116 | 123 |
Cathy | Female | 121 | 125 | 131 |
This is still wide-form
data, as each student’s data is still only in one row, and you have merely added two new variables as new columns.
With this wide-form data, you could still do things like calculate the correlation between Height_Pri1
and Height_Pri2
using cor(df_wide$Height_Pri1, df_wide$Height_Pri2)
, or plot a scatterplot between their heights in Primary 1 and Primary 3, like so:
But now if you wanted to do something like look at the change of height over time, this representation may not be the easiest to do that type of analysis. Ideally we would want a case where each row had Primary School Level
and Height
, perhaps like:
Name | Gender | School_Level | Height |
---|---|---|---|
Angela | Female | 1 | 116 |
Angela | Female | 2 | 120 |
Angela | Female | 3 | 127 |
Brian | Male | 1 | 110 |
Brian | Male | 2 | 116 |
Brian | Male | 3 | 123 |
This is called long-form
data, where each response is in one row. This also means that each student, because they were measured three times, would appear in three rows. If you notice above, the student Angela, appears three times.
A more general definition of long-form
is one where each row contains a key-value pair. The value
is the actual value of the variable (e.g., the value of “Height” in centimeters), while the key
gives you the variables that are associated with that value (e.g. the “School Level” that the height was measured at).
This data representation makes it easier to do other kinds of modelling, for example to model how Height increases over time with School Level. It also allows easier plotting such as the one below, whereby each student’s height trajectory is plotted (as separate lines) against time on the horizontal axis. (Note that this is an example of longitudinal or repeated measures data, which requires advanced multilevel modelling to properly model.)
In the next section we’ll learn a little bit about the functions needed to move between wide and long-form data.