2.1 Basics of Data Wrangling

Data wrangling, or data munging, is the process of “wrangling” data from one format to another.

Let us imagine that you are interested in the heights of every student when they enter school for the first time at Primary 1. You spend time recording the heights and gender of each student. One very obvious format to store the data is to record your results in a three-column table, with name, gender, and height (in centimeters) being the three columns.

Name Gender Height
Angela Female 116
Brian Male 110
Cathy Female 121

This is called wide-form data, where each observation is a single row, and each variable is in one column. This is a very common way of representing data, and is very useful for plotting variables. For example, when we have data in this format, we can easily call a function to represent the heights of Male vs. Female students using a boxplot, like below (see Data Visualization chapter for more on the boxplot and other visualizations).

Now, imagine that the following two years you get a chance to repeat the study with the same set of students. You collect their heights again at Primary 2, and at Primary 3. You decide to add two new columns, as follows:

Name Gender Height_Pri1 Height_Pri2 Height_Pri3
Angela Female 116 120 127
Brian Male 110 116 123
Cathy Female 121 125 131

This is still wide-form data, as each student’s data is still only in one row, and you have merely added two new variables as new columns.

With this wide-form data, you could still do things like calculate the correlation between Height_Pri1 and Height_Pri2 using cor(df_wide$Height_Pri1, df_wide$Height_Pri2), or plot a scatterplot between their heights in Primary 1 and Primary 3, like so:

But now if you wanted to do something like look at the change of height over time, this representation may not be the easiest to do that type of analysis. Ideally we would want a case where each row had Primary School Level and Height, perhaps like:

Name Gender School_Level Height
Angela Female 1 116
Angela Female 2 120
Angela Female 3 127
Brian Male 1 110
Brian Male 2 116
Brian Male 3 123

This is called long-form data, where each response is in one row. This also means that each student, because they were measured three times, would appear in three rows. If you notice above, the student Angela, appears three times. A more general definition of long-form is one where each row contains a key-value pair. The value is the actual value of the variable (e.g., the value of “Height” in centimeters), while the key gives you the variables that are associated with that value (e.g. the “School Level” that the height was measured at).

This data representation makes it easier to do other kinds of modelling, for example to model how Height increases over time with School Level. It also allows easier plotting such as the one below, whereby each student’s height trajectory is plotted (as separate lines) against time on the horizontal axis. (Note that this is an example of longitudinal or repeated measures data, which requires advanced multilevel modelling to properly model.)

In the next section we’ll learn a little bit about the functions needed to move between wide and long-form data.