5.8 Categorical Independent Variables

So far we have been dealing with continuous independent variables \(X\) (e.g., Expenditure, Years, Age, Numbers, …). In this section, we consider categorical independent variables (e.g., Gender, Ethnicity, MaritalStatus, Color-Of-Search-Button, …).

Let’s consider an example modelling how Umbrella Sales depend upon Weather.

\[\text{UmbrellaSales} = b_0 + b_1 \text{Weather}\]

These categorical variables take on one of a small set of fixed values. Let’s assume in this simple world that weather is only Sunny or Rainy.

5.8.1 Dummy Coding

Dummy Coding (the default method in R) is a method by which we create and use dummy variables in our regression model.[1]

In this example, we can define a variable, Rainy, that is 1 if Weather==Rainy and 0 if Weather==Sunny.

Rainy is called a dummy variable (sometimes called an indicator variable).

We can replace Weather with the dummy variable Rainy:

\[\text{UmbrellaSales} = b_0 + b_1 \text{Weather} \; \rightarrow \; \color{brown}{\text{UmbrellaSales} = b_0 + b_1 \text{Rainy}}\]

Thus, this breaks down into two equations (technically, a piecewise equation):

  • If Sunny, \(\text{UmbrellaSales} = b_0 + b_1(0) = b_0\)
  • If Rainy, \(\text{UmbrellaSales} = b_0 + b_1(1) = b_0 + b_1\)

Now we can interpret the value of these coefficients. Looking at the first equation, we can see that \(b_0\) is simply the average umbrella sales when it is sunny. And similarly, from the second equation, we see that \(b_0\) PLUS \(b_1\) is the average umbrella sales when it is rainy.

This means that \(b_1\) is the difference between these equations: It is the average difference in umbrella sales when it is rainy, compared to when it is sunny.
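
As a quick check of this interpretation, here is a minimal sketch in R. The data are made up for illustration, and the names `df`, `sales`, `weather`, and `rainy` are assumptions rather than anything defined in this chapter. The sketch builds the dummy variable by hand and compares the fitted coefficients with the group means.

```r
# Hypothetical toy data: umbrella sales on six days (values are made up).
df <- data.frame(
  weather = c("Sunny", "Sunny", "Sunny", "Rainy", "Rainy", "Rainy"),
  sales   = c(10, 12, 14, 58, 60, 62),
  stringsAsFactors = FALSE
)

# Dummy variable: 1 if Rainy, 0 if Sunny.
df$rainy <- as.integer(df$weather == "Rainy")

fit <- lm(sales ~ rainy, data = df)
b <- coef(fit)

mean_sunny <- mean(df$sales[df$weather == "Sunny"])
mean_rainy <- mean(df$sales[df$weather == "Rainy"])

# The intercept should match the Sunny mean, and the coefficient on
# rainy should match the Rainy-minus-Sunny difference in means.
print(b)
print(c(mean_sunny = mean_sunny, difference = mean_rainy - mean_sunny))
```

With a single dummy variable and no other predictors, the least-squares fit reproduces the two group means exactly, which is why the two printed vectors agree.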

The following table summarizes these interpretations:

| Coefficient | Interpretation |
|---|---|
| \(b_0\) | Average umbrella sales when it is Sunny |
| \(b_0+b_1\) | Average umbrella sales when it is Rainy |
| \(b_1\) | Average difference in umbrella sales when it is Rainy, compared to when it is Sunny (Sales when Rainy - Sales when Sunny) |

Q: Would you predict \(b_1\) to be greater than 0 or less than 0?

5.8.2 Dummy Coding with 3 levels

Let’s now consider a more complicated world in which Weather can take one of three values: Sunny, Rainy, or Cloudy.

Then we can define two dummy variables, Rainy and Cloudy:

  • Rainy = 1 if weather is rainy, 0 otherwise
  • Cloudy = 1 if weather is cloudy, 0 otherwise

We say that Sunny is the Reference Group for the categorical variable Weather.

\[\text{UmbrellaSales} = b_0 + b_1 \text{Rainy} + b_2\text{Cloudy}\]

This breaks down into three equations:

  • If Sunny, \(\text{UmbrellaSales} = b_0 + b_1(0) + b_2(0) = b_0\)
  • If Rainy, \(\text{UmbrellaSales} = b_0 + b_1(1) + b_2(0) = b_0 + b_1\)
  • If Cloudy, \(\text{UmbrellaSales} = b_0 + b_1(0) + b_2(1) = b_0 + b_2\)

Just like above, we can interpret the meaning of the coefficients in the following table:

| Coefficient | Interpretation |
|---|---|
| \(b_0\) | Average umbrella sales when it is Sunny |
| \(b_0+b_1\) | Average umbrella sales when it is Rainy |
| \(b_1\) | Average difference in umbrella sales when it is Rainy, compared to when it is Sunny (Sales when Rainy - Sales when Sunny) |
| \(b_0+b_2\) | Average umbrella sales when it is Cloudy |
| \(b_2\) | Average difference in umbrella sales when it is Cloudy, compared to when it is Sunny (Sales when Cloudy - Sales when Sunny) |
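
To see the two dummy variables concretely, the sketch below uses R's `model.matrix()` on a small made-up `weather` vector (the observations are hypothetical). The levels are listed with Sunny first, on the assumption that we want Sunny as the reference group, matching the equations above.

```r
# Hypothetical weather observations (made up for illustration).
weather <- factor(
  c("Sunny", "Rainy", "Cloudy", "Rainy"),
  levels = c("Sunny", "Rainy", "Cloudy")  # first level = reference group
)

# The design matrix R would build for a formula such as sales ~ weather:
# an intercept column plus one 0/1 column per non-reference level.
X <- model.matrix(~ weather)
print(X)
```

Each Sunny row has 0 in both dummy columns, which is why \(b_0\) alone describes the Sunny group.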

Thus, in general, a categorical variable with \(n\) levels is represented by \((n-1)\) dummy variables. The general interpretation of the intercept and of the coefficient \(b_i\) on the \(i\)-th dummy variable is:

| Coefficient | Interpretation |
|---|---|
| \(b_0\) | Average value of Y for the reference group. |
| \(b_i\) | Average difference in Y for dummy group i compared to the reference group. |
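
The general rule can be checked on a variable with more than three levels. The sketch below is only an illustration: it uses `chickwts`, a data set that ships with base R (chick weight by feed type, where `feed` is a factor with six levels), which is not part of this chapter's examples.

```r
# chickwts is a built-in R data set: weight (numeric) and feed (factor).
n_levels <- nlevels(chickwts$feed)

fit <- lm(weight ~ feed, data = chickwts)
b <- coef(fit)

# One intercept plus (n - 1) dummy coefficients.
print(c(n_levels = n_levels, n_dummies = length(b) - 1))

# Group means as a plain named vector, in the order of levels(chickwts$feed).
group_means <- sapply(split(chickwts$weight, chickwts$feed), mean)
ref_mean <- group_means[[1]]  # mean of the reference (first) level

# Intercept vs. reference-group mean.
print(all.equal(unname(b[1]), ref_mean))

# Each dummy coefficient vs. (group mean - reference-group mean).
print(all.equal(unname(b[-1]), unname(group_means[-1] - ref_mean)))
```

Because the model contains only the one categorical predictor, the coefficients reproduce the group means exactly.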

5.8.3 The Reference Group

When we do dummy coding, one of the groups serves as the reference group. Which group that is, however, is up to us, and choosing your reference group well (depending on your goals) will make your analyses more convenient and interpretable.

For example, for the Umbrella Sales example (Sunny, Rainy, Cloudy), I think “Sunny” is a good reference group. Why?

Or let’s say I want to see how well people react to the color of the button on my webpage. So I run an experiment with the following four buttons[2]:

  • Current Button
  • Button A
  • Button B
  • Button C

Which should I choose to be my reference group? I think that “Current Button” should be the reference group, since that is the status quo and I am interested in how changing the buttons would affect click-through, relative to my current button.

The good news: R handles dummy coding for you using factors. You do not need to create your own dummy variables. Just run:
lm(sales ~ weather, df)
and if weather is a factor with n levels, R will default to creating n-1 dummy variables.

The bad news: R does not know your hypotheses, so it uses a heuristic for choosing the reference group. If you do not specify one, R orders the levels alphabetically and uses the first level as the reference group.

Thus, in the weather example, it would choose "Cloudy", and in the button example, "Button A", as the reference group.

If your variable (df$var) is a factor, you can check by using levels(df$var). The first level will be the reference group.

Use df$var <- relevel(df$var, "desiredReferenceLevel") to change the reference group. Note that relevel() returns a new factor, so the result has to be assigned back to df$var.

(If df$var is a character vector, levels() will return NULL, but if you put it into lm(), R will treat it as a categorical variable, with the alphabetically smallest string as the reference group.)
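
The steps above can be sketched as follows. The data frame is made up for illustration (the values and the names `df`, `sales`, and `weather` are assumptions), and `stringsAsFactors = FALSE` is passed explicitly so the behavior is the same across R versions.

```r
# Hypothetical toy data (values are made up for illustration).
df <- data.frame(
  weather = c("Sunny", "Rainy", "Cloudy", "Sunny", "Rainy", "Cloudy"),
  sales   = c(10, 60, 30, 12, 62, 32),
  stringsAsFactors = FALSE
)

# weather is a character vector here, so levels() returns NULL.
print(levels(df$weather))

# Converting to a factor orders the levels alphabetically,
# so "Cloudy" becomes the reference group by default.
df$weather <- factor(df$weather)
print(levels(df$weather))
print(names(coef(lm(sales ~ weather, data = df))))

# relevel() returns a new factor, so assign the result back.
df$weather <- relevel(df$weather, ref = "Sunny")
print(levels(df$weather))
print(names(coef(lm(sales ~ weather, data = df))))
```

After releveling, the coefficient names refer to Cloudy and Rainy, each now measured relative to Sunny. An alternative with the same effect is to set the order when creating the factor, for example `factor(x, levels = c("Sunny", "Rainy", "Cloudy"))`.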

5.8.4 Interpreting categorical and continuous independent variables

Whenever we have categorical independent variables in a model, interpreting the coefficients has to be done with respect to the reference group, even for other continuous independent variables.

Let’s add a continuous variable to our 3-weather model.

  • UmbrellaSales is a continuous variable measured in $.
  • Rainy and Cloudy are dummy variables (“Sunny” is the reference group)
  • ExpenditureOnAdvertising is also a continuous variable measured in $.

Let’s say we fit the following model and obtain the following coefficients:

\[\text{UmbrellaSales} = 10 + 50 \text{Rainy} + 20 \text{Cloudy} + 2.5\text{ExpenditureOnAdvertising}\]

Here’s how we interpret each of these coefficients:

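| Coefficient | Interpretation |
|---|---|
| 10 | Average umbrella sales when it is sunny and $0 is spent on advertising. |
| 50 | Average difference in umbrella sales when it is rainy compared to sunny, at the same level of advertising expenditure (e.g., $0). |
| 20 | Average difference in umbrella sales when it is cloudy compared to sunny, at the same level of advertising expenditure (e.g., $0). |
| 2.5 | Holding Weather fixed (e.g., when it is sunny), every additional dollar spent on advertising is associated with an average increase of $2.50 in sales. |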
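
To make these interpretations concrete, the sketch below plugs values into the fitted equation above. The helper `predict_sales()` and the two advertising levels ($0 and $100) are hypothetical choices made only for this illustration.

```r
# Prediction from the fitted equation in the text:
# UmbrellaSales = 10 + 50*Rainy + 20*Cloudy + 2.5*ExpenditureOnAdvertising
predict_sales <- function(weather, ad_spend) {
  rainy  <- as.integer(weather == "Rainy")   # 1 if Rainy, 0 otherwise
  cloudy <- as.integer(weather == "Cloudy")  # 1 if Cloudy, 0 otherwise
  10 + 50 * rainy + 20 * cloudy + 2.5 * ad_spend
}

# Every weather condition at two advertising levels ($0 and $100).
grid <- expand.grid(
  weather  = c("Sunny", "Rainy", "Cloudy"),
  ad_spend = c(0, 100),
  stringsAsFactors = FALSE
)
grid$predicted <- predict_sales(grid$weather, grid$ad_spend)
print(grid)
```

In the printed grid, the Sunny row at $0 is the intercept, and the Rainy and Cloudy rows differ from the Sunny row by 50 and 20 respectively. Since this equation has no interaction terms, those gaps are the same at $100 as at $0, and moving from $0 to $100 adds 2.5 × 100 = 250 to the prediction in every weather condition.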

  1. Aside from dummy coding (R’s default), there are other coding schemes which can be used to test more specific hypotheses, e.g. effect coding, difference coding, etc. But here we shall focus on Dummy Coding.

  2. This may seem frivolous, but Google actually ran a lot of these tests back in the day, with different shades of blue/green/red to settle on their current “Google colors”.