6.2 Running a Logistic Regression

The syntax for running a logistic regression is almost the same as for a linear regression, except that the call is glm(), for generalized linear model, with the additional specification family = binomial, which tells glm() to run a logistic regression. (Other family and link options produce other types of generalized linear models, such as probit regression.)
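For instance, the call pattern differs from lm() only in the family argument (a schematic sketch; y, x, and my_data are placeholder names, not the data used below):

# logistic regression: the logit link is the default for family = binomial
glm(y ~ x, family = binomial, data = my_data)
# probit regression: same binomial family, different link function
glm(y ~ x, family = binomial(link = "probit"), data = my_data)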

Now, the good news is that R handles a lot of this complication for you, when it can. For example, we do not have to calculate the odds ourselves. All we have to do is make sure our outcome variable is a binary factor, and then we can just call glm().

Note that, just as with categorical IVs, when we run a logistic regression with a categorical DV we also have a reference group, so do use the levels() function to check which level it is.

Let’s generate a simple dataset with two independent variables \(X_1\) and \(X_2\), and use them to predict \(\text{Purchase}\), a binary yes/no variable.

# logistic
set.seed(1)
df2 = data.frame(x1=rnorm(20,0,5) + seq(20,1),
                 x2=rnorm(20,5,3),
                 Purchase = factor(rep(c("Yes", "No"), each=10),
                                   levels=c("No", "Yes")))
levels(df2$Purchase) 
## [1] "No"  "Yes"
## This means that "No" is the `base group`, and `p` is the probability of "Yes".
# next, running a logistic regression via a general linear model
fit_log1 <- glm(Purchase ~ x1 + x2, family="binomial", data=df2)
summary(fit_log1)
## 
## Call:
## glm(formula = Purchase ~ x1 + x2, family = "binomial", data = df2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.96587  -0.37136   0.00399   0.54011   1.64780  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -4.0054     2.4320  -1.647   0.0996 .
## x1            0.3954     0.1653   2.393   0.0167 *
## x2           -0.1281     0.3062  -0.418   0.6758  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 27.726  on 19  degrees of freedom
## Residual deviance: 14.597  on 17  degrees of freedom
## AIC: 20.597
## 
## Number of Fisher Scoring iterations: 6

The summary output looks almost the same as that of an lm() call, too. Let’s focus on the coefficient table, and reproduce the regression equation to help us with the interpretation.

\[\text{logit}(p) \equiv \log\frac{p}{1-p} = b_0 + b_1 X_1 + b_2X_2\]

Just like in linear regression, \(b_0\) is the expected value of the left-hand side when all the \(X\)s on the right-hand side are zero. Hence, \(b_0\) is the log-odds of the event occurring when \(X_1\) and \(X_2\) are both zero (this part is the same as what we’ve covered previously). So when \(X_1\) and \(X_2\) are both zero, the log-odds of purchasing the item is -4.005. Equivalently, the odds of purchasing the item are exp(-4.005), or 0.018; that is, \(\frac{p}{1-p}\) is 0.018.
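We can verify these numbers directly from the fitted model (a quick sketch reusing the fit_log1 object from above; values are approximate because the printed coefficients are rounded):

# extract the intercept: the log-odds of "Yes" when x1 and x2 are both zero
b0 <- coef(fit_log1)["(Intercept)"]
b0       # roughly -4.005
# exponentiate to convert log-odds into odds, p/(1-p)
exp(b0)  # roughly 0.018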

Next, let’s move on to \(b_1\). \(b_1\) is the expected increase in log-odds per unit increase of \(X_1\), holding \(X_2\) constant. This is the same interpretation as in linear regression.

And similarly, \(b_2\) is the expected increase in log-odds per unit increase of \(X_2\), holding \(X_1\) constant. Note that \(b_2\) is negative, so increasing \(X_2\) decreases the log-odds (though in this case it is not statistically significant anyway).

Now, let’s take some numbers to build intuition. Every unit increase of \(X_1\) increases the log-odds by 0.3954. Equivalently, every unit increase of \(X_1\) multiplies the odds by exp(0.3954) = 1.48; i.e., the odds increase by 48%. The bullets below walk through this, and the short R check after them confirms it.

  • Check: When \(X_1\) and \(X_2\) are 0, the odds are exp(-4.005) = 0.0182.
  • If we now increase \(X_1\) to 1, the odds are now exp(-4.005 + 0.395), or 0.0271.
  • The odds have increased by (0.0271-0.0182)/0.0182 ~ 48% (if we kept more decimal places)
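The same arithmetic can be done with the fitted coefficients instead of by hand (a small sketch reusing fit_log1; the numbers match the bullets above up to rounding):

# odds of purchase at x1 = 0, x2 = 0
odds_x1_0 <- exp(coef(fit_log1)["(Intercept)"])
# odds of purchase at x1 = 1, x2 = 0
odds_x1_1 <- exp(coef(fit_log1)["(Intercept)"] + coef(fit_log1)["x1"])
# ratio of the two odds: the multiplicative effect of one unit of x1
odds_x1_1 / odds_x1_0        # roughly 1.48
# equivalently, just exponentiate the x1 coefficient
exp(coef(fit_log1)["x1"])    # roughly 1.48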

For example, if \(X_1\) is “Number of A-list celebrities endorsing your product”, then getting one additional celebrity endorsement would, in expectation, increase each customer’s odds of purchasing your product by 48% (an increase in the odds, not the probability).

This table summarizes the interpretations:

| Coefficient | Interpretation |
|-------------|----------------|
| \(b_0\) | Log-odds when \(X_1\) and \(X_2\) are both zero. Odds of purchasing = exp(-4.0054) = 0.018. |
| \(b_1\) | Expected increase in log-odds of the event per unit increase of \(X_1\), holding \(X_2\) constant. |
| \(b_2\) | Expected increase in log-odds of the event per unit increase of \(X_2\), holding \(X_1\) constant. |

The rest of the coefficient table is similar to that of lm(). However, instead of the test statistics following a \(t\) distribution, they follow a \(z\) distribution. The interpretation of the standard errors, \(z\) values, and \(p\) values is similar. Thus, in this table, the coefficient on \(X_1\) is statistically significant at the \(\alpha = 0.05\) significance level (\(p < .05\)).
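Finally, if we want odds ratios or probabilities rather than log-odds, a few standard calls on the fitted object are useful (a short sketch using fit_log1; output omitted):

# coefficients expressed as odds ratios rather than log-odds
exp(coef(fit_log1))
# Wald 95% confidence intervals on the log-odds scale
confint.default(fit_log1)
# predicted probability of "Yes" for each row of df2
predict(fit_log1, type = "response")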