6.1 Basics of Logistic Regression
Let’s imagine that we have a dataset of individual yes/no decisions that we want to predict. For example, we might have a dataset of individual consumer purchases on an e-commerce platform, where we want to predict a consumer’s decision to purchase a product based upon other variables, such as how much they spend on the platform and what their demographics are.7
Here, our dependent variable is whether or not the customer purchased the product (just “Yes” or “No”), and we can write the model with this Purchased variable on the left-hand side. Can we build a linear model to predict purchasing behaviour?
\[\text{Purchased} \sim \text{Spending} + \text{Demographics} + \ldots\]
Unfortunately, we cannot use linear regression here, because our response variable is binary (yes/no), while the linear regression model assumes a response with normally distributed errors. But we can use the generalized linear model (GLM), which, as its name suggests, generalizes the linear model to other types of response variables. The idea is that the GLM introduces a link function that maps the response variable \(Y\) onto the scale on which the linear model makes its predictions.

Here we’ll focus on logistic regression, which uses the logit function as its link function.
Let \(p\) be the probability that Purchased = 1 (i.e., “Yes”).8 Then the logistic regression equation becomes:
\[\text{logit}(p) = b_0 + b_1 X_1 + \ldots\]
Thus, instead of predicting a continuous outcome variable \(Y\) directly, we predict the log-odds of the event occurring:
\[ \text{logit}(p) \equiv \log \frac{p}{1-p}\]
This quantity is called the log-odds, where \(\frac{p}{1-p}\) is the “odds” of the event. To be clear, the logit is defined using the natural (base-\(e\)) logarithm, not the base-10 logarithm.
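To make the definition concrete, here is a minimal sketch in R (the probability value is made up): base R’s `qlogis()` computes the logit and `plogis()` computes its inverse.

```r
# Hypothetical probability, e.g. the probability that Purchased = "Yes"
p <- 0.8
odds <- p / (1 - p)   # 4: the event is four times as likely as its complement
log(odds)             # ~1.386, the log-odds, using the natural logarithm
qlogis(p)             # same value: qlogis() is the logit function
plogis(log(odds))     # back to 0.8: plogis() is the inverse of the logit
```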
The difference between the equation for linear regression (in the previous chapter) and logistic regression is summarized in the following table:
Name | Equation
---|---
Linear Regression | \(Y = b_0 + b_1 X_1 + \ldots\)
Logistic Regression | \(\text{logit}(p) = b_0 + b_1 X_1 + \ldots\)
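As a sketch of how this looks in practice, a model like the one at the start of this section could be fit in R with `glm()`, whose `binomial` family uses the logit link by default. The data frame `purchases` and its columns (`Purchased`, `Spending`, `Age`) are hypothetical stand-ins, not a dataset from this chapter.

```r
# Logistic regression: glm() with family = binomial (logit link by default).
# `purchases` is a hypothetical data frame with a binary Purchased column.
fit <- glm(Purchased ~ Spending + Age,
           data = purchases,
           family = binomial)

summary(fit)      # coefficients are on the log-odds (logit) scale
exp(coef(fit))    # exponentiating converts them to the odds scale
```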
In fact, linear regression is a special case of the generalized linear model, with the identity function as the link function.
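A quick way to see this in R, assuming a hypothetical data frame `df` with columns `y` and `x`: fitting the same formula with `lm()` and with `glm()` using the `gaussian` family (whose default link is the identity) gives the same coefficient estimates.

```r
# lm() and glm(family = gaussian) estimate the same linear model,
# because the gaussian family uses the identity link by default.
coef(lm(y ~ x, data = df))
coef(glm(y ~ x, data = df, family = gaussian))
```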
Another common example: we might want to predict people’s voting behaviour or their willingness to support certain policies, based upon their characteristics.↩︎
In general, \(p\) is the probability that the dependent variable takes on a particular value of interest (e.g., “success”). R treats binary outcome variables as factors, where the first level (e.g., `0`, `FALSE`, `No`) is the base (comparison) group and the second level (e.g., `1`, `TRUE`, `Yes`) is the “success” group. As with categorical independent variables, you can use `levels(df$var)` to check which level R will use as the base group by default.↩︎
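As a small illustration of the point in the footnote above (the data frame and variable names are hypothetical), you can inspect the factor levels and, if needed, change the base level:

```r
# Check which level R treats as the base ("failure") group:
levels(purchases$Purchased)   # e.g. "No" "Yes" -> "No" is the base group

# If a different base group is wanted, relevel() changes it:
purchases$Purchased <- relevel(purchases$Purchased, ref = "No")
```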