5.4 Interpreting the output of a regression model

In this section we’ll go over the different parts of the linear model output: first the coefficient table, then the goodness-of-fit statistics.

Let’s re-run the same model from before:

fit1 <- lm(Y~X, df1)
summary(fit1)
## 
## Call:
## lm(formula = Y ~ X, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0781 -0.5736  0.1260  0.3071  1.5452 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.26117    0.46171  -4.897 0.000851 ***
## X            2.10376    0.07804  26.956 6.44e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8185 on 9 degrees of freedom
## Multiple R-squared:  0.9878, Adjusted R-squared:  0.9864 
## F-statistic: 726.6 on 1 and 9 DF,  p-value: 6.442e-10

First, summary() helpfully reiterates the formula that you put in. This is useful for checking that R ran the model you intended.

Call:
lm(formula = Y ~ X, data = df1)
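If you want to double-check this programmatically, you can also pull the formula and the original call out of the fitted object. A minimal sketch:

# The formula the model was fit with
formula(fit1)   # Y ~ X

# The full call, including the data argument
fit1$call       # lm(formula = Y ~ X, data = df1)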

It also tells you the minimum, 1st quartile (25th percentile), median, 3rd quartile (75th percentile), and maximum of the residuals (\(e_i = Y_i - \hat{Y_i}\)). That is, the minimum residual error of this model is -1.0781, the median residual error is 0.1260, and the maximum is 1.5452.

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0781 -0.5736  0.1260  0.3071  1.5452 
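You can reproduce these numbers yourself, since they are just quantiles of the residuals. A quick sketch:

# Five-number summary of the residuals, matching the "Residuals:" line above
quantile(residuals(fit1))

# summary() gives the same numbers, plus the mean
summary(residuals(fit1))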

5.4.1 The coefficient table

Let’s turn next to the coefficient table.

summary(fit1)$coeff
##              Estimate Std. Error   t value    Pr(>|t|)
## (Intercept) -2.261167 0.46170984 -4.897376 8.50721e-04
## X            2.103757 0.07804321 26.956313 6.44250e-10

Let’s focus on the “Estimate” column. These are the point estimates of \(b_0\) and \(b_1\) in the equation \[Y= b_0 + b_1 X\]
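If you just want these point estimates, without the standard errors and tests, one convenient way is coef():

# Named vector of the estimated coefficients
coef(fit1)   # should match the Estimate column above: -2.261167 and 2.103757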

What do these numbers mean?

\(b_0\): The mean value of \(Y\) when \(X\) is zero

The meaning of the intercept, \(b_0\), is pretty straightforward. It is the average value of the dependent variable \(Y\) when the independent variable \(X\) is set to 0. (Graphically, it is the vertical intercept: the point at which the line crosses the vertical axis.)

\(b_1\): According to the model, a one-unit change in \(X\) results in a \(b_1\)-unit change in \(Y\)

The coefficient on \(X\), \(b_1\), captures the magnitude of the change in \(Y\) per unit change in \(X\). Graphically, this is the slope of the regression line: the larger \(b_1\) is in magnitude, the steeper the slope, and the smaller \(b_1\) is in magnitude, the shallower the slope. If \(b_1\) is positive, the line slopes upwards (/); if \(b_1\) is negative, it slopes downwards (\).
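One way to see both interpretations concretely is to make predictions from the model at a few values of \(X\). This sketch uses fit1 from above; the particular values of \(X\) (0, 5, and 6) are arbitrary choices for illustration:

b <- coef(fit1)

# Intercept: the predicted Y when X = 0
predict(fit1, newdata = data.frame(X = 0))   # equals b["(Intercept)"]

# Slope: increasing X by one unit changes the predicted Y by b1
p <- predict(fit1, newdata = data.frame(X = c(5, 6)))
p[2] - p[1]                                  # equals b["X"]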

Example: Interpreting Simple Regression Coefficients

Let’s go through an example. Let’s say we fit a model to predict our monthly profit given the amount that we spent on advertising. Both Profit and Expenditure are measured in $.

\[\text{Profit} = -2500 + 3.21* \text{ExpenditureOnAdvertising}\]

Coefficient   Interpretation
\(b_0\)       Monthly profit is -$2500 without any money spent on advertising.
\(b_1\)       For every dollar spent on Advertising, Profit increases by $3.21.

Q: Why could profit be negative here?

Negative (or otherwise unusual) intercepts arise all the time in linear regression. In this example, this just means that, if we spent $0 on advertising, we would still incur a negative profit of $2,500, which could be due to omitted variables such as the amount we have to spend on rent, wages, and other upkeep.

Note that it is very important to be aware of the units in which each of the variables, both \(Y\) and \(X\), is measured. This ensures accurate interpretation of the coefficients!
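As a quick numerical check of these interpretations, here is a small sketch that plugs a few advertising budgets (made up for illustration) into the fitted equation above:

# Hypothetical advertising budgets, in dollars
spend <- c(0, 1000, 1001)

# Predicted monthly profit from Profit = -2500 + 3.21 * ExpenditureOnAdvertising
profit <- -2500 + 3.21 * spend
profit
# Spending $0 leaves a profit of -$2500 (the intercept), and going from
# $1000 to $1001 increases the predicted profit by $3.21 (the slope)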

The rest of the coefficient table

The estimated values of \(b_0\) and \(b_1\) are given in the first column (Estimate) of the coefficient table. Next to the estimates, we have the standard errors of \(b_0\) and \(b_1\), which give us a sense of the uncertainty associated with our estimates.

In the third column, we have the t value. This is the t-statistic for a one-sample t-test comparing this coefficient to zero. That is, it is the one-sample t-test for the null hypothesis that the coefficient is zero, against the alternative, two-sided hypothesis that it is not zero: \[ H_0: b_j = 0 \\ H_1: b_j \neq 0 \]

In fact, the t value here is simply the Estimate divided by the Standard Error. (You can check it yourself!) So with this t value and the residual degrees of freedom of the model, we can calculate the p value for such a t-test. R helpfully does this for you, and this is given in the fourth column, Pr(>|t|). We can see that these numbers in this example are quite small, so both \(b_0\) and \(b_1\) are statistically significantly different from zero.
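Here is a sketch of that check, using the coefficient table from fit1 and its residual degrees of freedom (9 in this example):

coefs <- summary(fit1)$coefficients

# The t value is the Estimate divided by the Standard Error
coefs[, "Estimate"] / coefs[, "Std. Error"]   # matches the "t value" column

# Two-sided p value from the t distribution with the residual degrees of freedom
tvals <- coefs[, "t value"]
2 * pt(abs(tvals), df = df.residual(fit1), lower.tail = FALSE)   # matches Pr(>|t|)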

To the right of the Pr(>|t|) column, R will helpfully print out certain significance codes.

  • If \(p\) is between 0.05 and 0.1, R will print a period (.).
  • If \(p\) is less than 0.05 (\(\alpha\)=5% level of significance) but greater than 0.01 (1%), R will print a single asterisk (*).
  • If \(p\) is less than 0.01 but greater than 0.001, R will print out two asterisks (**).
  • Finally, if \(p\) is less than 0.001, R will print out three asterisks (***).
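These codes come from fixed cutoffs on the p value, so you can reproduce the same mapping yourself. A sketch using symnum():

p <- summary(fit1)$coefficients[, "Pr(>|t|)"]

# Map each p value onto the significance codes listed above
symnum(p, corr = FALSE, na = FALSE,
       cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1),
       symbols = c("***", "**", "*", ".", " "))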

5.4.2 Goodness-of-fit statistics

Finally we’ll look at the last part of the summary output.

## Residual standard error: 0.8185 on 9 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9878, Adjusted R-squared:  0.9864 
## F-statistic: 726.6 on 1 and 9 DF,  p-value: 6.442e-10

First, note that R will helpfully tell us if any observations were dropped from our data because of missing values.

(1 observation deleted due to missingness)

If, for any data point, either the \(X\) value or the \(Y\) value (or both) is missing, then R will remove that observation from the linear model and report it in the output. This is always something useful to check: do we have an abnormally large number of missing observations that we did not expect? For example, perhaps one of the variables has a large number of missing values? Or maybe when we were calculating new variables, we did not consider certain situations, and so ended up with a lot of missing values. (Or maybe we made a typo in our code!) This is always a good safety check before proceeding further. (Note that if there are no missing observations, R will omit this line.)
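A couple of quick ways to run this safety check, as a sketch:

# How many observations did the model actually use?
nobs(fit1)

# How many missing values are there in each column of the data?
colSums(is.na(df1))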

Next, we’ll discuss a very important statistic called the coefficient of determination, or \(R^2\) (“R-squared”), which measures the proportion of variance explained by the model. \(R^2\) always lies between 0 and 1. An \(R^2\) of 1 means the model is perfect: it explains all of the variance (all the data points lie exactly on the line; equivalently, all the residuals are 0).

The total amount of variability in the data is captured by the Total Sum of Squares, which is the sum of the squared differences between each data point \(Y_i\) and the mean \(\bar{Y}\) (this is also related to the variance of \(Y\)): \[\begin{align} \text{Total Sum of Squares} \equiv \sum_i \left(Y_i - \bar{Y} \right)^2 \end{align}\]

The amount of variability that is explained by our model (which predicts \(\hat{Y}\)) is given by the Regression Sum of Squares, which is the sum of the squared differences between our model predictions \(\hat{Y_i}\) and the mean \(\bar{Y}\):

\[\begin{align} \text{Regression Sum of Squares} \equiv \sum_i \left(\hat{Y_i} - \bar{Y} \right)^2 \end{align}\]

And finally, the leftover amount of variability, also called the Residual Sum of Squares, is the sum of the squared differences between the actual data points \(Y_i\) and our model predictions \(\hat{Y_i}\). This is the quantity that Ordinary Least Squares regression tries to minimize, as we saw in the previous section.

\[\begin{align} \text{Residual Sum of Squares} \equiv \sum_i \left(Y_i - \hat{Y_i} \right)^2 \end{align}\]

As it turns out, the Total Sum of Squares is made up of these two parts: the Regression Sum of Squares (or “Explained” Sum of Squares), and the Residual Sum of Squares (or the “Unexplained” Sum of Squares).

\[\begin{align} \text{Total Sum of Squares} = \text{Regression Sum of Squares} + \text{Residual Sum of Squares} \end{align}\]

\(R^2\) basically measures the proportion of explained variance over the total variance. In other words:

\[\begin{align} R^2 &\equiv \frac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}} \\ &= 1 - \frac{\text{Residual Sum of Squares}}{\text{Total Sum of Squares}} \end{align}\]
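We can verify the decomposition and compute \(R^2\) by hand from fit1. A sketch:

# Fitted values, and the Y values actually used in the fit
Yhat <- fitted(fit1)
Y    <- Yhat + residuals(fit1)

TSS   <- sum((Y - mean(Y))^2)      # Total Sum of Squares
RegSS <- sum((Yhat - mean(Y))^2)   # Regression ("explained") Sum of Squares
RSS   <- sum((Y - Yhat)^2)         # Residual ("unexplained") Sum of Squares

TSS - (RegSS + RSS)                # essentially zero: the decomposition holds
RegSS / TSS                        # R-squared
1 - RSS / TSS                      # the same R-squared
summary(fit1)$r.squared            # matches the "Multiple R-squared" in the output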

You can read off the \(R^2\) value from the field indicated by “Multiple R-squared”, i.e.,

## Multiple R-squared: 0.9878, …

In the output above, the \(R^2\) is 0.9878; this means that this model explains 98.8% of the variance. That’s really high!

Now, how good is a good \(R^2\)? Unfortunately there’s no good answer, because it really depends on the context. In some fields and in some contexts, even an \(R^2\) of .10 to .20 could be really good. In other fields, we might expect \(R^2\)s of .80 or .90!