Chapter 5 The Linear Model I: Linear Regression

In the next few Chapters, we will learn about a workhorse tool of analytics: the linear model, which allows us to run linear regression models to estimate simple trends and examine how some variables in our data may affect others. The linear model is a key tool in many fields, from psychology, economics, linguistics, and business to natural sciences like ecology. Thus, it is an essential part of the data scientist’s toolkit.

In this Chapter, we will go over linear regression with one or more independent variables to predict a continuous dependent variable. We will also discuss how to interpret the output of a simple regression model and how to handle categorical independent variables.
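As a preview, and using one common notation (the symbols here may differ slightly from those used later in the chapter), the linear model with \(k\) independent variables can be written as

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \epsilon \]

where \(Y\) is the dependent variable, the \(X\)'s are the independent variables, the \(\beta\)'s are the regression coefficients we estimate from the data, and \(\epsilon\) is random error. Simple regression is the special case with \(k = 1\).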

The learning objectives for this chapter are:

  • Readers should be able to understand and use both simple and multiple regression to estimate simple trends.

  • Readers should be able to interpret the output of a regression model, including regression coefficients, confidence intervals, hypothesis testing about coefficients, and goodness-of-fit statistics.

  • Readers should be able to interpret the meaning of dummy-coded variables used to model categorical independent variables.

# Load the libraries we'll use in this chapter
library(tidyverse) 

I’ve also simulated some data that we will use to illustrate linear regression over the next few sections. Below is the code I used to generate df1: \(X\) is just the integers from 0 to 10, and \(Y\) is an affine transformation of \(X\) (intercept \(-2\), slope \(2\)) with some random normal noise added.

# Simulate the data: X is the integers 0 through 10, and Y is a linear
# function of X (intercept -2, slope 2) plus normally distributed noise
set.seed(1)
df1 = data.frame(X = seq(0, 10))
df1$Y = -2 + 2*df1$X + rnorm(n = nrow(df1), mean = 0, sd = 1)
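To give a quick sense of where we are headed, here is a minimal sketch of fitting a simple regression of \(Y\) on \(X\) in df1 using R's built-in lm() function and printing a summary of the fit. The object name fit1 is just for illustration; we will unpack this output over the rest of the chapter.

# A minimal preview: fit a simple linear regression of Y on X in df1.
# The estimated intercept and slope should land close to the true values
# (-2 and 2) that we used to simulate the data.
fit1 = lm(Y ~ X, data = df1)
summary(fit1)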