Today

By the end of today you will:

Understand the language and notation of linear modeling.
Use tidymodels to make inferences under a linear regression model.

To begin, let’s load the data. Today we’ll work with the Palmer penguin dataset.

In particular, we will focus on three variables: bill length, flipper length and body mass of various penguins.

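library(tidyverse)       # provides %>%, select() and glimpse()
library(palmerpenguins)  # provides the penguins data

data(penguins)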

penguins %>% 
  select(bill_length_mm, flipper_length_mm, body_mass_g) %>%
  glimpse()

## Rows: 344
## Columns: 3
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …

The language of linear modeling and the dimension of data

What is a linear model?

A linear model is a simple way to mathematically model the relationship between two or more observed measurements.

We designate one of our measurements to be the “outcome” (e.g. body mass) and one or more of the others to be “predictors” (e.g. bill length).

Simple example:

\[ \underbrace{y}_{\text{outcome}} = \beta_0 + \underbrace{x_1}_{\text{predictor}} \beta_1 \]

Vocabulary:

\(y\): “outcome”, also called “response” or “dependent variable”
\(x_1\): “predictor”, also called “regressor”, “independent variable”, “covariate”, “feature”, “the data”
\(\beta_0\), \(\beta_1\): “constants”, i.e. fixed numbers. These are population parameters. \(\beta_0\) often gets the special name “the intercept”.

The objective of linear regression is to find the best estimates of the parameters (the betas) for the purpose at hand (fitting the data, prediction, etc.).
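One common meaning of “best”, and the one assumed in this sketch, is ordinary least squares (OLS): choose the estimates that make the sum of squared differences between the observed and predicted outcomes as small as possible. For the one-predictor model above, writing \(y_i\) and \(x_{i1}\) for the outcome and predictor of the \(i\)-th of \(n\) observations (the index \(i\), the sample size \(n\), and the candidate values \(b_0\), \(b_1\) are notation introduced here for illustration), and using hats to mark the resulting estimates:

\[ (\hat{\beta}_0, \hat{\beta}_1) = \underset{b_0,\, b_1}{\arg\min} \sum_{i=1}^{n} \left( y_i - b_0 - x_{i1} b_1 \right)^2 \]

Other definitions of “best” lead to other estimates; OLS is only one of them.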

There’s no reason we couldn’t have more predictors. For example, we could model body mass using both bill length and flipper length, with the associated linear model:

\[ y = \beta_0 + x_{1} \beta_1 + x_{2} \beta_2 \]

\(y\): body mass
\(x_1\): bill length
\(x_2\): flipper length
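As a rough sketch of how this two-predictor model might be fit with tidymodels, the code below assumes the tidymodels and palmerpenguins packages are installed and that the `"lm"` (ordinary least squares) engine is the one intended. The object names `two_predictor_fit` and `coefs` are made up for illustration.

```r
library(tidymodels)      # provides linear_reg(), set_engine(), fit(), tidy()
library(palmerpenguins)  # provides the penguins data

# Fit body mass on bill length and flipper length.
# Assumption: the "lm" engine (ordinary least squares) is the one intended.
# lm() drops rows with missing values by default.
two_predictor_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ bill_length_mm + flipper_length_mm, data = penguins)

# One row per estimated coefficient: the intercept, then one slope per predictor
coefs <- tidy(two_predictor_fit)
coefs
```

The `estimate` column of `coefs` holds the fitted values for \(\beta_0\), \(\beta_1\) and \(\beta_2\), in that order.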

Building intuition for higher dimensions

offline example

2-D example

Exercise 1

Let’s find the equation of the line in the 2D example above (body mass vs bill length).

First, label your outcome and explanatory variable:

\(y\):
\(x\):

Next, modify the code from the prepare to fit a linear model to the data. This means using the data to estimate the \(\beta\)s.

Note: no need to plot or mess with residuals here.

# code here

Exercise 2

Once we fit the model to our data, we have the equation:

\[ \hat{y} = \hat{\beta}_0 + x_1 \hat{\beta}_1 \]

where the hats remind us that these are estimates computed from our sample: we don’t know the true population parameters \(\beta\) that would result from fitting the entire population.

\(\hat{y}\) is the “predicted outcome” of our model.

Write the linear model out in \(x\), \(\hat{y}\) notation, replacing \(\hat{\beta}\) with the fitted constants you found from the previous exercise.

Exercise 3

Use the equation from the previous exercise to predict the body mass of a penguin with a bill length of 50 mm.

# use R as a calculator here

If you were asked to predict the body mass of a penguin with a bill length of 70 mm, would this model be appropriate? Why or why not?

Exercise 4

Five penguins in the data actually have a bill length of 50 mm. What are the five residuals associated with these observations?

Remember: a residual \(\epsilon\) is the difference between the observed value from the data and the fitted (predicted) value:

\[ \epsilon_i = y_i - \hat{y}_i \]

# code here

For next time …

Click here to interact with an ordinary least squares (OLS) linear regression model.

Select I and move the data points around.

Describe what you see.

Intro to Linear Models

March 18 2022

Bulletin