library(tidyverse)
library(tidymodels)
library(scatterplot3d)
library(viridis)

Bulletin

Today

By the end of today, you will…

We’ll continue examining the Palmer penguin dataset.

data(penguins)

Use ?penguins or click here for more info about the dataset.

Last time

One predictor model

Recall that \(y\) is the actual observed outcome (from the data) and \(\hat{y}\) is the predicted outcome from the fitted model.

The fitted one predictor linear model:

\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x_1 \] where \(\hat{\beta}\) are the fitted estimates of the true parameters \(\beta\) that could be computed if we had the entire population data.

  • How did we find \(\hat{\beta_0}\) and \(\hat{\beta_1}\)?

We used the function linear_reg with the lm engine to fit the model. From the documentation, the ‘lm’ engine “uses ordinary least squares to fit models with numeric outcomes.”

Ordinary Least Squares

The objective of ordinary least squares regression is to find the \(\hat{\beta}\) that minimize the sum of square residuals,

\[ \sum_{i=1}^n \epsilon_i ^2 \] where \(n\) is the number of observations (rows in the data) and

\[ \epsilon_i = y_i - \hat{y_i} \]

Side note: one can show that minimizing the sum of square residuals yields the best possible \(\hat{\beta}\)s when certain assumptions are made, e.g. when \(\epsilon_i \sim Normal\)

Exercise 1

Click here to play with the interactive example

Describe what you see.

\(R^2\)

Once we fit a model according to the least squares criteria above, how do we assess how well our predictors explain the outcome? We can use a statistic called \(R^2\)

Conceptualize \(R^2\)

Math definition:

\[ R^2 = 1 - \frac{\sum_i^n \epsilon_i^2}{\sum_i^n (y_i - \bar{y})^2} \]

Word definition:

\[ R^2 = 1 - \frac{\text{sum of squared error}}{\text{sum of square distance from mean in data}} \]

Let’s focus on the second term to build intuition.

  • The numerator “sum of squared error” is a measure of how wrong our model is (the amount of variability not explained by the model)

  • The denominator is proportional to the variance i.e. the amount of variability in the data.

  • Together, the fraction represents the proportion of variability not explained by the model.

If the sum of squared error is 0, then the model explains all variability and \(R^2 = 1 - 0 = 1\).

If the proportion of error not explained is \(1\), i.e. the sum of squared error is the same as all the variability in the data, then model does not explain any variability and \(R^2 = 1 - 1 = 0\).

Final take-away: \(R^2\) is a measure of the proportion of variability the model explains. An \(R^2\) of 0 is a poor fit and \(R^2\) of 1 is a perfect fit.

Example: \(R^2\)

A single predictor model: flipper length explains body mass.

bm_flipper_fit = linear_reg() %>%
set_engine("lm") %>%
fit(body_mass_g ~ flipper_length_mm, data = penguins)

glance(bm_flipper_fit) %>%
  select(r.squared)

Exercise 2

Next, build a single predictor model with bill length as the predictor of body mass. Compare the \(R^2\) to the model above.

# code here

Exercise 3

Next, build a multiple regression model with two predictors of body mass: bill length and flipper length.

  • Before writing any code, do you think \(R^2\) will increase, decrease or stay the same? Why?
# code here
  • Report \(R^2\) and discuss

Hypothesis testing in a regression framework

This is the model from the previous exercise:

\[ y = \beta_0 + x_1 \beta_1 + x_2 \beta_2 \]

\(y\): body mass (g) \(x_1\): flipper length (mm) \(x_2\): bill length (mm)

We didn’t see a big increase in \(R^2\) when adding bill length as a second predictor of body mass.

Let’s conduct a hypothesis test in a regression framework. If bill length does not help us explain body mass, then \(\beta_2\) might as well be 0. Within the framework of hypothesis testing:

\(H_0\): \(\beta_2 = 0\)

\(H_A:\) \(\beta_2 \neq 0\)

For OLS regression, our test statistic is

\[ T = \frac{\hat{\beta} - 0}{\text{SE}_{\hat{\beta}}} \sim t_{n - 2} \] We want to see if our observed statistic, \(\hat{T}\), falls far in the tail under the null.

R takes care of much of this behind the scenes with the tidy output and reports a p-value for each \(\beta\) by default.

Exercise 4

Display your model below in tidy format. Compare the p-value to one you calculate manually using the equation above.

# code here
# code here

Is \(\beta_{\text{bill length}}\) significant at the \(\alpha = 0.05\) level? State your conclusion.

Next time

Interactions

penguins %>%
  ggplot(aes(x = bill_length_mm, y = body_mass_g, color = island)) +
  geom_point() + 
  theme_bw() + 
  geom_smooth(method = 'lm', se = F) + 
  labs(x = "Bill length (mm)", y = "Body mass (g)", title = "Body mass vs bill length by island", color = "Island")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

  • What’s going on in the figure above?