library(tidyverse)
library(tidymodels)
library(palmerpenguins)   # provides the penguins dataset used below
library(scatterplot3d)
library(viridis)

Bulletin

Today

By the end of today, you will…

We’ll continue examining the Palmer penguin dataset.

data(penguins)

Use ?penguins or click here for more info about the dataset.
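If you'd like a quick look at the variables before modeling, here is a minimal sketch (it only summarizes the measurements that appear in today's models; nothing beyond the packages loaded above is needed):

# Peek at the structure of the penguins data
glimpse(penguins)

# Quick check of the measurements used in today's models
penguins %>%
  summarize(across(c(body_mass_g, flipper_length_mm, bill_length_mm),
                   ~ mean(.x, na.rm = TRUE)))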

Last time

One predictor model

Recall that $y$ is the actual observed outcome (from the data) and $\hat{y}$ is the predicted outcome from the fitted model.

The fitted one-predictor linear model:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$$

where the $\hat{\beta}$ are the fitted estimates of the true parameters $\beta$ that we could compute if we had data for the entire population.

  • How did we find $\hat{\beta}_0$ and $\hat{\beta}_1$?

We used the function linear_reg with the lm engine to fit the model. From the documentation, the ‘lm’ engine “uses ordinary least squares to fit models with numeric outcomes.”
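To connect the code to the notation, here is a minimal sketch of that workflow, using the flipper length model that also appears in the example later in this document (the object name one_pred_fit is just for illustration):

# Fit the one-predictor model with the lm engine (ordinary least squares)
one_pred_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm, data = penguins)

# The estimate column contains beta_0-hat (intercept) and beta_1-hat (slope)
tidy(one_pred_fit)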

Ordinary Least Squares

The objective of ordinary least squares regression is to find the $\hat{\beta}$ that minimize the sum of squared residuals,

$$\sum_{i=1}^{n} \epsilon_i^2$$

where $n$ is the number of observations (rows in the data) and

$$\epsilon_i = y_i - \hat{y}_i$$

Side note: one can show that minimizing the sum of squared residuals yields the best possible $\hat{\beta}$s when certain assumptions are made, e.g. when $\epsilon_i \sim \text{Normal}$.
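To make the least squares criterion concrete, here is a small sketch (the perturbed slope value of +5 is arbitrary) comparing the sum of squared residuals at the OLS estimates to the sum you get after nudging the slope away from its OLS value:

# Complete cases for the two variables in the one-predictor model
dat <- penguins %>% drop_na(body_mass_g, flipper_length_mm)

# OLS estimates via lm()
ols_fit  <- lm(body_mass_g ~ flipper_length_mm, data = dat)
beta_hat <- coef(ols_fit)

# Sum of squared residuals at the OLS estimates
ssr_ols <- sum((dat$body_mass_g -
                  (beta_hat[1] + beta_hat[2] * dat$flipper_length_mm))^2)

# Sum of squared residuals after perturbing the slope
beta1_alt <- beta_hat[2] + 5
ssr_alt   <- sum((dat$body_mass_g -
                  (beta_hat[1] + beta1_alt * dat$flipper_length_mm))^2)

c(ols = ssr_ols, perturbed = ssr_alt)  # the OLS value should be the smaller one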

Exercise 1

Click here to play with the interactive example

Describe what you see.

$R^2$

Once we fit a model according to the least squares criterion above, how do we assess how well our predictors explain the outcome? We can use a statistic called $R^2$.

Conceptualize $R^2$

Math definition:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \epsilon_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

Word definition:

$$R^2 = 1 - \frac{\text{sum of squared error}}{\text{sum of squared distances from the mean in the data}}$$

Let’s focus on the second term to build intuition.

  • The numerator “sum of squared error” is a measure of how wrong our model is (the amount of variability not explained by the model)

  • The denominator is proportional to the variance i.e. the amount of variability in the data.

  • Together, the fraction represents the proportion of variability not explained by the model.

If the sum of squared error is 0, then the model explains all variability and $R^2 = 1 - 0 = 1$.

If the sum of squared error is the same as all the variability in the data, i.e. the fraction equals 1, then the model does not explain any variability and $R^2 = 1 - 1 = 0$.

Final take-away: $R^2$ is a measure of the proportion of variability the model explains. An $R^2$ of 0 is a poor fit and an $R^2$ of 1 is a perfect fit.

Example: $R^2$

A single predictor model: flipper length explains body mass.

bm_flipper_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm, data = penguins)

glance(bm_flipper_fit) %>%
  select(r.squared)
##   r.squared
## 1 0.7589925
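To connect the formula above with the value just reported, here is a small sketch that recomputes $R^2$ by hand from the fitted model's residuals (using the complete cases the model was actually fit on; the object name obs is just for illustration):

# Rows actually used in the fit (complete cases on outcome and predictor)
obs <- penguins %>% drop_na(body_mass_g, flipper_length_mm)

# Numerator: sum of squared errors, taken from the underlying lm object
sse <- sum(residuals(bm_flipper_fit$fit)^2)

# Denominator: sum of squared distances of body mass from its mean
sst <- sum((obs$body_mass_g - mean(obs$body_mass_g))^2)

# R^2 = 1 - SSE / SST; this should match the glance() value above
1 - sse / sst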

Exercise 2

Next, build a single predictor model with bill length as the predictor of body mass. Compare the $R^2$ to the model above.

# code here

Exercise 3

Next, build a multiple regression model with two predictors of body mass: bill length and flipper length.

  • Before writing any code, do you think $R^2$ will increase, decrease or stay the same? Why?
# code here
  • Report $R^2$ and discuss.

Hypothesis testing in a regression framework

This is the model from the previous exercise:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

$y$: body mass (g)
$x_1$: flipper length (mm)
$x_2$: bill length (mm)

We didn’t see a big increase in $R^2$ when adding bill length as a second predictor of body mass.

Let’s conduct a hypothesis test in a regression framework. If bill length does not help us explain body mass, then $\beta_2$ might as well be 0. Within the framework of hypothesis testing:

$H_0$: $\beta_2 = 0$

$H_A$: $\beta_2 \neq 0$

For OLS regression, our test statistic is

$$T = \frac{\hat{\beta} - 0}{SE_{\hat{\beta}}} \sim t_{n-k-1}$$

where $k$ is the number of predictors in the model (here $k = 2$, so the degrees of freedom are $n - 3$).

We want to see if our observed statistic, $\hat{T}$, falls far in the tail under the null.

R takes care of much of this behind the scenes with the tidy() output and reports a p-value for each $\beta$ by default.
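For intuition, here is a sketch that recomputes the t statistic and two-sided p-value by hand, using the single-predictor flipper length model fit earlier rather than the two-predictor model from the exercise:

# Coefficient table for the single-predictor model
tidy(bm_flipper_fit)

# Pull the row for the flipper length coefficient
flipper_row <- tidy(bm_flipper_fit) %>%
  filter(term == "flipper_length_mm")

# Observed test statistic: estimate divided by its standard error
t_obs <- flipper_row$estimate / flipper_row$std.error

# Degrees of freedom: complete observations minus the two estimated parameters
n <- penguins %>% drop_na(body_mass_g, flipper_length_mm) %>% nrow()

# Two-sided p-value under the null distribution
2 * pt(abs(t_obs), df = n - 2, lower.tail = FALSE)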

Exercise 4

Display your model below in tidy format. Compare the p-value to one you calculate manually using the equation above.

# code here
# code here

Is $\beta_{\text{bill length}}$ significant at the $\alpha = 0.05$ level? State your conclusion.

Next time

Interactions

penguins %>%
  ggplot(aes(x = bill_length_mm, y = body_mass_g, color = island)) +
  geom_point() + 
  theme_bw() + 
  geom_smooth(method = 'lm', se = FALSE) +
  labs(x = "Bill length (mm)", y = "Body mass (g)", title = "Body mass vs bill length by island", color = "Island")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

  • What’s going on in the figure above?