library(tidyverse)
library(tidymodels)
library(scatterplot3d)
library(viridis)
By the end of today, you will…
We’ll continue examining the Palmer penguin dataset.
data(penguins)
Use ?penguins or see the palmerpenguins package documentation for more info about the dataset.
Recall that $y$ is the actual observed outcome (from the data) and $\hat{y}$ is the predicted outcome from the fitted model.
The fitted one-predictor linear model:

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1$$
We used the function linear_reg() with the lm engine to fit the model. From the documentation, the lm engine “uses ordinary least squares to fit models with numeric outcomes.”
The objective of ordinary least squares regression is to find the $\hat{\beta}$s that minimize the sum of squared residuals,

$$\sum_{i=1}^{n} \epsilon_i^2, \quad \text{where } \epsilon_i = y_i - \hat{y}_i$$
Side note: one can show that minimizing the sum of squared residuals yields the best possible $\hat{\beta}$s when certain assumptions are made, e.g. when $\epsilon_i \sim \text{Normal}$.
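To make this concrete, here is a minimal sketch (assuming the penguins data loaded above) that fits the flipper-length model used later in this lab and computes the sum of squared residuals by hand; demo_fit is just an illustrative name:

demo_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm, data = penguins)

# demo_fit$fit is the underlying lm object;
# residuals() returns each e_i = y_i - yhat_i
sum(residuals(demo_fit$fit)^2)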
Once we fit a model according to the least squares criterion above, how do we assess how well our predictors explain the outcome? We can use a statistic called $R^2$.
Math definition:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \epsilon_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$
Word definition:

$$R^2 = 1 - \frac{\text{sum of squared errors}}{\text{sum of squared distances from the mean}}$$
Let’s focus on the second term to build intuition.
The numerator, the “sum of squared errors,” is a measure of how wrong our model is (the amount of variability not explained by the model).
The denominator is proportional to the variance, i.e., the amount of variability in the data.
Together, the fraction represents the proportion of variability not explained by the model.
If the sum of squared errors is 0, then the model explains all the variability and $R^2 = 1 - 0 = 1$.
If the sum of squared errors equals the total variability in the data, then the model explains no variability and $R^2 = 1 - 1 = 0$.
Final take-away: $R^2$ measures the proportion of variability the model explains. An $R^2$ of 0 indicates a poor fit and an $R^2$ of 1 a perfect fit.
A single predictor model: flipper length explains body mass.
bm_flipper_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm, data = penguins)

glance(bm_flipper_fit) %>%
  select(r.squared)
| r.squared |
|-----------|
| 0.7589925 |
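To build intuition for the definition above, we can recover this value by computing $R^2$ by hand, a minimal sketch reusing bm_flipper_fit (note that lm() drops rows with missing values, so we do the same before computing the total variability):

complete <- penguins %>%
  drop_na(body_mass_g, flipper_length_mm)

ss_res <- sum(residuals(bm_flipper_fit$fit)^2)                        # unexplained variability
ss_tot <- sum((complete$body_mass_g - mean(complete$body_mass_g))^2)  # total variability in the data
1 - ss_res / ss_tot  # should match the r.squared reported by glance()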
Next, build a single predictor model with bill length as the predictor of body mass. Compare the $R^2$ to the model above.
# code here
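If you want a starting point, here is a minimal sketch of one possible solution, following the same pattern as the flipper-length model above (bm_bill_fit is just a suggested name):

bm_bill_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ bill_length_mm, data = penguins)

# Compare this value to the 0.7589925 from the flipper-length model
glance(bm_bill_fit) %>%
  select(r.squared)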
Next, build a multiple regression model with two predictors of body mass: bill length and flipper length.
# code here
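As a sketch of one possible approach: in R's formula syntax, additional predictors are added with +. The name bm_full_fit is just a suggestion (the hypothesis-test sketch later in this lab assumes it):

bm_full_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm + bill_length_mm, data = penguins)

glance(bm_full_fit) %>%
  select(r.squared)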
This is the model from the previous exercise:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

- $y$: body mass (g)
- $x_1$: flipper length (mm)
- $x_2$: bill length (mm)
We didn’t see a big increase in $R^2$ when adding bill length as a second predictor of body mass.
Let’s conduct a hypothesis test in a regression framework. If bill length does not help us explain body mass, then $\beta_2$ might as well be 0. Within the framework of hypothesis testing:

$$H_0: \beta_2 = 0$$
$$H_A: \beta_2 \neq 0$$
For OLS regression, our test statistic is

$$T = \frac{\hat{\beta} - 0}{SE_{\hat{\beta}}} \sim t_{n-2}$$

(Here $n-2$ is the residual degrees of freedom for a single-predictor model; in general it is $n$ minus the number of estimated coefficients, so $n-3$ for our two-predictor model.)
R takes care of much of this behind the scenes with the tidy output and reports a p-value for each $\beta$ by default.
Display your model below in tidy format. Compare the p-value to one you calculate manually using the equation above.
# code here
# code here
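One possible approach, assuming the two-predictor model from the earlier exercise is stored as bm_full_fit: display the tidy output, then rebuild the p-value for bill length from the test statistic. The residual degrees of freedom come from df.residual(), i.e., $n$ minus the number of estimated coefficients:

tidy(bm_full_fit)

# Manual p-value for bill length: T = estimate / standard error
bill_row <- tidy(bm_full_fit) %>%
  filter(term == "bill_length_mm")

t_stat <- bill_row$estimate / bill_row$std.error
df_res <- df.residual(bm_full_fit$fit)  # residual degrees of freedom (n - 3 here)

# Two-sided p-value: area in both tails of the t distribution
2 * pt(abs(t_stat), df = df_res, lower.tail = FALSE)

The result should match the p.value column in the tidy output.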
Is $\beta_{\text{bill length}}$ significant at the $\alpha = 0.05$ level? State your conclusion.
penguins %>%
  ggplot(aes(x = bill_length_mm, y = body_mass_g, color = island)) +
  geom_point() +
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Bill length (mm)", y = "Body mass (g)",
       title = "Body mass vs bill length by island", color = "Island")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).