library(tidyverse)
library(tidymodels)
library(scatterplot3d)
library(viridis)
By the end of today, you will…
We’ll continue examining the Palmer penguin dataset.
data(penguins)
Use ?penguins for more info about the dataset.
Recall that \(y\) is the actual observed outcome (from the data) and \(\hat{y}\) is the predicted outcome from the fitted model.
The fitted one predictor linear model:
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x_1 \] where \(\hat{\beta}\) are the fitted estimates of the true parameters \(\beta\) that could be computed if we had the entire population data.
We used the function linear_reg with the lm engine to fit the model. From the documentation, the lm engine “uses ordinary least squares to fit models with numeric outcomes.”
The objective of ordinary least squares regression is to find the \(\hat{\beta}\) that minimize the sum of squared residuals,
\[ \sum_{i=1}^n \epsilon_i ^2 \] where \(n\) is the number of observations (rows in the data) and
\[ \epsilon_i = y_i - \hat{y_i} \]
Side note: one can show that minimizing the sum of squared residuals yields the best possible \(\hat{\beta}\)s under certain assumptions, e.g. when the errors are independent and \(\epsilon_i \sim N(0, \sigma^2)\).
Once we fit a model according to the least squares criteria above, how do we assess how well our predictors explain the outcome? We can use a statistic called \(R^2\)
Math definition:
\[ R^2 = 1 - \frac{\sum_{i=1}^n \epsilon_i^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \]
Word definition:
\[ R^2 = 1 - \frac{\text{sum of squared error}}{\text{sum of squared distances from the mean}} \]
Let’s focus on the fraction to build intuition.
The numerator, the “sum of squared error”, measures how wrong our model is (the amount of variability not explained by the model).
The denominator is proportional to the variance, i.e. the total amount of variability in the data.
Together, the fraction represents the proportion of variability not explained by the model.
If the sum of squared error is 0, then the model explains all variability and \(R^2 = 1 - 0 = 1\).
If the sum of squared error equals the total variability in the data, then the model does not explain any variability and \(R^2 = 1 - 1 = 0\).
Final take-away: \(R^2\) measures the proportion of variability in the outcome that the model explains. An \(R^2\) of 0 indicates a poor fit and an \(R^2\) of 1 a perfect fit.
A single predictor model: flipper length explains body mass.
bm_flipper_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm, data = penguins)

glance(bm_flipper_fit) %>%
  select(r.squared)
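We can verify the \(R^2\) reported by glance by computing it directly from the definition above. A sketch using the bm_flipper_fit model (the helper object names here are our own):

```r
# lm() drops rows with missing values, so drop them here too
penguins_complete <- penguins %>%
  drop_na(body_mass_g, flipper_length_mm)

y    <- penguins_complete$body_mass_g
yhat <- predict(bm_flipper_fit, new_data = penguins_complete)$.pred

sse <- sum((y - yhat)^2)       # sum of squared error
sst <- sum((y - mean(y))^2)    # total variability in the data

1 - sse / sst                  # should agree with glance(bm_flipper_fit)$r.squared
```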
Next, build a single predictor model with bill length as the predictor of body mass. Compare the \(R^2\) to the model above.
# code here
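One possible solution (the object name bm_bill_fit is our choice), mirroring the flipper-length model above:

```r
bm_bill_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ bill_length_mm, data = penguins)

# compare this R^2 to the flipper-length model's
glance(bm_bill_fit) %>%
  select(r.squared)
```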
Next, build a multiple regression model with two predictors of body mass: bill length and flipper length.
# code here
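A sketch of one solution (the object name bm_two_fit is our choice); additional predictors are added to the formula with +:

```r
bm_two_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm + bill_length_mm, data = penguins)

glance(bm_two_fit) %>%
  select(r.squared)
```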
This is the model from the previous exercise:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \]
\(y\): body mass (g)
\(x_1\): flipper length (mm)
\(x_2\): bill length (mm)
We didn’t see a big increase in \(R^2\) when adding bill length as a second predictor of body mass.
Let’s conduct a hypothesis test in a regression framework. If bill length does not help us explain body mass, then \(\beta_2\) might as well be 0. Within the framework of hypothesis testing:
\(H_0\): \(\beta_2 = 0\)
\(H_A:\) \(\beta_2 \neq 0\)
For OLS regression, our test statistic is
\[ T = \frac{\hat{\beta} - 0}{\text{SE}_{\hat{\beta}}} \sim t_{n - p - 1} \] where \(p\) is the number of predictors (here \(p = 2\), so the degrees of freedom are \(n - 3\)). We want to see if our observed statistic, \(t_{\text{obs}}\), falls far in the tail under the null.
R takes care of much of this behind the scenes with the tidy output and reports a p-value for each \(\beta\) by default.
Display your model below in tidy format. Compare the p-value to one you calculate manually using the equation above.
# code here
# code here
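A sketch of one approach (helper names est, t_obs, etc. are our own), assuming a two-predictor fit named bm_two_fit:

```r
bm_two_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm + bill_length_mm, data = penguins)

# tidy format: one row per coefficient, with estimate, std.error, statistic, p.value
tidy(bm_two_fit)

# manual p-value for the bill length coefficient
est   <- tidy(bm_two_fit) %>% filter(term == "bill_length_mm")
t_obs <- est$estimate / est$std.error

n  <- nobs(bm_two_fit$fit)   # rows actually used in the fit
df <- n - 2 - 1              # n - (number of predictors) - 1

# two-sided p-value from the t distribution; should match tidy()'s p.value
2 * pt(abs(t_obs), df = df, lower.tail = FALSE)
```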
Is \(\beta_{\text{bill length}}\) significant at the \(\alpha = 0.05\) level? State your conclusion.
penguins %>%
  ggplot(aes(x = bill_length_mm, y = body_mass_g, color = island)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  labs(x = "Bill length (mm)", y = "Body mass (g)",
       title = "Body mass vs bill length by island", color = "Island")