library(tidyverse)
library(tidymodels)
library(scatterplot3d)
library(viridis)
By the end of today, you will…
We’ll continue examining the Palmer penguin dataset.
data(penguins)
Use ?penguins for more info about the dataset.
Recall that \(y\) is the actual observed outcome (from the data) and \(\hat{y}\) is the predicted outcome from the fitted model.
The fitted one predictor linear model:
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1} x_1 \] where \(\hat{\beta}\) are the fitted estimates of the true parameters \(\beta\) that could be computed if we had the entire population data.
We used the function linear_reg with the lm engine to fit the model. From the documentation, the lm engine “uses ordinary least squares to fit models with numeric outcomes.”
The objective of ordinary least squares regression is to find the \(\hat{\beta}\) that minimize the sum of squared residuals,
\[ \sum_{i=1}^n \epsilon_i ^2 \] where \(n\) is the number of observations (rows in the data) and
\[ \epsilon_i = y_i - \hat{y_i} \]
Side note: one can show that minimizing the sum of squared residuals yields the best possible \(\hat{\beta}\)s under certain assumptions, e.g. when the errors are independent and \(\epsilon_i \sim N(0, \sigma^2)\).
Once we fit a model according to the least squares criteria above, how do we assess how well our predictors explain the outcome? We can use a statistic called \(R^2\)
Math definition:
\[ R^2 = 1 - \frac{\sum_{i=1}^n \epsilon_i^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \]
Word definition:
\[ R^2 = 1 - \frac{\text{sum of squared error}}{\text{sum of squared distances from the mean}} \]
Let’s focus on the fraction to build intuition.
The numerator, the “sum of squared error”, measures how wrong our model is (the amount of variability not explained by the model).
The denominator is proportional to the variance, i.e. the total amount of variability in the data.
Together, the fraction represents the proportion of variability not explained by the model.
If the sum of squared error is 0, then the model explains all variability and \(R^2 = 1 - 0 = 1\).
If the sum of squared error equals the total variability in the data, then the model does not explain any variability and \(R^2 = 1 - 1 = 0\).
Final take-away: \(R^2\) measures the proportion of variability in the outcome that the model explains. An \(R^2\) of 0 indicates a poor fit and an \(R^2\) of 1 a perfect fit.
A single predictor model: flipper length explains body mass.
bm_flipper_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm, data = penguins)

glance(bm_flipper_fit) %>%
  select(r.squared)
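We can verify the \(R^2\) reported by glance by computing it directly from the definition above. A sketch using the bm_flipper_fit model (the helper object names here are our own):

```r
# lm() drops rows with missing values, so drop them here too
penguins_complete <- penguins %>%
  drop_na(body_mass_g, flipper_length_mm)

y    <- penguins_complete$body_mass_g
yhat <- predict(bm_flipper_fit, new_data = penguins_complete)$.pred

sse <- sum((y - yhat)^2)       # sum of squared error
sst <- sum((y - mean(y))^2)    # total variability in the data

1 - sse / sst                  # should agree with glance(bm_flipper_fit)$r.squared
```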
Next, build a single predictor model with bill length as the predictor of body mass. Compare the \(R^2\) to the model above.
# code here
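One possible solution (the object name bm_bill_fit is our choice), mirroring the flipper-length model above:

```r
bm_bill_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ bill_length_mm, data = penguins)

# compare this R^2 to the flipper-length model's
glance(bm_bill_fit) %>%
  select(r.squared)
```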
Next, build a multiple regression model with two predictors of body mass: bill length and flipper length.
# code here
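A sketch of one solution (the object name bm_two_fit is our choice); additional predictors are added to the formula with +:

```r
bm_two_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm + bill_length_mm, data = penguins)

glance(bm_two_fit) %>%
  select(r.squared)
```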
This is the model from the previous exercise:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon \]
\(y\): body mass (g)
\(x_1\): flipper length (mm)
\(x_2\): bill length (mm)
We didn’t see a big increase in \(R^2\) when adding bill length as a second predictor of body mass.
Let’s conduct a hypothesis test in a regression framework. If bill length does not help us explain body mass, then \(\beta_2\) might as well be 0. Within the framework of hypothesis testing:
\(H_0\): \(\beta_2 = 0\)
\(H_A:\) \(\beta_2 \neq 0\)
For OLS regression, our test statistic is
\[ T = \frac{\hat{\beta} - 0}{\text{SE}_{\hat{\beta}}} \sim t_{n - p - 1} \] where \(p\) is the number of predictors (here \(p = 2\), so the degrees of freedom are \(n - 3\)). We want to see if our observed statistic, \(t_{\text{obs}}\), falls far in the tail under the null.
R takes care of much of this behind the scenes with the tidy output and reports a p-value for each \(\beta\) by default.
Display your model below in tidy format. Compare the p-value to one you calculate manually using the equation above.
# code here
# code here
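A sketch of one approach (helper names est, t_obs, etc. are our own), assuming a two-predictor fit named bm_two_fit:

```r
bm_two_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(body_mass_g ~ flipper_length_mm + bill_length_mm, data = penguins)

# tidy format: one row per coefficient, with estimate, std.error, statistic, p.value
tidy(bm_two_fit)

# manual p-value for the bill length coefficient
est   <- tidy(bm_two_fit) %>% filter(term == "bill_length_mm")
t_obs <- est$estimate / est$std.error

n  <- nobs(bm_two_fit$fit)   # rows actually used in the fit
df <- n - 2 - 1              # n - (number of predictors) - 1

# two-sided p-value from the t distribution; should match tidy()'s p.value
2 * pt(abs(t_obs), df = df, lower.tail = FALSE)
```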
Is \(\beta_{\text{bill length}}\) significant at the \(\alpha = 0.05\) level? State your conclusion.
penguins %>%
  ggplot(aes(x = bill_length_mm, y = body_mass_g, color = island)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw() +
  labs(x = "Bill length (mm)", y = "Body mass (g)",
       title = "Body mass vs bill length by island", color = "Island")