Lab #08: CLT and Intro to Regression

Learning Goals

In this lab you will…

Getting started

Packages

We will use the tidyverse and tidymodels packages in this lab.

library(tidyverse)
library(tidymodels)

Data

The data for this lab is from the Ultra Trail Running data set featured in Tidy Tuesday. You can find the codebook with variable definitions in the Tidy Tuesday repo.

Use the code below to load the data sets into R.

ultra_rankings = read_csv("data/ultra_rankings.csv")
race = read_csv("data/race.csv")

Exercises

Instructions

Exercise 1

To begin, join the data frames. Save your result as ultra.

Your final result should have 60924 observations and 20 variables.
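As a reminder of how a join behaves, here is a minimal sketch on made-up data. The data frames rankings_toy and races_toy and the key column race_id are hypothetical stand-ins, not names from the codebook, and a left join is only one option; which join fits the lab depends on which rows you want to keep.

```r
library(dplyr)

# Toy stand-ins for the two lab data frames; `race_id` and the other
# column names are hypothetical, not taken from the codebook.
rankings_toy <- tibble(
  race_id = c(1, 1, 2, 3),
  runner  = c("A", "B", "C", "D"),
  time_hr = c(20.5, 22.1, 30.2, 18.7)
)
races_toy <- tibble(
  race_id     = c(1, 2, 3),
  distance_km = c(170, 165, 0)
)

# Keep every ranking row and attach the matching race information
joined_toy <- left_join(rankings_toy, races_toy, by = "race_id")

# Check the number of rows and columns of the result
dim(joined_toy)
```

Checking dim() after a join is a quick way to confirm the result has the number of observations and variables you expect.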

Exercise 2

Next we will examine races that are 170 km.

Exercise 3

Assuming the 170 km race times are a random sample, are the conditions for the central limit theorem satisfied?

Use the CLT to construct a 90% confidence interval for the mean race time. You can either compute

\[ \bar{x} \pm t^{*}_{n-1} \times \frac{s}{\sqrt{n}} \]

manually or use the infer framework.
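To make the formula concrete, the sketch below computes a 90% interval by hand in base R. The vector times holds made-up finishing times in hours, not values from the lab data, so the numbers are for illustration only.

```r
# Made-up finishing times in hours; not values from the lab data
times <- c(24.1, 26.3, 22.8, 30.5, 27.9, 25.4, 29.0, 23.6)

n     <- length(times)
x_bar <- mean(times)
s     <- sd(times)

# For a 90% interval, 5% sits in each tail, so use the 0.95 quantile
# of the t distribution with n - 1 degrees of freedom
t_star <- qt(0.95, df = n - 1)

margin <- t_star * s / sqrt(n)
ci <- c(lower = x_bar - margin, upper = x_bar + margin)
ci
```

A common slip is to pass 0.90 to qt(); the quantile needed is 1 - 0.10 / 2 = 0.95 because the interval is two-sided.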

Exercise 4

Let’s return to all the races in ultra, the data frame from Exercise 1.

To begin, create a new data frame that removes all observations with a distance of 0. Call this new data frame ultra_r0.
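A minimal sketch of this kind of row filter, on made-up data. The column name distance_km is a hypothetical placeholder; check the codebook for the actual variable name.

```r
library(dplyr)

# Toy data; `distance_km` is a hypothetical column name
races_toy <- tibble(
  race        = c("a", "b", "c", "d"),
  distance_km = c(170, 0, 165.3, NA)
)

# filter() keeps only rows where the condition is TRUE, so the row with
# distance 0 and the row with a missing distance are both dropped
races_nonzero <- races_toy %>%
  filter(distance_km != 0)

nrow(races_nonzero)
```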

It’s plausible that races with more elevation gain will take longer to complete. Let’s investigate the relationship between elevation gain and race time.

Exercise 5

Fit the linear model from the previous exercise. Write the fitted model in \(x\), \(\hat{y}\) notation, replacing each \(\hat{\beta}\) with the corresponding fitted coefficient.

Use the equation above (and R as a calculator) to predict the race time for a race with an elevation gain of 2000 meters.
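The sketch below shows the mechanics on made-up data: fit a simple linear model, pull out the fitted intercept and slope, and evaluate \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\) by hand. The data frame toy and its columns gain_m and time_hr are hypothetical, and the "lm" engine is an assumption made for this illustration.

```r
library(tidymodels)

# Made-up data: elevation gain in meters, finishing time in hours
toy <- tibble(
  gain_m  = c(500, 1200, 2500, 4000, 6000, 8000),
  time_hr = c(10.2, 12.9, 17.8, 24.1, 31.5, 40.3)
)

toy_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(time_hr ~ gain_m, data = toy)

# Pull the fitted intercept and slope out of the tidy() summary
coefs <- tidy(toy_fit)$estimate
b0 <- coefs[1]
b1 <- coefs[2]

# Prediction "by hand" for a gain of 2000 m
by_hand <- b0 + b1 * 2000

# The same prediction from predict(), as a cross-check
from_predict <- predict(toy_fit, new_data = tibble(gain_m = 2000))$.pred
```

Comparing by_hand with from_predict is a quick way to confirm that the equation was written out correctly.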

Would this equation be appropriate for predicting the time of a race with an elevation gain of 20 km?
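One way to think about this is to compare the new value with the range of elevation gains the model was fit on. The sketch below uses a made-up vector gain_m and assumes, as in the 2000 meter prediction above, that elevation gain is recorded in meters, so 20 km must be converted first.

```r
# Made-up elevation gains in meters; not values from the lab data
gain_m <- c(500, 1200, 2500, 4000, 6000, 8000)

# 20 km expressed in meters, to match the units of the predictor
new_gain <- 20 * 1000

# TRUE only if the new value lies inside the observed range
in_range <- new_gain >= min(gain_m) && new_gain <= max(gain_m)
in_range
```

If the new value falls outside the observed range, the prediction is an extrapolation, and the linear relationship seen in the data may not carry over.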

Exercise 6

Elevation gain is only one predictor of race time. One might also consider runner age a useful predictor of race time.

Exercise 7

Create a model with two predictors for race time, namely elevation gain and age category (above or below 65).

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]

# example code
# race_fit = linear_reg() %>%
#   set_engine("__") %>%
#   fit(outcome ~ predictor1 + predictor2, data = __)
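For reference, here is a sketch of the same pattern on made-up data, with \(x_2\) entering the model as a 0/1 indicator for the age category. The column names, the "lm" engine, and the choice to place runners aged exactly 65 in the older group are all assumptions made for this illustration.

```r
library(tidymodels)

# Made-up data; all column names are hypothetical
toy <- tibble(
  gain_m  = c(500, 1200, 2500, 4000, 6000, 8000, 3000, 5000),
  age     = c(34, 70, 45, 66, 29, 52, 68, 41),
  time_hr = c(10.2, 14.9, 17.8, 27.0, 31.5, 40.3, 22.4, 28.1)
) %>%
  # Assumption: age 65 itself goes in the "65 and over" group
  mutate(
    age_cat = factor(
      if_else(age >= 65, "65 and over", "under 65"),
      levels = c("under 65", "65 and over")
    )
  )

toy_fit2 <- linear_reg() %>%
  set_engine("lm") %>%
  fit(time_hr ~ gain_m + age_cat, data = toy)

# One row per coefficient: intercept, slope for gain, shift for age group
coef_table <- tidy(toy_fit2)
coef_table
```

Because "under 65" is the first factor level, it acts as the baseline, and the age category coefficient is the estimated difference in race time for the "65 and over" group at a fixed elevation gain.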

Submission

There should only be one submission per team on Gradescope.

Grading

Component | Points
Ex 1 | 4
Ex 2 | 6
Ex 3 | 6
Ex 4 | 8
Ex 5 | 8
Ex 6 | 7
Ex 7 | 6
Workflow & formatting | 5