In this lab you will…
A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the course organization on GitHub.
You should see a repo with the prefix lab08.
Each person on the team should clone the repository and open a new project in RStudio.
We will use the tidyverse and tidymodels packages in this lab.
library(tidyverse)
library(tidymodels)
The data for this lab is from the Ultra Trail Running data set featured in Tidy Tuesday. You can find the codebook with variable definitions in the Tidy Tuesday repo.
Use the code below to load the data sets into R.
ultra_rankings = read_csv("data/ultra_rankings.csv")
race = read_csv("data/race.csv")
Make sure we see all relevant code and output in the knitted PDF. If you use inline code, make sure we can still see the code used to derive that answer.
Write a narrative for each exercise.
All narrative should be written in full sentences, and visualizations should have clear titles and axis labels.
To begin, join the data frames. Save your result as ultra.
Next, clean your data frame by removing observations with duplicate runners (hint: use distinct).
Finally, drop all rows without an observed race time.
Your final result should have 60924 observations and 20 variables.
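The three steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the tibbles and their values are invented, and the column names race_year_id, runner, and time_in_seconds are assumed from the Tidy Tuesday codebook, so check them against your own data before adapting anything.

```r
library(tidyverse)

# Toy stand-ins for the two data sets; all values are invented.
toy_rankings = tibble(
  race_year_id = c(1, 1, 2, 2),
  runner = c("A", "B", "A", "C"),
  time_in_seconds = c(90000, NA, 95000, 100000)
)
toy_race = tibble(
  race_year_id = c(1, 2),
  distance = c(170, 165)
)

toy_ultra = toy_rankings %>%
  left_join(toy_race, by = "race_year_id") %>%  # attach race info to each result
  distinct(runner, .keep_all = TRUE) %>%        # keep the first row for each runner
  drop_na(time_in_seconds)                      # drop rows with no observed time

toy_ultra
```

Note that the order of the last two steps matters: removing duplicates first and missing times second can leave a different number of rows than the reverse order.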
Next we will examine races that are 170 km.
Create a histogram of race times (in seconds) using bins = 30.
Next, find the mean race time (in seconds) for all races that are 170 km.
Construct and report a 90% bootstrap confidence interval using set.seed(2) in the code chunk and 10000 reps.
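One way to build a percentile bootstrap interval for a mean is with the infer pipeline, sketched here on invented race times. The tibble toy_times and its values are hypothetical, and the sketch uses 1000 reps to run quickly, whereas the exercise asks for 10000.

```r
library(tidyverse)
library(tidymodels)  # attaches infer

set.seed(2)

# Toy race times in seconds; invented values, not the lab data.
toy_times = tibble(
  time_in_seconds = c(88000, 92000, 101000, 110000, 97000, 125000, 93000, 99000)
)

# Resample with replacement and record the mean of each resample.
boot_means = toy_times %>%
  specify(response = time_in_seconds) %>%
  generate(reps = 1000, type = "bootstrap") %>%
  calculate(stat = "mean")

# The middle 90% of the bootstrap means gives the percentile interval.
boot_ci = boot_means %>%
  get_ci(level = 0.90, type = "percentile")

boot_ci
```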
Assuming the 170 km race times are randomly sampled, does the central limit theorem hold?
Use the CLT to construct a 90% confidence interval. You can either compute
\[
\bar{x} \pm t^{*}_{n-1} \times \frac{s}{\sqrt{n}}
\]
manually or use the infer framework.
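The manual formula translates to R almost term by term. The sketch below uses invented race times (toy_times is hypothetical) and base R only. For a 90% interval, the critical value \(t^{*}_{n-1}\) is the 0.95 quantile of the t distribution, since 5% is left in each tail.

```r
# Toy race times in seconds; invented values, not the lab data.
toy_times = c(88000, 92000, 101000, 110000, 97000, 125000, 93000, 99000)

n = length(toy_times)
x_bar = mean(toy_times)
s = sd(toy_times)
t_star = qt(0.95, df = n - 1)  # 90% interval leaves 5% in each tail

# x_bar plus or minus t_star * s / sqrt(n)
ci = x_bar + c(-1, 1) * t_star * s / sqrt(n)
ci
```

As a sanity check, the result should agree with t.test(toy_times, conf.level = 0.90)$conf.int.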
Let’s return to examining all the races in ultra from exercise 1.
To begin, create a new data frame called ultra_r0 that removes all observations with a distance of 0.
It’s plausible that races with more elevation gain will take longer to complete. Let’s investigate the relationship between elevation gain and race time.
Create a scatterplot of race time vs elevation gain.
Add a linear trendline and remove the error interval with se = FALSE.
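A minimal sketch of this kind of plot on invented data is shown below. The tibble toy and the column names elevation_gain and time_in_seconds are assumptions for illustration, and the labels are placeholders to replace with your own.

```r
library(tidyverse)

# Toy data; invented values, not the lab data.
toy = tibble(
  elevation_gain = c(500, 1500, 3000, 6000, 9000),
  time_in_seconds = c(40000, 55000, 80000, 110000, 150000)
)

p = ggplot(toy, aes(x = elevation_gain, y = time_in_seconds)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +  # least-squares line, no error band
  labs(
    title = "Race time vs. elevation gain (toy data)",
    x = "Elevation gain (m)",
    y = "Race time (s)"
  )

p
```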
Write the equation of the linear model in \(x\), \(\beta\), \(y\) notation and identify \(x\) and \(y\).
Fit the linear model from the previous exercise. Write the linear model out in \(x\), \(\hat{y}\) notation, replacing \(\hat{\beta}\) with the fitted constants you found from the previous exercise.
Use the equation above (and R as a calculator) to predict race time of a race with an elevation gain of 2000 meters.
Would this equation be appropriate to predict the race time of a race with an elevation gain of 20 km?
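Using R as a calculator here means plugging a value of \(x\) into \(\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x\). The sketch below does this on invented data with base R's lm(), used as a stand-in for whatever fitting approach you choose. The data frame toy and its column names are hypothetical, and the fitted constants are not the ones you should expect from the lab data.

```r
# Toy data; invented values, not the lab data.
toy = data.frame(
  elevation_gain = c(500, 1500, 3000, 6000, 9000),
  time_in_seconds = c(40000, 55000, 80000, 110000, 150000)
)

toy_fit = lm(time_in_seconds ~ elevation_gain, data = toy)
b = coef(toy_fit)

# Plug x = 2000 into the fitted line by hand.
by_hand = b[["(Intercept)"]] + b[["elevation_gain"]] * 2000

# The same prediction via predict(), as a check.
by_predict = predict(toy_fit, newdata = data.frame(elevation_gain = 2000))

c(by_hand = by_hand, by_predict = unname(by_predict))
```

Both approaches give the same number. Neither one tells you whether the value of \(x\) falls inside the range of the data used to fit the line.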
Elevation gain is only one predictor of race time. One might also consider runner age a useful predictor of race time.
First, look at the range of ages and remove any observations from ultra_r0 with impossible ages.
Next, create a new variable called age_cat that categorizes runners based on whether they are below 65 years old or not.
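A two-level variable like this can be created with mutate() and if_else(). The sketch below uses an invented tibble (toy_runners is hypothetical), and the label text is only one possible choice.

```r
library(tidyverse)

# Toy ages; invented values, not the lab data.
toy_runners = tibble(age = c(23, 64, 65, 71))

toy_runners = toy_runners %>%
  mutate(age_cat = if_else(age < 65, "Under 65", "65 and over"))

toy_runners
```

Note that a runner who is exactly 65 falls in the "65 and over" group, since the condition is strictly below 65.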
Copy your plot from exercise 4, coloring the points by your new age category. Be sure to clean up the legend labels.
What do you notice?
Create a model with two predictors for race time, namely elevation gain and age category (above or below 65).
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]
Label \(y\), \(x_1\) and \(x_2\) in context.
Fit the regression model, using the code below as a template.
# example code
# race_fit = linear_reg() %>%
# set_engine("__") %>%
# fit(outcome ~ predictor1 + predictor2, data = __)
Interpret the meaning of the age category slope.
Did the slope of elevation gain change with the addition of age as a covariate? What might this mean?
There should only be one submission per team on Gradescope.
Component | Points |
---|---|
Ex 1 | 4 |
Ex 2 | 6 |
Ex 3 | 6 |
Ex 4 | 8 |
Ex 5 | 8 |
Ex 6 | 7 |
Ex 7 | 6 |
Workflow & formatting | 5 |