Goals

Continue practicing a collaborative data science workflow
Calculate marginal, joint, and conditional probabilities in a reproducible way
Visualize categorical data
Use visualizations and probabilities to describe the association between two categorical variables.

Merge Conflicts (uh oh)

Before you begin, it’s important to know about merge conflicts.

Merge conflicts occur when two or more people are working on the same file at the same time.

When two collaborators make changes to the same file and push the file to their repository, git merges the two files.

If these two files have conflicting content on the same line, git will produce a merge conflict. Merge conflicts need to be resolved manually, as they require human intervention.

To resolve the merge conflict, decide if you want to keep only your text, the text on GitHub, or incorporate changes from both texts. Delete the conflict markers <<<<<<<, =======, >>>>>>> and make the changes you want in the final merge.
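To show what the conflict markers look like, the sketch below writes a made-up conflicted file to a temporary location and uses base R to list the lines that still contain markers. The file contents and the labels after the markers (HEAD and a shortened commit hash) are invented for illustration. In a typical merge, the text between <<<<<<< and ======= is your local version and the text between ======= and >>>>>>> is the version from GitHub.

```r
# Made-up contents of a file in the middle of a merge conflict
conflicted <- c(
  "## Exercise 1",
  "<<<<<<< HEAD",
  "The Courage play their home games in Cary, NC.",
  "=======",
  "The NC Courage are based in Cary.",
  ">>>>>>> 1a2b3c4",
  "## Exercise 2"
)

# Write the text to a temporary file so the example is self-contained
path <- tempfile(fileext = ".Rmd")
writeLines(conflicted, path)

# Flag any line that starts with one of the three conflict markers
lines <- readLines(path)
marker_lines <- which(grepl("^(<<<<<<<|=======|>>>>>>>)", lines))
marker_lines
```

A check like this can confirm that no markers remain before you knit and commit the resolved file.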

Ultimately, the goal is to avoid merge conflicts. One way to do this is to ensure only one person is typing in the team’s R Markdown document at any given time.

See the FAQ for step-by-step instructions on troubleshooting a merge conflict.

Getting started

A repository has already been created for you and your teammates. Everyone in your team has access to the same repo.
Go to the course organization on GitHub.
You should see a repo with the lab-05 prefix.
Each person on the team should clone the repository and open a new project in RStudio. Do not make any changes to the .Rmd file until the instructions tell you to do so.

Workflow: Using git and GitHub as a team

Assign each person on your team a number 1 through 4. For teams of three, Team Member 1 can take on the role of Team Member 4.
The following exercises must be done in order. Only one person should type in the .Rmd file and push updates at a time. When it is not your turn to type, you should still share ideas and contribute to the team’s discussion.

Update YAML

Team Member 1: Change the author to your team name and include each team member’s name in the author field of the YAML in the following format. Team Name: Member 1, Member 2, Member 3, Member 4. Knit, commit, and push the changes to GitHub.

Team Members 2, 3, 4: Click the Pull button in the Git pane to get the updated document. You should see the updated name in the .Rmd file.

Packages

We will use the tidyverse and knitr packages in this lab. You can also use the viridis package for the visualizations.

library(tidyverse)
library(knitr)
library(viridis)

Data

Today, we will be working with data from the first three full seasons of the NC Courage, a highly successful National Women’s Soccer League (NWSL) team located near Duke in Cary, NC. The Courage moved to the Triangle from Western New York in 2017 and had three very successful first seasons, culminating in winning the championship game that was held at their stadium in Cary in 2019! (Data for this lab was sourced from the nwslR package on GitHub, and verified with the NC Courage website by Meredith Brown in a previous semester.)

Use the code below to load the data set.

courage <- read_csv("data/courage.csv")
courage

## # A tibble: 78 × 10
##    game_id  game_date game_number home_team away_team opponent home_pts away_pts
##    <chr>    <chr>           <dbl> <chr>     <chr>     <chr>       <dbl>    <dbl>
##  1 washing… 4/15/2017           1 WAS       NC        WAS             0        1
##  2 north-c… 4/22/2017           2 NC        POR       POR             1        0
##  3 north-c… 4/29/2017           3 NC        ORL       ORL             3        1
##  4 boston-… 5/7/2017            4 BOS       NC        BOS             0        1
##  5 orlando… 5/14/2017           5 ORL       NC        ORL             3        1
##  6 north-c… 5/21/2017           6 NC        CHI       CHI             1        3
##  7 north-c… 5/24/2017           7 NC        NJ        NJ              2        0
##  8 chicago… 5/27/2017           8 CHI       NC        CHI             3        2
##  9 north-c… 6/3/2017            9 NC        KC        KC              2        0
## 10 north-c… 6/17/2017          10 NC        BOS       BOS             3        1
## # … with 68 more rows, and 2 more variables: result <chr>, season <dbl>

Exercises

For each exercise, show all relevant code and output used to obtain your response. Write all narrative in complete sentences, and use clear axis labels and titles on visualizations.

🧶 ✅ ⬆️ Team Member 1: If you haven’t already, change the author to your team name and include each team member’s name in the author field of the YAML in the following format. Team Name: Team member 1, Team member 2, Team member 3, Team member 4.

Team member 1: Type the team’s response to Exercises 1 and 2.

How many observations are in this dataset? What does each observation represent? (You do not need to create a code chunk here)
Each season the Courage play 26 games. We want to find out whether they win more in the early, middle or late season.

Create a new variable called seasonal_category that classifies NC Courage games as early (games 1-9), middle (games 10-17), or late (games 18-26) season.
Create a new variable called win that takes the value 0 if the Courage lose and 1 if they win.
Create a 3 x 2 tibble that has the season category (early, middle, late) in one column and the proportion of wins in the second column.
Save your tibble as seasonal_courage and print it to the screen.
Describe what this table shows using the term “conditional probability”.
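The general pattern behind these steps (label groups, build a 0/1 indicator, then average within groups) can be sketched on made-up data. Everything below is hypothetical: the tibble toy_games, the three labels, and the cutoffs are invented for illustration and are not the Courage data or the lab's categories.

```r
library(tidyverse)

# Made-up data: six games, NOT the real Courage results
toy_games <- tibble(
  game_number = c(1, 2, 3, 4, 5, 6),
  result = c("win", "loss", "win", "win", "tie", "loss")
)

toy_summary <- toy_games %>%
  mutate(
    # case_when() assigns the first label whose condition is TRUE
    part = case_when(
      game_number <= 2 ~ "first",
      game_number <= 4 ~ "second",
      TRUE ~ "third"
    ),
    # if_else() turns a logical test into a 0/1 indicator
    win = if_else(result == "win", 1, 0)
  ) %>%
  group_by(part) %>%
  # the mean of a 0/1 indicator is the proportion of 1s within each group
  summarise(prop_win = mean(win))

toy_summary
```

Each value of prop_win is a proportion computed within one group only, which is what makes it a conditional rather than a marginal quantity.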

🧶 ✅ ⬆️ Team member 1: Knit, commit and push your changes to GitHub with an informative commit message. Make sure to commit and push all changed files so that your Git pane is empty afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the responses to exercises 1 and 2.

Team Member 2: It’s your turn! Type the team’s response to exercises 3 - 4.

By default, R arranges the categories of a categorical variable in alphabetical order in any output and visualizations, but we want the levels of seasonal_category to be in a logical order. To achieve this, we will use the factor() function to make this variable an ordered factor (a categorical variable whose levels have a set order) and specify the levels we wish to use.
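As a small illustration of the difference, the base R sketch below sorts a made-up character vector and then the same values stored as an ordered factor. The vector sizes and its levels are invented for this example.

```r
# Made-up categorical values
sizes <- c("small", "large", "medium", "small")

# As plain character strings, sorting is alphabetical
sort(sizes)

# As an ordered factor, sorting follows the levels we specify
sizes_fct <- factor(
  sizes,
  levels = c("small", "medium", "large"),
  ordered = TRUE
)
sort(sizes_fct)
levels(sizes_fct)
```

The character version sorts "large" first, while the factor version keeps small, medium, large in the order given to levels.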

The code to reorder levels for seasonal_category is below.

Uncomment and run the code below. NOTE: you will need to save your output from the previous question as seasonal_courage.
Add one line of code to the chunk below so that the seasonal categories print in the correct order (early then middle, then late). Hint: what dplyr verb changes the order of output?

# seasonal_courage %>%
#   mutate(seasonal_category =
#            factor(seasonal_category,
#                   levels = c("early", "middle", "late"),
#                   ordered = TRUE))

Based on the data,

what is the marginal probability the Courage win a game?
what is the conditional probability the Courage win a game given it was a home game?
Based on your findings, would you say winning is independent of whether they are playing at their home-field? Why? What does this say about home-field advantage?
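The difference between a marginal and a conditional probability can be sketched on made-up data. The tibble toy_games and its column venue are hypothetical and do not contain real Courage results: a marginal probability uses all rows, while a conditional probability restricts to the rows where the condition holds.

```r
library(tidyverse)

# Made-up data: eight games, NOT the real Courage results
toy_games <- tibble(
  venue  = c("home", "home", "home", "home", "away", "away", "away", "away"),
  result = c("win", "win", "win", "loss", "win", "loss", "loss", "tie")
)

# Marginal probability of a win: proportion of wins over all games
p_win <- toy_games %>%
  summarise(p = mean(result == "win")) %>%
  pull(p)

# Conditional probability of a win given home: keep only home games first
p_win_given_home <- toy_games %>%
  filter(venue == "home") %>%
  summarise(p = mean(result == "win")) %>%
  pull(p)

p_win
p_win_given_home
```

If winning were independent of venue, the two probabilities would be equal. In this made-up data they are 0.5 and 0.75, so the toy events are not independent.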

🧶 ✅ ⬆️ Team member 2: Knit, commit and push your changes to GitHub with an informative commit message. Make sure to commit and push all changed files so that your Git pane is empty afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the responses to exercises 3 - 4.

Team Member 3: It’s your turn! Type the team’s response to exercise 5.

Independence, contingency tables and ties.

Create a new column called home_courage that takes the value “home” if the Courage are the home team and “away” if the Courage are the away team. Save this data frame.

Using the data frame above, create a 3 x 3 contingency table (the dimensions include the column/row names) with
- columns denoting whether a game is home or away for the Courage and
- rows denoting whether the Courage win, lose, or tie.
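One possible way to build a table of this shape is to count each combination and then spread one variable across the columns. The sketch below uses made-up data; toy_games, venue, and result are hypothetical names, and the counts are not the Courage's.

```r
library(tidyverse)

# Made-up data: eight games, NOT the real Courage results
toy_games <- tibble(
  venue  = c("home", "home", "home", "home", "away", "away", "away", "away"),
  result = c("win", "win", "tie", "loss", "win", "loss", "loss", "tie")
)

toy_table <- toy_games %>%
  # count() gives one row per (result, venue) combination
  count(result, venue) %>%
  # pivot_wider() moves the venue values into their own columns;
  # values_fill = 0 would fill in any combination that never occurs
  pivot_wider(names_from = venue, values_from = n, values_fill = 0)

toy_table
```

The result has one row per outcome and one column per venue, plus the column that names the rows.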

Use the contingency table to find
- The marginal probability a game is home
- The marginal probability of a tie
- The conditional probability of a game being at home given the game was a tie

Bayes’ theorem tells us that

\[ P(\text{tie} \ | \ \text{home}) = \frac{P(\text{home} \ | \ \text{tie}) P(\text{tie})}{P(\text{home})} \]

Using Bayes’ theorem, find and report the conditional probability a game is a tie given a game is home. Check your result using the contingency table.
Finally, is the probability of a tie independent of the Courage playing at home or away? Why?
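The check described above can be sketched numerically with made-up counts. The matrix counts below is invented for illustration and does not hold the Courage's results; the point is only that the Bayes' theorem calculation and the direct calculation from the table agree.

```r
# Made-up counts, NOT the real Courage results
counts <- matrix(
  c(2, 1,
    1, 1,
    1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    result = c("win", "tie", "loss"),
    venue = c("home", "away")
  )
)
n <- sum(counts)

# Pieces of Bayes' theorem, each read off the table
p_home <- sum(counts[, "home"]) / n
p_tie <- sum(counts["tie", ]) / n
p_home_given_tie <- counts["tie", "home"] / sum(counts["tie", ])

# P(tie | home) via Bayes' theorem
bayes <- p_home_given_tie * p_tie / p_home

# P(tie | home) directly: ties at home divided by all home games
direct <- counts["tie", "home"] / sum(counts[, "home"])

all.equal(bayes, direct)
```

With these toy counts both routes give 0.25.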

🧶 ✅ ⬆️ Team member 3: Knit, commit and push your changes to GitHub with an informative commit message. Make sure to commit and push all changed files so that your Git pane is empty afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the response to exercise 5.

Team Member 4: It’s your turn! Type the team’s response to exercises 6 and 7.

Building on the data frame from the previous exercise, create 3 new columns

Create a new column called total_pts that denotes the total number of points scored in each game.
Create another new column courage_pts that shows the number of points Courage scores in each game (hint: sometimes Courage is the away team, sometimes they are the home team).
Finally, create a third column called opponent_pts that shows the number of points the opposing team scores per game.
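The idea in the hint, choosing between two columns depending on a condition, can be sketched with if_else() on made-up data. The team code "AAA" and the columns team_pts and other_pts are hypothetical and stand in for whichever team and names you are working with.

```r
library(tidyverse)

# Made-up data for a hypothetical team "AAA", NOT the Courage data
toy_games <- tibble(
  home_team = c("AAA", "BBB", "AAA"),
  away_team = c("CCC", "AAA", "BBB"),
  home_pts  = c(2, 1, 0),
  away_pts  = c(0, 3, 0)
)

toy_games <- toy_games %>%
  mutate(
    total_pts = home_pts + away_pts,
    # take the home score when AAA is at home, otherwise the away score
    team_pts = if_else(home_team == "AAA", home_pts, away_pts),
    # whatever AAA did not score was scored by the other team
    other_pts = total_pts - team_pts
  )

toy_games
```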

Plot
- Create a scatter plot of opponent_pts (y) vs courage_pts (x).
- Color the scatter plot by whether the Courage are home or away.
- Add geom_jitter as well as the line \(y = x\) using the code below.
- Facet your plot by season.
- Be sure to include informative titles for the plot, axes, and legend.

What do you notice?

# ggplot(...code here...) + 
# geom_jitter(width = 0.1, height = 0.1) + 
# geom_abline(slope=1, intercept=0)
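For reference, here is a sketch of how these layers fit together on made-up data. The tibble toy_pts and its columns are hypothetical, so the names and labels will differ from those in your own data frame.

```r
library(tidyverse)

# Made-up data, NOT the real Courage results
toy_pts <- tibble(
  team_pts  = c(2, 3, 0, 1, 2, 1),
  other_pts = c(0, 1, 0, 1, 3, 0),
  venue     = c("home", "away", "home", "away", "home", "away"),
  season    = c(2017, 2017, 2018, 2018, 2019, 2019)
)

p <- ggplot(toy_pts, aes(x = team_pts, y = other_pts, color = venue)) +
  # jitter nudges overlapping points apart slightly
  geom_jitter(width = 0.1, height = 0.1) +
  # reference line y = x: both teams scored the same number of points
  geom_abline(slope = 1, intercept = 0) +
  # one panel per season
  facet_wrap(~ season) +
  labs(
    title = "Toy example: points scored vs. points conceded",
    x = "Points scored by team",
    y = "Points scored by opponent",
    color = "Venue"
  )

p
```

Points below the line are games in which the team on the x-axis outscored its opponent, and points above the line are games it lost.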

🧶 ✅ ⬆️ Team member 4: Knit, commit and push your changes to GitHub with an informative commit message. Make sure to commit and push all changed files so that your Git pane is empty afterwards.

All other team members: Pull to get the updated documents from GitHub. Click on the .Rmd file, and you should see the team’s completed lab!

Wrapping up

Go back through your write-up to make sure you followed the coding style guidelines we discussed in class (e.g., no long lines of code).

Submission

Select one team member to upload the team’s PDF submission to Gradescope.
Be sure to select every team member’s name in Gradescope.
Associate the “Workflow & formatting” graded section with the first page of your PDF, and mark the page associated with each exercise. If any answer spans multiple pages, then mark all pages.

There should only be one submission per team on Gradescope.

Grading

Component	Points
Ex 1	2
Ex 2	8
Ex 3	3
Ex 4	8
Ex 5	12
Ex 6	11
Workflow & formatting	6

The “Workflow & formatting” grade is to assess the reproducible workflow. This includes having at least one informative commit by each team member, labeling the code chunks, and having readable code in the tidyverse syntax that does not exceed 80 characters per line (i.e., we can read all your code in the knitted PDF).
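One rough way to spot overly long lines before knitting is to count characters per line with base R. The sketch below writes a made-up file to a temporary location so it runs on its own; in practice you would point readLines() at your own .Rmd file. Note that this simple check counts every line, including narrative text, not only code.

```r
# Made-up file contents; the second line is deliberately too long
code_lines <- c(
  "courage %>%",
  paste0("  mutate(x = ", strrep("a", 80), ")"),
  "  count(x)"
)
path <- tempfile(fileext = ".Rmd")
writeLines(code_lines, path)

# Line numbers of any line longer than 80 characters
too_long <- which(nchar(readLines(path)) > 80)
too_long
```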

Lab #05: Probability and NC Courage

Due Monday February 21 at 11:59pm