Homework #02: Data Wrangling and Joins

Due February 4th at 11:59 PM

Goals

For this assignment you must have at least three meaningful commits and all of your code chunks must have informative names.

For your first commit, update your author name in the YAML header of the template R Markdown file.

All plots should follow best visualization practices: include an informative title, label the axes, and carefully consider aesthetic choices.

All code should follow the tidyverse style guide, including keeping each line to at most 80 characters.

For every join function, you should explicitly specify the by argument.
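As a minimal sketch of what an explicit by argument looks like, the example below joins two toy tibbles whose key columns have different names. The table names, column names, and values are hypothetical and are not the assignment data.

```r
library(tidyverse)

# Hypothetical toy tables, not the assignment data
schools <- tibble(
  school = c("A College", "B University", "C College"),
  state_abb = c("NC", "VA", "NC")
)
states <- tibble(
  state = c("NC", "VA"),
  region = c("South", "South")
)

# Name the key columns explicitly rather than letting the join
# guess from whatever column names the two tables share
joined <- schools %>%
  left_join(states, by = c("state_abb" = "state"))

joined
```

Spelling out by makes the matching columns visible to the reader and avoids the message that dplyr prints when it has to infer the key.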

Setup

College Rankings and State Characteristics

We will work with the tidyverse package as usual. You may also want to use viridis.

library(tidyverse)
library(viridis)

The U.S. News rankings are an influential but controversial metric in the college application process.

A brief description of the data sets for this lab and how they are related to each other is provided below.

The natunivs data set contains all schools in the National Universities category ranked 50 or above in the current (2022) rankings. Data on school rankings comes from Andy Reiter, with several blank values filled in by the professor. Observations are uniquely identified by school.

The variables in this data set are:

The slacs data set contains all schools in the National Liberal Arts Colleges category ranked 50 or above in the current (2022) rankings. Data on school rankings comes from Andy Reiter. Observations are uniquely identified by school.

The variables in this data set are the same as above.

The state_data data set contains three variables related to the characteristics of a state:

State economic outlook scores come from richstatespoorstates and are compiled from 15 markers of economic stability, including minimum wage, state tax burden and employment levels. The 2020 population data comes from the US Census.

Looking at this data

  1. Let’s start by creating a data set that includes information from all three data sets.

The final full_data data frame should have 107 observations and 7 variables.

We will use full_data for the remainder of the assignment. (Note that there are more than 100 observations total due to ties at 50.)

  2. Which states have the most schools in the full_data data set? Find the number of schools by state. Then, order these states from greatest to least and return the 5 states with the most schools on the list. Report these 5 states.
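One common way to count rows per group and keep the largest groups is to chain count(), arrange(), and slice_head(). The sketch below uses a made-up pets table (hypothetical names and values, not the assignment data), so it illustrates the general pattern only.

```r
library(tidyverse)

# Hypothetical toy table, not the assignment data
pets <- tibble(
  owner = c("Ann", "Ann", "Bo", "Cy", "Cy", "Cy"),
  pet = c("cat", "dog", "cat", "fish", "dog", "cat")
)

top_owners <- pets %>%
  count(owner) %>%      # one row per owner; the row count is stored in n
  arrange(desc(n)) %>%  # largest counts first
  slice_head(n = 2)     # keep only the first two rows

top_owners
```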

  3. Which states do not have a school in the full_data data set? Use the state_data data set and an appropriate join to help answer this question. Return a data set with two variables, state abbreviation and state population, ordered from greatest population to least. Show all code and output, and print the state abbreviations and populations. Which state has the largest population among those without a school in the full_data data set?

Hint: to select a column whose name begins with a number, wrap the column name in backticks, e.g.

data %>%
  select(`2020pop`)
  4. Recreate the plot below. Use a dplyr command to create the variable it requires. After recreating the plot, discuss what patterns you observe. Hint: create a new variable that indicates whether a state is in the top 25 or bottom 25 of economic outlook in 2020. Note that ranks 1-25 are the top ranking while ranks 26-50 are the bottom ranking.

  5. Is there a relationship between state population and the number of schools it has in the full_data data set? To answer this question, first use the code from exercise 2 to create a new data set called counts with the number of schools by state as a column.
  6. Let’s now focus on North Carolina schools in the full_data data set. For these schools, create a new variable that indicates the change in ranking from 2021 to 2022, where a positive value indicates an improved ranking (e.g., if a school went from 11 to 10, this variable should have a value of positive 1). Finally, return a tibble that shows the names of the NC schools and the new variable you created. Discuss what you observe.
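Because a smaller rank number is a better rank, the sign convention above means subtracting the newer rank from the older one. A minimal sketch on a toy table follows; the column names rank_old and rank_new and all values are hypothetical, not the names used in the assignment data.

```r
library(tidyverse)

# Hypothetical toy table, not the assignment data
teams <- tibble(
  team = c("Red", "Blue", "Green"),
  rank_old = c(11, 4, 7),
  rank_new = c(10, 6, 7)
)

rank_changes <- teams %>%
  # old minus new: moving from 11 to 10 gives +1 (an improvement),
  # moving from 4 to 6 gives -2 (a decline)
  mutate(change = rank_old - rank_new) %>%
  select(team, change)

rank_changes
```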

  7. Do the economic outlook and population of states where national universities are located differ from those of states where national liberal arts colleges are located?

Submission

Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo. Upload only your PDF document to Gradescope. Before you submit the uploaded document, mark the page where the answer to each exercise appears. If an answer spans multiple pages, mark all corresponding pages. Associate the “Overall” section with the first page.

Rubric