For this assignment you must have at least three meaningful commits and all of your code chunks must have informative names.
For your first commit, update your author name in the YAML header of the template R Markdown file.
All plots should follow the best visualization practices: include an informed title, label axes, and carefully consider aesthetic choices.
All code should follow the tidyverse style guidelines, including not exceeding the 80 character limit.
For every join
function you should explicitly specify the by
argument
Go to the Github Organization page and open your hw02-username
repo
Clone the repository, open a new project in RStudio.
We will work with the tidyverse
package as usual. You may also want to use viridis.
library(tidyverse)
library(viridis)
The U.S. News rankings are an influential, but controversial metric that influences the college application process.
A brief description of the data sets for this lab and how they are related to each other is provided below.
The natunivs
data set contains all schools in the National Universities category ranked 50 or above in the current (2022) rankings. Data on school rankings comes from Andy Reiter, with several blank values filled in by the professor. Observations are uniquely identified by school.
The variables in this data set are:
school
: The name of the college or university.state
: The state in which the college or university is located.rank_2022
: The school’s rank in the 2022 issue.rank_2021
: The school’s rank in the 2021 issue.natuniv_slac
: A variable identifying the type of school.The slacs
data set contains contains all schools in the National Liberal Arts Colleges category ranked 50 or above in the current (2022) rankings. Data on school rankings comes from Andy Reiter. Observations are uniquely identified by school.
The variables in this data set are the same as above:
school
: The name of the college or university.state
: The state in which the college or university is located.rank_2022
: The school’s rank in the 2022 issue.rank_2021
: The school’s rank in the 2021 issue.natuniv_slac
: A variable identifying the type of school.The state_data
data set contains three variables related to the characteristics of a state:
abbrev
: The state’s abbreviation.2020_pop
: The state’s population in the 2020 census.2020economicoutlook
: A state’s economic outlook score (1-50)State economic outlook scores come from richstatespoorstates and are compiled from 15 markers of economic stability, including minimum wage, state tax burden and employment levels. The 2020 population data comes from the US Census.
First, join the slacs
data set to the natunivs
data set. The goal is to combine these data sets in such as a way so that the new data set has the same number of rows and unique columns as each of the individual data sets. Call this new data set full_data
.
Next, use a join to add the columns from the state_data
data set to full_data
.
The final full_data data frame should have 107 observations and 7 variables.
We will use full_data
for the remainder of the assignment. (Note that there are more than 100 observations total due to ties at 50.)
Which states have the most schools in the full_data
data set? Find the number of schools by state. Then, order these states from greatest to least and return the 5 states with the most schools on the list. Report these 5 states.
Which states do not have a school in the full_data
data set? Use the state_data
data set and an appropriate join to help answer this question. Return a data set with two variables, state abbreviations and state population, in order from greatest population to least. Show all code and output, and print the state abbreviation and populations. What is the state with the largest population that does not have a school in the full_data
data set?
Hint: to grab a column name that begins with a number, you need to put the column name in appropriate dashes, e.g.
%>%
data select(`2020pop`)
dplyr
command to create the variable. After recreating the plot, discuss what patterns you observe. Hint: create a new variable to determine if a state is in the top 25 or bottom 25 of economic outlook in 2020. Note that 1-25 is the top ranking while 26-50 is the bottom ranking.full_data
data set? To answer this question, first use the code from exercise 2 to create a new data set called counts
with the the number of schools by state as a column.Then, use a join to add counts
to the full_data
data set and make a scatter plot with a state’s 2020 population as the x-axis variable and the number of schools as the y-axis variable. Add a line of best fit using the method = lm
option. Include an informative title and axes labels.
Finally, describe what you observe. Is there a positive or negative relationship? Do most points fall near the line?
Let’s now focus on North Carolina schools in the full_data
data set. For these schools, create a new variable that indicates the change in ranking from 2021 to 2022, where a positive value indicates an increased ranking (e.g., if a school went from 11 to 10, you would want this variable to have a value of positive 1.) Finally, return a tibble that shows the name of the NC schools and the new variable you created. Discuss what you observe.
Does the economic outlook and populations of states where national universities are located differ from those where national liberal arts colleges are located?
Using the full_data
data set, find the mean population and mean economic outlook ranking for national universities vs national liberal arts colleges.
Report your results in a 2x3 tibble and discuss.
Knit to PDF to create a PDF document. Stage and commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo. Only upload your PDF document to Gradescope. Before you submit the uploaded document, mark where each answer is to the exercises. If any answer spans multiple pages, then mark all corresponding pages. Associate the “Overall” section with the first page.