filter(): pick rows matching criteria
select(): pick columns by name
mutate(): add new variables
slice(): pick rows using indices
arrange(): reorder rows
group_by(): for grouped operations
summarize(): calculate summary statistics
Additional background: click here for slides
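As a quick refresher, the verbs above can be chained with the pipe. The sketch below uses a small made-up tibble (`pets`, with hypothetical columns `name`, `kind`, and `weight`) rather than the lab data, so it only illustrates how the verbs fit together.

```r
library(tidyverse)

# Hypothetical toy data, not part of the lab
pets <- tibble(
  name   = c("Ada", "Bo", "Cy", "Di"),
  kind   = c("cat", "dog", "cat", "dog"),
  weight = c(4, 20, 5, 12)
)

# Row and column verbs: the two heaviest pets that weigh more than 4
heavy <- pets %>%
  filter(weight > 4) %>%              # keep rows matching a condition
  mutate(weight_lb = weight * 2.2) %>% # add a new variable
  select(name, weight_lb) %>%         # keep two columns by name
  arrange(desc(weight_lb)) %>%        # sort rows, largest first
  slice(1:2)                          # keep the first two rows

# Grouped summary: mean weight for each kind of pet
by_kind <- pets %>%
  group_by(kind) %>%
  summarize(mean_weight = mean(weight))

heavy
by_kind
```

Each verb takes a data frame and returns a data frame, which is why the steps can be piped one into the next.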
Find your lab03 repo on the course organization GitHub page
Clone the repo using SSH into RStudio
Open the lab03.Rmd template and update the YAML header with your name and today’s date. Then, knit the document to PDF and make sure the resulting PDF file has the correct date. Stage, commit, and push your changes.
Write your answers in the lab03.Rmd template file. Your assignment should have at least three meaningful commits and all code chunks should have informative names.
We will begin by loading the tidyverse package as usual.
library(tidyverse)
The data we will examine is loaded automatically with the tidyverse. It is called midwest and contains demographic information about midwestern counties.
To begin, familiarize yourself with the dataset by reading the documentation. Remember, you can pull up the documentation by running ?midwest in the console.
Write a pipeline using three dplyr commands (three of the verbs above) that sums the population of all counties within each state and then orders the states from least to greatest in population.
Now would be a good time to knit, commit, and push.
What are the three most populated counties in Wisconsin? Using a single, uninterrupted pipeline, return a 3 X 2 tibble that lists the name of the county and the population of that county, starting with the county with the greatest population in Wisconsin, followed by the second, and then the third most populated.
How does the mean population density of counties within a metropolitan area compare with that of counties not in a metropolitan area? How many counties fall into each group? Return this information using a single, uninterrupted pipeline. (Hint: You will want to begin by using an if_else command to create a new variable that uses words for each group, based on the numerical variable in the dataset.)
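To illustrate the hint on unrelated data, here is a minimal sketch of recoding a 0/1 indicator into words with if_else and then counting rows per group with n(). The tibble `visits` and its columns `is_member` and `spend` are made up for this example and are not in the midwest data.

```r
library(tidyverse)

# Hypothetical toy data: is_member is a 0/1 indicator
visits <- tibble(
  is_member = c(1, 0, 1, 1),
  spend     = c(10, 4, 20, 30)
)

visit_summary <- visits %>%
  # turn the numeric indicator into a readable label
  mutate(member = if_else(is_member == 1, "Member", "Guest")) %>%
  group_by(member) %>%
  # one row per group: the mean of spend and the number of rows
  summarize(mean_spend = mean(spend), n = n())

visit_summary
```

Inside summarize, n() takes no arguments and returns the number of rows in the current group.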
Now might be a good time for another knit, commit, and push.
Which five counties in the Midwest have the highest proportion of people with a college degree (percollege)? Return a 5 X 3 tibble that lists the county name, the state, and the percentage of residents with a college degree for these 5 counties. What do three of these counties have in common that might explain why they are on this list? (Hint: You may want to use Google to answer this question.)
Some county names occur in more than one of these Midwest states. Are there any that occur in all five states? (You can assume that no county name occurs more than once within a state.) Return a tibble with the county name and a count of the number of occurrences (i.e., five) for all county names that occur in all five states.
One more exercise, but first, knit, commit, and push!
Create a segmented bar chart with one bar per state and the fill determined by metro, which indicates whether a county is considered to be in a metro area. Include informative labels and use best practices of data visualization. The y axis of the segmented bar chart should range from 0 to 1. Note that for this exercise you should begin with the code below:
midwest <- midwest %>%
  mutate(metro = ifelse(inmetro == 1, "Yes", "No"))
What do you notice from the plot?
Once you are fully satisfied with your lab, Knit to PDF to create a PDF document.
Follow the instructions in previous labs to submit your PDF to Gradescope.
Be sure to identify which problems are on each page using Gradescope.
Overall: 50 pts.