Lab #03: Data Wrangling

Due January 31 at 11:59 PM

Goals

Getting started

Open the lab03.Rmd template and update the YAML header with your name and today’s date. Then, knit the document to PDF and make sure the resulting PDF file has the correct date. Stage, commit, and push your changes.

Write your answers in the lab03.Rmd template file. Your assignment should have at least three meaningful commits and all code chunks should have informative names.

Packages and Data

We will begin by loading the tidyverse package as usual.

library(tidyverse)

Let’s take a trip to the Midwest

The dataset we will examine is called midwest. It ships with ggplot2, so it is available as soon as the tidyverse is loaded, and it contains demographic information about counties in five midwestern states.

To begin, familiarize yourself with the dataset by reading the documentation. Remember, you can pull up the documentation by running ?midwest in the console.
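Alongside the documentation, a quick look at the data itself can help. The sketch below is one possible way to do this; it assumes only that the tidyverse is installed, and it does not answer any of the exercises.

```r
library(tidyverse)

# midwest ships with ggplot2, which library(tidyverse) attaches
glimpse(midwest)          # variable names, types, and the first few values

dim(midwest)              # number of rows (counties) and columns

distinct(midwest, state)  # the states represented in the data
```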

  1. Which state has the largest population in the 2000 census? Determine this by using an uninterrupted pipeline with three dplyr commands (three of the verbs above) where you sum the population of all counties within each state and then order the states from least to greatest in population.

Now would be a good time to knit, commit, and push.
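If the idea of an uninterrupted pipeline is still unfamiliar, the following sketch shows the general pattern of grouping, summing within groups, and ordering the result. The sales tibble and its column names are made up for illustration and have nothing to do with midwest.

```r
library(tidyverse)

# Toy data (hypothetical, not from midwest): units sold by store and region
sales <- tibble(
  region = c("North", "North", "South", "South", "South"),
  store  = c("A", "B", "C", "D", "E"),
  units  = c(10, 25, 5, 40, 15)
)

# One pipeline, three dplyr verbs, no intermediate objects
region_totals <- sales %>%
  group_by(region) %>%                   # one group per region
  summarise(total_units = sum(units)) %>% # add up units within each region
  arrange(total_units)                   # order from least to greatest

region_totals
```

Each verb takes the output of the previous one, so the result is built in a single chain rather than through saved intermediate tibbles.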

  2. What are the three most populated counties in Wisconsin? Using a single, uninterrupted pipeline, return a 3 x 2 tibble that lists the name of each county and its population, starting with the most populated county in Wisconsin, followed by the second and then the third most populated.

  3. How does the mean population density of counties within a metropolitan area compare with that of counties not in a metropolitan area? How many counties fall into each group? Return this information using a single, uninterrupted pipeline. (Hint: You will want to begin by using an if_else command to create a new variable that labels each group with words, based on the numerical variable in the dataset.)

Now might be a good time for another knit, commit, and push.

  4. Which five counties in the Midwest have the highest proportion of people with a college degree (percollege)? Return a 5 x 3 tibble that lists the county name, the state, and the percentage of residents with a college degree for these five counties. What do three of these counties have in common that might explain why they are on this list? (Hint: You may want to use Google to answer this question.)

  5. Some county names occur in more than one of these Midwest states. Are there any that occur in all five states? (You can assume that no county name occurs more than once within a single state.) Return a tibble with the county name and a count of the number of occurrences (i.e., five) for all county names that occur in all five states.

One more exercise, but first, knit, commit, and push!

  6. Do some states have a higher percentage of their counties located in a metropolitan area?

Create a segmented bar chart with one bar per state, with the fill showing the distribution of metro, that is, whether a county is considered to be in a metro area. Include informative labels and follow best practices for data visualization. The y-axis of the segmented bar chart should range from 0 to 1. Note that for this exercise you should begin with the code below:

midwest <- midwest %>%
  # label counties with inmetro == 1 as "Yes" and all others as "No"
  mutate(metro = ifelse(inmetro == 1, "Yes", "No"))

What do you notice from the plot?
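As a reminder of how a segmented bar chart with a 0 to 1 y-axis can be built, here is a minimal sketch on made-up data. The pets tibble, its columns, and the labels are hypothetical; the only technique shown is recoding a 0/1 indicator and then using position = "fill" in geom_bar().

```r
library(tidyverse)

# Toy data (hypothetical): pets with a 0/1 indicator for indoor pets
pets <- tibble(
  species = c("cat", "cat", "cat", "dog", "dog", "dog", "dog"),
  indoor  = c(1, 1, 0, 1, 0, 0, 0)
) %>%
  # recode the numeric indicator into words
  mutate(indoor_label = if_else(indoor == 1, "Yes", "No"))

# position = "fill" rescales each bar to height 1,
# so the segments show proportions rather than counts
p <- ggplot(pets, aes(x = species, fill = indoor_label)) +
  geom_bar(position = "fill") +
  labs(
    x = "Species",
    y = "Proportion of pets",
    fill = "Indoor pet?",
    title = "Share of indoor pets by species (toy data)"
  )

p
```

The sketch uses dplyr's if_else(), which checks that the two replacement values have compatible types; base R's ifelse(), as in the starter code above, gives the same labels here.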

Once you are fully satisfied with your lab, Knit to PDF to create a PDF document.

Follow the instructions in previous labs to submit your PDF to Gradescope.

Be sure to identify which problems are on each page using Gradescope.

Grading

Overall: 50 pts.