Bulletin

Today

By the end of today you will

Hot Keys

Task / function Windows & Linux macOS
Insert R chunk Ctrl+Alt+I Command+Option+I
Knit document Ctrl+Shift+K Command+Shift+K
Run current line Ctrl+Enter Command+Enter
Run current chunk Ctrl+Shift+Enter Command+Shift+Enter
Run all chunks above Ctrl+Alt+P Command+Option+P
<- Alt + - Option + -
%>% Ctrl+Shift+M Command+Shift+M

Lecture Notes and Exercises

library(tidyverse)
library(sf)

Spatial data is different.*

Our typical “tidy” dataframe.

mpg

A new simple feature object.

nc <- st_read("data/nc_regvoters.shp", quiet = TRUE)
nc
  1. Question: What differences do you observe when comparing a typical tidy data frame to the new simple feature object?

Simple features

A simple feature is a standard, formal way to describe how real-world spatial objects (country, building, tree, road, etc) can be represented by a computer.

The package sf implements simple features and other spatial functionality using tidy principles. Simple features have a geometry type. Common choices are shown in the slides associated with today’s lecture.

Simple features are stored in a data frame, with the geographic information in a column called geometry. Simple features can contain both spatial and non-spatial data.

All functions in the sf package helpfully begin st_.

sf and ggplot

To read simple features from a file or database use the function st_read().

nc <- st_read("data/nc_regvoters.shp", quiet = TRUE)

Notice nc contains both spatial and nonspatial information.

We can build up a visualization layer-by-layer beginning with ggplot. Let’s start by making a basic plot of North Carolina counties.

ggplot(nc) +
  geom_sf() +
  labs(title = "North Carolina counties")

Now adjust the theme with theme_bw().

ggplot(nc) +
  geom_sf() +
  labs(title = "North Carolina counties with theme") + 
  theme_bw()

Now adjust color in geom_sf to change the color of the county borders.

ggplot(nc) +
  geom_sf(color = "darkgreen") +
  labs(title = "North Carolina counties with theme and aesthetics") + 
  theme_bw() 

Then increase the width of the county borders using size.

ggplot(nc) +
  geom_sf(color = "darkgreen", size = 1.5) +
  labs(title = "North Carolina counties with theme and aesthetics") +
  theme_bw()

Fill the counties by specifying a fill argument.

ggplot(nc) +
  geom_sf(color = "darkgreen", size = 1.5, fill = "orange") +
  labs(title = "North Carolina counties with theme and aesthetics") +
  theme_bw()

Finally, adjust the transparency using alpha.

ggplot(nc) +
  geom_sf(color = "darkgreen", size = 1.5, fill = "orange", alpha = 0.50) +
  labs(title = "North Carolina counties with theme and aesthetics") +
  theme_bw()

North Carolina Registered Voters

The nc data was obtained from the NC Board of Elections website and contains statistics on NC registered voters as of September 4, 2021.

The dataset contains the following variables on all North Carolina counties, categories provided by the NCSBE:

  • county: county name
  • dem: total number of voters who are registered Democrats
  • gop: total number of voters who are registered Republicans
  • lib: total number of voters who are registered Libertarians
  • unaf: total number of voters who are unaffiliated
  • white: total number of voters who are white
  • black: total number of voters who are Black
  • ntv_a: total number of voters who are Native American
  • ntv_h: total number of voters who are Native Hawaiian
  • other: total number of voters who are classified as “other” for race
  • hispanic: total number of voters who are Hispanic
  • male: total number of voters who identify as male
  • female: total number of voters who identify as female
    • Please note- these are the only options given by the NCBSE, but male + female do not add up to total since some voters either decide not to disclose or have a different gender identity than these options.
  • total: total number of registered voters in that county
  • geometry: geographic coordinates of the county

Let’s use the NCBSE data to generate a choropleth map of the number of registered voters by county.

ggplot(nc) +
  geom_sf(aes(fill = total)) + 
  labs(title = "Number of Registered Voters by County",
       fill = "# voters") + 
  theme_bw() 

It is sometimes helpful to pick diverging colors, colorbrewer2 can help.

One way to set fill colors is with scale_fill_gradient().

ggplot(nc) +
  geom_sf(aes(fill = total)) +
  scale_fill_gradient(low = "#fee8c8", high = "#7f0000") +
  labs(title = "The Triangle and Charlotte have the Most Voters",
       fill = "# cases") + 
  theme_bw() 

Challenges

  1. Different types of data exist (raster and vector).

  2. The coordinate reference system (CRS) matters.

  3. Manipulating spatial data objects is similar, but not identical to manipulating data frames.

dplyr

The sf package plays nicely with our earlier data wrangling functions from dplyr.

select()

Maybe you are interested in the partisan breakdown of a county.

nc %>% 
  select(county, dem, gop, total)

mutate()

Maybe you are interested in the percentage of registered democrats/republicans in a county.

nc %>% 
  mutate(pct_dem = dem/total,
         pct_rep = gop/total) %>%
  select(pct_dem, pct_rep)

filter()

You could filter for the percentage of democrats being over 50% (a majority).

nc %>% 
  mutate(pct_dem = dem/total) %>%
  filter(pct_dem > 0.5)

summarize()

We can also calculate summary statistics for our new variable.

nc %>% 
  mutate(pct_dem = dem/total) %>%
  summarize(mean_pct_dem = mean(pct_dem),
            min_pct_dem = min(pct_dem),
            max_pct_dem = max(pct_dem))

Geometries are “sticky”. They are kept until deliberately dropped using st_drop_geometry.

nc %>% 
  select(county, total) %>% 
  st_drop_geometry()

Practice

  1. Construct an effective visualization investigating the per county percentage of NC voters that are Native American. Use #f7fbff as “low” on the color gradient and #08306b as “high”. Which county has the highest percentage of Native American voters? (You might want to use Google here.)
# code here
  1. Write a brief research question that you could answer with this dataset and then investigate it here.

  2. What are limitations of your visualizations above?