ae05
, due FridayBy the end of the today you will combine data from different sources using a well-chosen join function.
library(tidyverse)
Instead of working with a single dataset, usually you will have to work with many different related datasets. To answer research questions using related datasets, we need to develop tools to join datasets together.
There are many possible types of joins. All have the format something_join(x, y)
.
inner_join()
: join all rows from x
where there are matching values in y
. Return all combinations in case of multiple matchesleft_join()
: include all rows from x
right_join()
: include all rows from y
full_join()
: include all rows in x
or y
semi_join()
: return all rows from x
with match in y
anti_join()
: return all rows from x
without a match in y
x <- tibble(value = c(100, 200, 300),
xcol = c("x1", "x2", "x3"))
y <- tibble(value = c(100, 200, 400),
ycol = c("y1", "y2", "y4"))
x
## # A tibble: 3 × 2
## value xcol
## <dbl> <chr>
## 1 100 x1
## 2 200 x2
## 3 300 x3
y
## # A tibble: 3 × 2
## value ycol
## <dbl> <chr>
## 1 100 y1
## 2 200 y2
## 3 400 y4
We will demonstrate each of the joins on these small, toy datasets.
inner_join(x, y)
## Joining, by = "value"
## # A tibble: 2 × 3
## value xcol ycol
## <dbl> <chr> <chr>
## 1 100 x1 y1
## 2 200 x2 y2
left_join(x, y)
## Joining, by = "value"
## # A tibble: 3 × 3
## value xcol ycol
## <dbl> <chr> <chr>
## 1 100 x1 y1
## 2 200 x2 y2
## 3 300 x3 <NA>
right_join(x, y)
## Joining, by = "value"
## # A tibble: 3 × 3
## value xcol ycol
## <dbl> <chr> <chr>
## 1 100 x1 y1
## 2 200 x2 y2
## 3 400 <NA> y4
full_join(x, y)
## Joining, by = "value"
## # A tibble: 4 × 3
## value xcol ycol
## <dbl> <chr> <chr>
## 1 100 x1 y1
## 2 200 x2 y2
## 3 300 x3 <NA>
## 4 400 <NA> y4
semi_join(x, y)
## Joining, by = "value"
## # A tibble: 2 × 2
## value xcol
## <dbl> <chr>
## 1 100 x1
## 2 200 x2
anti_join(x, y)
## Joining, by = "value"
## # A tibble: 1 × 2
## value xcol
## <dbl> <chr>
## 1 300 x3
How do the join functions above know to join x
and y
by value
? Examine the names to find out.
names(x)
## [1] "value" "xcol"
names(y)
## [1] "value" "ycol"
We will again work with data from the nycflights13
package.
library(nycflights13)
Examine the documentation for the datasets airports
, flights
, and planes
.
Question: How are these datasets related? Suppose you wanted to make a map of the route of every flight. What variables would you need from which datasets?
Join flights to airports. Note these two datasets have no variables in common so we will have to specify the variable to join by using by =
. Check out the documentation for more information.
flights %>%
left_join(airports, by = c("dest" = "faa"))
## # A tibble: 336,776 × 26
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with 336,766 more rows, and 18 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # name <chr>, lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>,
## # tzone <chr>
Create a new dataset dest_delays
with the median arrival delay for each destination. Note this question does not require you to use joins.
Create a new dataset by joining dest_delays
and airports
. Only include observations that have both delay and airport information. Note dest_delays
and airports
have no variables in common so you will need to specify the variables to join using by
as in the example above.
Question: Are all of the observations in dest_delays
included in the new dataset you created by joining dest_delays
and airports
? Use an appropriate join function to investigate this issue and determine what is going on here.
tailnum
variable in the flights
dataset. The year the plane was manufactured is given in the year
variable in the planes
dataset.Step #1: Start by finding the average arrival delay for each plane and store the resulting dataset in plane_delays
.
Step #2: Join plane_delays
to the planes
data using an appropriate join and then use mutate to create an age variable. Note this data is from 2013.
Step #3: Finally, create an effective visualization of the data.