Lesson 14: Join Data

Introduction

The prior lessons focused on data analysis and manipulation with one dataset at a time. But in most real data analysis projects, you will have multiple datasets that contain different segments of data. This is especially true for data that is sourced from relational databases. Think about all the tables that are part of the data system that tracks people who are incarcerated. There are likely tables relating to individual characteristics like date of birth and gender, tables containing risk score or other evaluation data, tables with information about facility movements, and more. This is also the case for data stored across multiple Excel sheets or workbooks. To analyze this kind of data, you need to combine or join multiple tables to create a useful analytic dataset.

This lesson will introduce joins and demonstrate how to merge two tables with common variables or keys. Joining data is a complex topic and this lesson won’t cover everything, but it will show you the basics of how to combine data frames in R. Joins are a very common technique in querying relational databases, and many of the R functions described in this lesson are inspired by SQL’s join syntax. If you have experience writing SQL queries, the concepts in this lesson will be familiar to you.

The Joins chapter of R for Data Science offers a more in-depth discussion of joins.

For this lesson, you’ll need both the NPS admissions file and the NPS releases file. Read them in to R with the following code:

library(tidyverse)
library(readxl)

nps_admit <- read_csv("data/nps-admissions.csv")
nps_release <- read_excel("data/nps-releases.xlsx")

Keys

An important concept when thinking about joining tables is keys. There are different kinds of keys (primary, foreign, compound, etc.), but the core principle is that there must be some shared element between tables that allows you to connect the two tables. Often, this is a common column or variable such as an ID number, in the case of individual data. Generally, keys are a column or set of columns that uniquely identify an observation.

Data prep

It makes it much easier to join data frames if your data is in a tidy format first. So before attempting any joins, clean up and filter the admissions and releases data.

First, filter the admissions dataset so that it only contains rows for total admissions from California. Next, create a total admissions column by adding male and female admissions. Lastly, keep only the year, state name, and total admissions columns.

ca_admissions <- nps_admit |> 
  filter(adm_type == "adm_total", state_abbr == "CA") |> 
  mutate(n_admissions = m + f) |> 
  select(year, state_name, n_admissions)

ca_admissions

# A tibble: 45 × 3
    year state_name n_admissions
   <dbl> <chr>             <dbl>
 1  1978 California        12419
 2  1979 California        15940
 3  1980 California        14487
 4  1981 California        18024
 5  1982 California        22321
 6  1983 California        27511
 7  1984 California        29681
 8  1985 California        37883
 9  1986 California        48925
10  1987 California        59698
# ℹ 35 more rows

To clean up the releases data, again filter it so that it only contains California data, then group by year and sum the total release column to get the number of releases for men and women combined.

ca_releases <- nps_release |>
  filter(state_abbr == "CA") |> 
  group_by(year) |> 
  summarize(n_releases = sum(rel_total)) |> 
  ungroup()

ca_releases

# A tibble: 45 × 2
    year n_releases
   <dbl>      <dbl>
 1  1978      10207
 2  1979      14230
 3  1980      12483
 4  1981      13375
 5  1982      16651
 6  1983      23069
 7  1984      25923
 8  1985      31256
 9  1986      39526
10  1987      52611
# ℹ 35 more rows

Single key join

Now you’re ready to join these two data frames to create a single data frame with both the number of admissions and the number of releases. To do this, start with the ca_admissions, then use the left_join function to join with the ca_releases dataset, specifying that the column (or key) to use to join is year. Specifying this key ensures that the row with the number of admissions from 1978 is combined with the row with the number of releases from 1978, and so on for each year.

ca_admissions |> 
  left_join(ca_releases, join_by("year"))

# A tibble: 45 × 4
    year state_name n_admissions n_releases
   <dbl> <chr>             <dbl>      <dbl>
 1  1978 California        12419      10207
 2  1979 California        15940      14230
 3  1980 California        14487      12483
 4  1981 California        18024      13375
 5  1982 California        22321      16651
 6  1983 California        27511      23069
 7  1984 California        29681      25923
 8  1985 California        37883      31256
 9  1986 California        48925      39526
10  1987 California        59698      52611
# ℹ 35 more rows

Multiple key join

This worked well for the California data, but what if you wanted to join the admissions and releases datasets for all states? Let’s try it! First, perform some data wrangling to tidy these data frames.

all_admissions <- nps_admit |> 
  filter(adm_type == "adm_total") |> 
  mutate(n_admissions = m + f) |> 
  select(year, state_name, n_admissions)

all_admissions

# A tibble: 2,295 × 3
    year state_name n_admissions
   <dbl> <chr>             <dbl>
 1  1978 Alabama            2815
 2  1979 Alabama            2819
 3  1980 Alabama            3774
 4  1981 Alabama            4025
 5  1982 Alabama            4473
 6  1983 Alabama            4662
 7  1984 Alabama            4755
 8  1985 Alabama            4407
 9  1986 Alabama            4284
10  1987 Alabama            4843
# ℹ 2,285 more rows

all_releases <- nps_release |>
  group_by(year, state_name) |> 
  summarize(n_releases = sum(rel_total)) |> 
  ungroup()

all_releases

# A tibble: 2,295 × 3
    year state_name           n_releases
   <dbl> <chr>                     <dbl>
 1  1978 Alabama                    2984
 2  1978 Alaska                      322
 3  1978 Arizona                    1687
 4  1978 Arkansas                   1908
 5  1978 California                10207
 6  1978 Colorado                   1372
 7  1978 Connecticut                1716
 8  1978 Delaware                    508
 9  1978 District of Columbia       3335
10  1978 Florida                    7741
# ℹ 2,285 more rows

Next, join admissions and releases by year.

all_admissions |> 
  left_join(all_releases, join_by("year"))

Warning in left_join(all_admissions, all_releases, join_by("year")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

# A tibble: 117,045 × 5
    year state_name.x n_admissions state_name.y         n_releases
   <dbl> <chr>               <dbl> <chr>                     <dbl>
 1  1978 Alabama              2815 Alabama                    2984
 2  1978 Alabama              2815 Alaska                      322
 3  1978 Alabama              2815 Arizona                    1687
 4  1978 Alabama              2815 Arkansas                   1908
 5  1978 Alabama              2815 California                10207
 6  1978 Alabama              2815 Colorado                   1372
 7  1978 Alabama              2815 Connecticut                1716
 8  1978 Alabama              2815 Delaware                    508
 9  1978 Alabama              2815 District of Columbia       3335
10  1978 Alabama              2815 Florida                    7741
# ℹ 117,035 more rows

Oops! What just happened there? Look at the warning message that indicates that there are multiple row matches between the two tables. Next, notice the number of rows in the joined dataset: 117,045! That’s a lot more than the 2,295 rows in the admissions and releases data frames. Finally, see that there are 2 state name columns in the joined data frame (state_name.x and state_name.y). All of this points to an issue with the join that is important to examine.

The problem is that the code above only specifies the year column as the join. So, every 1978 row in the admissions data was matched to every 1978 row in the releases data. Look at the second row in the joined data above: the number of admissions in Alabama in 1978 is in the same row as the number of releases in Alaska in 1978.

To fix this problem, the join needs to be made on two columns: year and state_name. This ensures that rows from each table match both the year and state as they are combined.

all_admissions |> 
  left_join(all_releases, join_by("year", "state_name"))

# A tibble: 2,295 × 4
    year state_name n_admissions n_releases
   <dbl> <chr>             <dbl>      <dbl>
 1  1978 Alabama            2815       2984
 2  1979 Alabama            2819       3005
 3  1980 Alabama            3774       3452
 4  1981 Alabama            4025       3194
 5  1982 Alabama            4473       3091
 6  1983 Alabama            4662       3602
 7  1984 Alabama            4755       4150
 8  1985 Alabama            4407       3904
 9  1986 Alabama            4284       3529
10  1987 Alabama            4843       3745
# ℹ 2,285 more rows

Now, there is no warning about multiple matches, the resulting data frame has the same number of rows as the admissions and releases data frames, and there are no extra columns. This join was successful!

Types of joins

In the examples above, there were the same number of rows in each of the datasets to be joined, and each row in nps_admit matched to a single row in nps_release. But this isn’t always the case with real-life data. Consider trying to join the all_releases data frame with the ca_admissions data frame:

all_releases |>
  left_join(ca_admissions, join_by("year", "state_name"))

# A tibble: 2,295 × 4
    year state_name           n_releases n_admissions
   <dbl> <chr>                     <dbl>        <dbl>
 1  1978 Alabama                    2984           NA
 2  1978 Alaska                      322           NA
 3  1978 Arizona                    1687           NA
 4  1978 Arkansas                   1908           NA
 5  1978 California                10207        12419
 6  1978 Colorado                   1372           NA
 7  1978 Connecticut                1716           NA
 8  1978 Delaware                    508           NA
 9  1978 District of Columbia       3335           NA
10  1978 Florida                    7741           NA
# ℹ 2,285 more rows

Notice that there are rows with missing or NA values in the n_admissions column. In fact, the only rows with non-missing data are California rows. This makes sense, because the admissions data used in the join is the filtered ca_admissions dataset.

To understand why this is happening and how to adjust your code to handle joining data frames with different numbers of observations, the next sections discuss different types of joins.

Left join

In a left join, all the rows in the first (or left) data frame are retained, and the rows in the second (or right) data frame that match the first data frame are joined. If there are rows in the first data frame with no match in the second data frame, NA or missing values will be added in the joined columns. Additionally, if there are rows in the second data frame that don’t match to a key in the first data frame, they are not included in the joined data frame.

This animation shows a left join:

Inner join

In an inner join, only rows where there are matching columns in both the first and second data frame are retained. If there are rows in the first data frame with no match in the second data frame, they are discarded. Similarly, if there are rows in the second data frame with no match in the first data frame, they are also discarded.

This animation shows an inner join:

To perform an inner join in R, use the same code as above, but replace left_join with inner_join. In this case, the resulting joined data frame only has rows for California, even though the first data frame (all_releases) has rows for all states. Because the second data frame (ca_admissions) only has California rows, all non-matching rows from the first data frame are dropped.

all_releases |>
  inner_join(ca_admissions, join_by("year", "state_name"))

# A tibble: 45 × 4
    year state_name n_releases n_admissions
   <dbl> <chr>           <dbl>        <dbl>
 1  1978 California      10207        12419
 2  1979 California      14230        15940
 3  1980 California      12483        14487
 4  1981 California      13375        18024
 5  1982 California      16651        22321
 6  1983 California      23069        27511
 7  1984 California      25923        29681
 8  1985 California      31256        37883
 9  1986 California      39526        48925
10  1987 California      52611        59698
# ℹ 35 more rows

Full join

A full join keeps all rows and all columns from both the first and second data frames, regardless of matching. In rows where there are no matching values, NA is returned, similar to a left join.

This animation shows a full join:

Why join?

Often, to do any useful analysis, you’ll need to join multiple data frames. By looking at admissions and releases separately in prior lessons in this course, you missed out on some interesting trends. Now that you’re able to join data, one way to show these population movements would be to plot admissions and releases on the same chart.

To make this plot, start by joining California admissions and releases and then pivot the data frame so that admissions and releases become rows instead of columns.

ca_adm_rel_tidy <- ca_admissions |> 
  left_join(ca_releases, join_by("year")) |> 
  pivot_longer(starts_with("n"), names_to = "movement", values_to = "n")

From there, make a dodged bar chart with year on the x-axis, n on the y-axis, and the type of movement (admissions or release) as the fill. To make a chart where the bars are next to one another (rather than stacked) use position_dodge().

ca_adm_rel_tidy |>
  ggplot(aes(x = year, y = n, fill = movement)) +
  geom_col(position = position_dodge()) +
  scale_fill_manual(
    values = c("coral", "steelblue"),
    labels = c("Admissions", "Releases")
    ) +
  scale_y_continuous(labels = scales::label_comma()) +
  labs(
    title = "California Prison Admissions and Releases",
    caption = "BJS National Prisoner Statistics",
    x = NULL,
    y = NULL,
    fill = NULL
    ) +
  theme_minimal()

In California, more people were admitted than released from 1978 through the mid 2000s. Starting in 2008, more people were released than admitted in most years, and there was a huge drop in both admissions and releases starting in 2012. A big driver of this change was the prison realignment reform, which shifted responsibility from the state to the county for people convicted of low-level offenses.

The response to the COVID-19 pandemic is also apparent from this chart, where in 2020 far more people were released from prison in California than were admitted.

Resources

R for Data Science (2e), Chapter 19 Joins: https://r4ds.hadley.nz/joins.html
dplyr Two-table verbs vignette: https://dplyr.tidyverse.org/articles/two-table.html