Lesson 12: Summarize Data

Introduction

So far, you’ve learned how to remove and sort rows of a data frame and rename, remove, and create columns of a data frame. This lesson covers another key data frame manipulation: aggregation or summarization of rows of a data frame. Aggregation is important because it allows you to calculate summary statistics and otherwise reduce or combine observations.

The examples you’ll work on in this lesson use the same nps_release data frame that you’ve worked with in previous lessons. Read in nps-releases.xlsx with the following code:

library(tidyverse)
library(readxl)

nps_release <- read_excel("data/nps-releases.xlsx")

Next, create a subset of the dataset—releases from Nevada during the 2010s:

nv_2010s <- nps_release |> 
  filter(state_abbr == "NV", between(year, 2010, 2019))

A nice trick you can use to filter for a range of years (or any numerical variable!) is the between() function. The parameters for between() are the variable name, the lower (or left) bound, and the upper (or right) bound. Note that between() includes both the upper and lower bound, so in this case 2010 and 2019 will be included in the filtered dataset.

nv_2010s
# A tibble: 20 × 8
    year state_name state_abbr sex   rel_total rel_uncond rel_cond rel_oth
   <dbl> <chr>      <chr>      <chr>     <dbl>      <dbl>    <dbl>   <dbl>
 1  2010 Nevada     NV         m          5248       1825     3382      41
 2  2010 Nevada     NV         f           788        200      587       1
 3  2011 Nevada     NV         m          5161       1983     3141      37
 4  2011 Nevada     NV         f           749        159      589       1
 5  2012 Nevada     NV         m          4919       1864     2952      35
 6  2012 Nevada     NV         f           704        150      554       1
 7  2013 Nevada     NV         m          4919       1864     2952      35
 8  2013 Nevada     NV         f           704        150      554       1
 9  2014 Nevada     NV         m          5041       1901     2745     395
10  2014 Nevada     NV         f           797        206      585       6
11  2015 Nevada     NV         m          5041       1901     2745     395
12  2015 Nevada     NV         f           797        206      585       6
13  2016 Nevada     NV         m          4963       2236     2684      43
14  2016 Nevada     NV         f           815        289      525       1
15  2017 Nevada     NV         m          5472       2154     3273      45
16  2017 Nevada     NV         f          1076        247      827       2
17  2018 Nevada     NV         m          5431       1818     3568      45
18  2018 Nevada     NV         f          1003        231      771       1
19  2019 Nevada     NV         m          5678       1928     3707      43
20  2019 Nevada     NV         f          1011        253      758       0

summarize()

The main function to aggregate or reduce rows in a data frame is summarize(). Summarize is different from other functions that operate on rows like filter() and arrange(), since it changes the data in a data frame rather than removing or sorting existing rows. In its simplest form, summarize() can be used to calculate summary statistics for a data frame.

The British spelling, summarise(), works exactly the same as summarize().

As an example, start with the Nevada releases data frame from above and calculate the number of total releases throughout the 2010s.

nv_2010s |> 
  summarize(total_releases = sum(rel_total))
# A tibble: 1 × 1
  total_releases
           <dbl>
1          60317

In this example, you created a new variable called total_releases that is the sum of the rel_total column. Between 2010 and 2019 in Nevada, 60,317 people were released from prison.

sum() is not the only function you can use within summarize—any function that reduces a set of values to a single value will work. For instance, you can calculate the average number of total releases per year using the mean() function.

nv_2010s |> 
  summarize(total_releases_avg = mean(rel_total))
# A tibble: 1 × 1
  total_releases_avg
               <dbl>
1              3016.

You can also calculate more than one summary statistic in a single summarize() call. Here, for example, sum the number of total releases and number of unconditional releases.

nv_2010s |> 
  summarize(
    total_releases = sum(rel_total),
    uncond_releases = sum(rel_uncond)
    )
# A tibble: 1 × 2
  total_releases uncond_releases
           <dbl>           <dbl>
1          60317           21565

group_by()

While summarize() is useful on its own, it is often used in combination with the group_by() function to calculate summary statistics by group. Recall that the releases dataset contains a sex column, which separates the number of releases by men and women. If you take the same summarization code from above and add group_by(sex) before summarize(), you get the sum of total releases and unconditional releases, but with one row for women and one row for men.

nv_2010s |> 
  group_by(sex) |> 
  summarize(
    total_releases = sum(rel_total),
    uncond_releases = sum(rel_uncond)
    )
# A tibble: 2 × 3
  sex   total_releases uncond_releases
  <chr>          <dbl>           <dbl>
1 f               8444            2091
2 m              51873           19474

It’s useful to think through what happens here when you add group_by(sex). The underlying data frame isn’t changed, but rather the rows are grouped by sex and then the summary operation is performed on all the rows where sex is f and then on all the rows where sex is m.

year state_name state_abbr sex rel_total rel_uncond rel_cond rel_oth
2010 Nevada NV m 5248 1825 3382 41
2010 Nevada NV f 788 200 587 1
2011 Nevada NV m 5161 1983 3141 37
2011 Nevada NV f 749 159 589 1
2012 Nevada NV m 4919 1864 2952 35
2012 Nevada NV f 704 150 554 1
2013 Nevada NV m 4919 1864 2952 35
2013 Nevada NV f 704 150 554 1
2014 Nevada NV m 5041 1901 2745 395
2014 Nevada NV f 797 206 585 6
2015 Nevada NV m 5041 1901 2745 395
2015 Nevada NV f 797 206 585 6
2016 Nevada NV m 4963 2236 2684 43
2016 Nevada NV f 815 289 525 1
2017 Nevada NV m 5472 2154 3273 45
2017 Nevada NV f 1076 247 827 2
2018 Nevada NV m 5431 1818 3568 45
2018 Nevada NV f 1003 231 771 1
2019 Nevada NV m 5678 1928 3707 43
2019 Nevada NV f 1011 253 758 0

This color-coded table shows the two groups in the original dataset that are then summed. Adding all the green rows gives you the total number of women that were released, and adding all the orange rows gives you the total number of men that were released.

sex total_releases uncond_releases
f 8444 2091
m 51873 19474

Using this same setup, you can calculate the number of releases by year rather than by sex if you change the grouping variable from sex to year.

nv_2010s |> 
  group_by(year) |> 
  summarize(
    total_releases = sum(rel_total),
    uncond_releases = sum(rel_uncond)
    )
# A tibble: 10 × 3
    year total_releases uncond_releases
   <dbl>          <dbl>           <dbl>
 1  2010           6036            2025
 2  2011           5910            2142
 3  2012           5623            2014
 4  2013           5623            2014
 5  2014           5838            2107
 6  2015           5838            2107
 7  2016           5778            2525
 8  2017           6548            2401
 9  2018           6434            2049
10  2019           6689            2181

Multiple grouping columns

You can also group by more than one column to create additional groups. In the original nps_release dataset, there are rows for each combination of year, state, and sex. To find the total number of releases in the 2010s by state and sex, add both state_name and sex as grouping variables.

nps_release |> 
  filter(between(year, 2010, 2019)) |> 
  group_by(state_name, sex) |> 
  summarize(
    total_releases = sum(rel_total),
    uncond_releases = sum(rel_uncond),
    ) |>  
  ungroup()
# A tibble: 102 × 4
   state_name sex   total_releases uncond_releases
   <chr>      <chr>          <dbl>           <dbl>
 1 Alabama    f              18381            5184
 2 Alabama    m             110721           31829
 3 Alaska     f               5314            1149
 4 Alaska     m              24387            4730
 5 Arizona    f              15902            2085
 6 Arizona    m             119039           19893
 7 Arkansas   f              10643             483
 8 Arkansas   m              73592            5258
 9 California f              43385            4637
10 California m             495128           40595
# ℹ 92 more rows

The resulting data frame has 102 rows, or one row for each combination of state and sex (51 states x 2 sexes).

Note that the code above includes ungroup() at the end of the code chunk. This is important because if you don’t ungroup a grouped data frame, the next functions you run will be on each group of the data frame rather than the whole. Generally, it’s good practice to ungroup() after every group_by()/summarize() statement.

To do further exploratory data analysis with this example, you can calculate the percentage of total releases that were unconditional and then find which states released the most people from prison without conditions.

nps_release |> 
  filter(between(year, 2010, 2019)) |> 
  group_by(state_name, sex) |> 
  summarize(
    total_releases = sum(rel_total),
    uncond_releases = sum(rel_uncond),
    ) |>  
  ungroup() |> 
  mutate(pct_uncond = uncond_releases / total_releases) |> 
  arrange(desc(pct_uncond))
# A tibble: 102 × 5
   state_name    sex   total_releases uncond_releases pct_uncond
   <chr>         <chr>          <dbl>           <dbl>      <dbl>
 1 Massachusetts m              23108           16840      0.729
 2 Florida       f              37341           25115      0.673
 3 Rhode Island  m               8601            5677      0.660
 4 Florida       m             283965          176464      0.621
 5 Rhode Island  f                550             327      0.595
 6 Ohio          f              29563           16789      0.568
 7 Massachusetts f               3810            2149      0.564
 8 Maine         m               8135            4531      0.557
 9 New Jersey    m              96151           51509      0.536
10 Connecticut   m              53682           26782      0.499
# ℹ 92 more rows

Between 2010 and 2019, nearly 73 percent of men released from prison in Massachusetts were released without conditions.

Resources