Lesson 12: Summarize Data

Introduction

So far, you’ve learned how to remove and sort rows of a data frame and rename, remove, and create columns of a data frame. This lesson covers another key data frame manipulation: aggregation or summarization of rows of a data frame. Aggregation is important because it allows you to calculate summary statistics and otherwise reduce or combine observations.

The examples you’ll work on in this lesson use the same nps_release data frame that you’ve worked with in previous lessons. Read in nps-releases.xlsx with the following code:

library(tidyverse)
library(readxl)

nps_release <- read_excel("data/nps-releases.xlsx")

Next, create a subset of the dataset—releases from Nevada during the 2010s:

nv_2010s <- nps_release |> 
  filter(state_abbr == "NV", between(year, 2010, 2019))

A nice trick you can use to filter for a range of years (or any numerical variable!) is the between() function. The parameters for between() are the variable name, the lower (or left) bound, and the upper (or right) bound. Note that between() includes both the upper and lower bound, so in this case 2010 and 2019 will be included in the filtered dataset.

nv_2010s

# A tibble: 20 × 8
    year state_name state_abbr sex   rel_total rel_uncond rel_cond rel_oth
   <dbl> <chr>      <chr>      <chr>     <dbl>      <dbl>    <dbl>   <dbl>
 1  2010 Nevada     NV         m          5248       1825     3382      41
 2  2010 Nevada     NV         f           788        200      587       1
 3  2011 Nevada     NV         m          5161       1983     3141      37
 4  2011 Nevada     NV         f           749        159      589       1
 5  2012 Nevada     NV         m          4919       1864     2952      35
 6  2012 Nevada     NV         f           704        150      554       1
 7  2013 Nevada     NV         m          4919       1864     2952      35
 8  2013 Nevada     NV         f           704        150      554       1
 9  2014 Nevada     NV         m          5041       1901     2745     395
10  2014 Nevada     NV         f           797        206      585       6
11  2015 Nevada     NV         m          5041       1901     2745     395
12  2015 Nevada     NV         f           797        206      585       6
13  2016 Nevada     NV         m          4963       2236     2684      43
14  2016 Nevada     NV         f           815        289      525       1
15  2017 Nevada     NV         m          5472       2154     3273      45
16  2017 Nevada     NV         f          1076        247      827       2
17  2018 Nevada     NV         m          5431       1818     3568      45
18  2018 Nevada     NV         f          1003        231      771       1
19  2019 Nevada     NV         m          5678       1928     3707      43
20  2019 Nevada     NV         f          1011        253      758       0

`summarize()`

The main function to aggregate or reduce rows in a data frame is summarize(). Summarize is different from other functions that operate on rows like filter() and arrange(), since it changes the data in a data frame rather than removing or sorting existing rows. In its simplest form, summarize() can be used to calculate summary statistics for a data frame.

The British spelling, summarise(), works exactly the same as summarize().

As an example, start with the Nevada releases data frame from above and calculate the number of total releases throughout the 2010s.

nv_2010s |> 
  summarize(total_releases = sum(rel_total))

# A tibble: 1 × 1
  total_releases
           <dbl>
1          60317

In this example, you created a new variable called total_releases that is the sum of the rel_total column. Between 2010 and 2019 in Nevada, 60,317 people were released from prison.

sum() is not the only function you can use within summarize—any function that reduces a set of values to a single value will work. For instance, you can calculate the average number of total releases per year using the mean() function.

nv_2010s |> 
  summarize(total_releases_avg = mean(rel_total))

# A tibble: 1 × 1
  total_releases_avg
               <dbl>
1              3016.

You can also calculate more than one summary statistic in a single summarize() call. Here, for example, sum the number of total releases and number of unconditional releases.

nv_2010s |> 
  summarize(
    total_releases = sum(rel_total),
    uncond_releases = sum(rel_uncond)
    )

# A tibble: 1 × 2
  total_releases uncond_releases
           <dbl>           <dbl>
1          60317           21565

`group_by()`

While summarize() is useful on its own, it is often used in combination with the group_by() function to calculate summary statistics by group. Recall that the releases dataset contains a sex column, which separates the number of releases by men and women. If you take the same summarization code from above and add group_by(sex) before summarize(), you get the sum of total releases and unconditional releases, but with one row for women and one row for men.

nv_2010s |> 
  group_by(sex) |> 
  summarize(
    total_releases = sum(rel_total),
    uncond_releases = sum(rel_uncond)
    )

# A tibble: 2 × 3
  sex   total_releases uncond_releases
  <chr>          <dbl>           <dbl>
1 f               8444            2091
2 m              51873           19474

It’s useful to think through what happens here when you add group_by(sex). The underlying data frame isn’t changed, but rather the rows are grouped by sex and then the summary operation is performed on all the rows where sex is f and then on all the rows where sex is m.

year	state_name	state_abbr	sex	rel_total	rel_uncond	rel_cond	rel_oth
2010	Nevada	NV	m	5248	1825	3382	41
2010	Nevada	NV	f	788	200	587	1
2011	Nevada	NV	m	5161	1983	3141	37
2011	Nevada	NV	f	749	159	589	1
2012	Nevada	NV	m	4919	1864	2952	35
2012	Nevada	NV	f	704	150	554	1
2013	Nevada	NV	m	4919	1864	2952	35
2013	Nevada	NV	f	704	150	554	1
2014	Nevada	NV	m	5041	1901	2745	395
2014	Nevada	NV	f	797	206	585	6
2015	Nevada	NV	m	5041	1901	2745	395
2015	Nevada	NV	f	797	206	585	6
2016	Nevada	NV	m	4963	2236	2684	43
2016	Nevada	NV	f	815	289	525	1
2017	Nevada	NV	m	5472	2154	3273	45
2017	Nevada	NV	f	1076	247	827	2
2018	Nevada	NV	m	5431	1818	3568	45
2018	Nevada	NV	f	1003	231	771	1
2019	Nevada	NV	m	5678	1928	3707	43
2019	Nevada	NV	f	1011	253	758	0

This color-coded table shows the two groups in the original dataset that are then summed. Adding all the green rows gives you the total number of women that were released, and adding all the orange rows gives you the total number of men that were released.

sex	total_releases	uncond_releases
f	8444	2091
m	51873	19474

Using this same setup, you can calculate the number of releases by year rather than by sex if you change the grouping variable from sex to year.

nv_2010s |> 
  group_by(year) |> 
  summarize(
    total_releases = sum(rel_total),
    uncond_releases = sum(rel_uncond)
    )

# A tibble: 10 × 3
    year total_releases uncond_releases
   <dbl>          <dbl>           <dbl>
 1  2010           6036            2025
 2  2011           5910            2142
 3  2012           5623            2014
 4  2013           5623            2014
 5  2014           5838            2107
 6  2015           5838            2107
 7  2016           5778            2525
 8  2017           6548            2401
 9  2018           6434            2049
10  2019           6689            2181

Multiple grouping columns

You can also group by more than one column to create additional groups. In the original nps_release dataset, there are rows for each combination of year, state, and sex. To find the total number of releases in the 2010s by state and sex, add both state_name and sex as grouping variables.

nps_release |> 
  filter(between(year, 2010, 2019)) |> 
  group_by(state_name, sex) |> 
  summarize(
    total_releases = sum(rel_total),
    uncond_releases = sum(rel_uncond),
    ) |>  
  ungroup()

# A tibble: 102 × 4
   state_name sex   total_releases uncond_releases
   <chr>      <chr>          <dbl>           <dbl>
 1 Alabama    f              18381            5184
 2 Alabama    m             110721           31829
 3 Alaska     f               5314            1149
 4 Alaska     m              24387            4730
 5 Arizona    f              15902            2085
 6 Arizona    m             119039           19893
 7 Arkansas   f              10643             483
 8 Arkansas   m              73592            5258
 9 California f              43385            4637
10 California m             495128           40595
# ℹ 92 more rows

The resulting data frame has 102 rows, or one row for each combination of state and sex (51 states x 2 sexes).

Note that the code above includes ungroup() at the end of the code chunk. This is important because if you don’t ungroup a grouped data frame, the next functions you run will be on each group of the data frame rather than the whole. Generally, it’s good practice to ungroup() after every group_by()/summarize() statement.

To do further exploratory data analysis with this example, you can calculate the percentage of total releases that were unconditional and then find which states released the most people from prison without conditions.

nps_release |> 
  filter(between(year, 2010, 2019)) |> 
  group_by(state_name, sex) |> 
  summarize(
    total_releases = sum(rel_total),
    uncond_releases = sum(rel_uncond),
    ) |>  
  ungroup() |> 
  mutate(pct_uncond = uncond_releases / total_releases) |> 
  arrange(desc(pct_uncond))

# A tibble: 102 × 5
   state_name    sex   total_releases uncond_releases pct_uncond
   <chr>         <chr>          <dbl>           <dbl>      <dbl>
 1 Massachusetts m              23108           16840      0.729
 2 Florida       f              37341           25115      0.673
 3 Rhode Island  m               8601            5677      0.660
 4 Florida       m             283965          176464      0.621
 5 Rhode Island  f                550             327      0.595
 6 Ohio          f              29563           16789      0.568
 7 Massachusetts f               3810            2149      0.564
 8 Maine         m               8135            4531      0.557
 9 New Jersey    m              96151           51509      0.536
10 Connecticut   m              53682           26782      0.499
# ℹ 92 more rows

Between 2010 and 2019, nearly 73 percent of men released from prison in Massachusetts were released without conditions.

Resources

R for Data Science (2e), Chapter 3 Data Transformation: https://r4ds.hadley.nz/data-transform#groups
Statistical Inference via Data Science: A ModernDive into R and the Tidyverse!, Second Edition, Chapter 3: https://moderndive.com/3-wrangling.html#summarize