library(tidyverse)
library(readxl)
nps_release <- read_excel("data/nps-releases.xlsx")Lesson 12: Summarize Data
Introduction
So far, you’ve learned how to remove and sort rows of a data frame and rename, remove, and create columns of a data frame. This lesson covers another key data frame manipulation: aggregation or summarization of rows of a data frame. Aggregation is important because it allows you to calculate summary statistics and otherwise reduce or combine observations.
The examples you’ll work on in this lesson use the same nps_release data frame that you’ve worked with in previous lessons. Read in nps-releases.xlsx with the following code:
Next, create a subset of the dataset—releases from Nevada during the 2010s:
nv_2010s <- nps_release |>
filter(state_abbr == "NV", between(year, 2010, 2019))A nice trick you can use to filter for a range of years (or any numerical variable!) is the between() function. The parameters for between() are the variable name, the lower (or left) bound, and the upper (or right) bound. Note that between() includes both the upper and lower bound, so in this case 2010 and 2019 will be included in the filtered dataset.
nv_2010s# A tibble: 20 × 8
year state_name state_abbr sex rel_total rel_uncond rel_cond rel_oth
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 2010 Nevada NV m 5248 1825 3382 41
2 2010 Nevada NV f 788 200 587 1
3 2011 Nevada NV m 5161 1983 3141 37
4 2011 Nevada NV f 749 159 589 1
5 2012 Nevada NV m 4919 1864 2952 35
6 2012 Nevada NV f 704 150 554 1
7 2013 Nevada NV m 4919 1864 2952 35
8 2013 Nevada NV f 704 150 554 1
9 2014 Nevada NV m 5041 1901 2745 395
10 2014 Nevada NV f 797 206 585 6
11 2015 Nevada NV m 5041 1901 2745 395
12 2015 Nevada NV f 797 206 585 6
13 2016 Nevada NV m 4963 2236 2684 43
14 2016 Nevada NV f 815 289 525 1
15 2017 Nevada NV m 5472 2154 3273 45
16 2017 Nevada NV f 1076 247 827 2
17 2018 Nevada NV m 5431 1818 3568 45
18 2018 Nevada NV f 1003 231 771 1
19 2019 Nevada NV m 5678 1928 3707 43
20 2019 Nevada NV f 1011 253 758 0
summarize()
The main function to aggregate or reduce rows in a data frame is summarize(). Summarize is different from other functions that operate on rows like filter() and arrange(), since it changes the data in a data frame rather than removing or sorting existing rows. In its simplest form, summarize() can be used to calculate summary statistics for a data frame.
The British spelling, summarise(), works exactly the same as summarize().
As an example, start with the Nevada releases data frame from above and calculate the number of total releases throughout the 2010s.
nv_2010s |>
summarize(total_releases = sum(rel_total))# A tibble: 1 × 1
total_releases
<dbl>
1 60317
In this example, you created a new variable called total_releases that is the sum of the rel_total column. Between 2010 and 2019 in Nevada, 60,317 people were released from prison.
sum() is not the only function you can use within summarize—any function that reduces a set of values to a single value will work. For instance, you can calculate the average number of total releases per year using the mean() function.
nv_2010s |>
summarize(total_releases_avg = mean(rel_total))# A tibble: 1 × 1
total_releases_avg
<dbl>
1 3016.
You can also calculate more than one summary statistic in a single summarize() call. Here, for example, sum the number of total releases and number of unconditional releases.
nv_2010s |>
summarize(
total_releases = sum(rel_total),
uncond_releases = sum(rel_uncond)
)# A tibble: 1 × 2
total_releases uncond_releases
<dbl> <dbl>
1 60317 21565
group_by()
While summarize() is useful on its own, it is often used in combination with the group_by() function to calculate summary statistics by group. Recall that the releases dataset contains a sex column, which separates the number of releases by men and women. If you take the same summarization code from above and add group_by(sex) before summarize(), you get the sum of total releases and unconditional releases, but with one row for women and one row for men.
nv_2010s |>
group_by(sex) |>
summarize(
total_releases = sum(rel_total),
uncond_releases = sum(rel_uncond)
)# A tibble: 2 × 3
sex total_releases uncond_releases
<chr> <dbl> <dbl>
1 f 8444 2091
2 m 51873 19474
It’s useful to think through what happens here when you add group_by(sex). The underlying data frame isn’t changed, but rather the rows are grouped by sex and then the summary operation is performed on all the rows where sex is f and then on all the rows where sex is m.
| year | state_name | state_abbr | sex | rel_total | rel_uncond | rel_cond | rel_oth |
|---|---|---|---|---|---|---|---|
| 2010 | Nevada | NV | m | 5248 | 1825 | 3382 | 41 |
| 2010 | Nevada | NV | f | 788 | 200 | 587 | 1 |
| 2011 | Nevada | NV | m | 5161 | 1983 | 3141 | 37 |
| 2011 | Nevada | NV | f | 749 | 159 | 589 | 1 |
| 2012 | Nevada | NV | m | 4919 | 1864 | 2952 | 35 |
| 2012 | Nevada | NV | f | 704 | 150 | 554 | 1 |
| 2013 | Nevada | NV | m | 4919 | 1864 | 2952 | 35 |
| 2013 | Nevada | NV | f | 704 | 150 | 554 | 1 |
| 2014 | Nevada | NV | m | 5041 | 1901 | 2745 | 395 |
| 2014 | Nevada | NV | f | 797 | 206 | 585 | 6 |
| 2015 | Nevada | NV | m | 5041 | 1901 | 2745 | 395 |
| 2015 | Nevada | NV | f | 797 | 206 | 585 | 6 |
| 2016 | Nevada | NV | m | 4963 | 2236 | 2684 | 43 |
| 2016 | Nevada | NV | f | 815 | 289 | 525 | 1 |
| 2017 | Nevada | NV | m | 5472 | 2154 | 3273 | 45 |
| 2017 | Nevada | NV | f | 1076 | 247 | 827 | 2 |
| 2018 | Nevada | NV | m | 5431 | 1818 | 3568 | 45 |
| 2018 | Nevada | NV | f | 1003 | 231 | 771 | 1 |
| 2019 | Nevada | NV | m | 5678 | 1928 | 3707 | 43 |
| 2019 | Nevada | NV | f | 1011 | 253 | 758 | 0 |
This color-coded table shows the two groups in the original dataset that are then summed. Adding all the green rows gives you the total number of women that were released, and adding all the orange rows gives you the total number of men that were released.
| sex | total_releases | uncond_releases |
|---|---|---|
| f | 8444 | 2091 |
| m | 51873 | 19474 |
Using this same setup, you can calculate the number of releases by year rather than by sex if you change the grouping variable from sex to year.
nv_2010s |>
group_by(year) |>
summarize(
total_releases = sum(rel_total),
uncond_releases = sum(rel_uncond)
)# A tibble: 10 × 3
year total_releases uncond_releases
<dbl> <dbl> <dbl>
1 2010 6036 2025
2 2011 5910 2142
3 2012 5623 2014
4 2013 5623 2014
5 2014 5838 2107
6 2015 5838 2107
7 2016 5778 2525
8 2017 6548 2401
9 2018 6434 2049
10 2019 6689 2181
Multiple grouping columns
You can also group by more than one column to create additional groups. In the original nps_release dataset, there are rows for each combination of year, state, and sex. To find the total number of releases in the 2010s by state and sex, add both state_name and sex as grouping variables.
nps_release |>
filter(between(year, 2010, 2019)) |>
group_by(state_name, sex) |>
summarize(
total_releases = sum(rel_total),
uncond_releases = sum(rel_uncond),
) |>
ungroup()# A tibble: 102 × 4
state_name sex total_releases uncond_releases
<chr> <chr> <dbl> <dbl>
1 Alabama f 18381 5184
2 Alabama m 110721 31829
3 Alaska f 5314 1149
4 Alaska m 24387 4730
5 Arizona f 15902 2085
6 Arizona m 119039 19893
7 Arkansas f 10643 483
8 Arkansas m 73592 5258
9 California f 43385 4637
10 California m 495128 40595
# ℹ 92 more rows
The resulting data frame has 102 rows, or one row for each combination of state and sex (51 states x 2 sexes).
Note that the code above includes ungroup() at the end of the code chunk. This is important because if you don’t ungroup a grouped data frame, the next functions you run will be on each group of the data frame rather than the whole. Generally, it’s good practice to ungroup() after every group_by()/summarize() statement.
To do further exploratory data analysis with this example, you can calculate the percentage of total releases that were unconditional and then find which states released the most people from prison without conditions.
nps_release |>
filter(between(year, 2010, 2019)) |>
group_by(state_name, sex) |>
summarize(
total_releases = sum(rel_total),
uncond_releases = sum(rel_uncond),
) |>
ungroup() |>
mutate(pct_uncond = uncond_releases / total_releases) |>
arrange(desc(pct_uncond))# A tibble: 102 × 5
state_name sex total_releases uncond_releases pct_uncond
<chr> <chr> <dbl> <dbl> <dbl>
1 Massachusetts m 23108 16840 0.729
2 Florida f 37341 25115 0.673
3 Rhode Island m 8601 5677 0.660
4 Florida m 283965 176464 0.621
5 Rhode Island f 550 327 0.595
6 Ohio f 29563 16789 0.568
7 Massachusetts f 3810 2149 0.564
8 Maine m 8135 4531 0.557
9 New Jersey m 96151 51509 0.536
10 Connecticut m 53682 26782 0.499
# ℹ 92 more rows
Between 2010 and 2019, nearly 73 percent of men released from prison in Massachusetts were released without conditions.
Resources
- R for Data Science (2e), Chapter 3 Data Transformation: https://r4ds.hadley.nz/data-transform#groups
- Statistical Inference via Data Science: A ModernDive into R and the Tidyverse!, Second Edition, Chapter 3: https://moderndive.com/3-wrangling.html#summarize