library(tidyverse)
nps_admit <- read_csv(path = "data/nps-admissions.csv")Lesson 8: Filter and Sort Rows
Introduction
Filtering data—keeping rows in a data frame that match a certain condition—is one of the most common tasks for data analysis. More often than not, the data you’re working with includes rows that are not of interest to the task at hand. Whether you’re looking at a certain time period or a specific population, using the filter() function will help you subset the data into the rows for your analysis.
Another crucial task for data analysis is sorting, or ordering the rows of a data frame in a specific way. Generally, you’ll want to sort rows based on the values of one or more columns, such as by year or group. You can sort the rows of a data frame with the arrange() function.
To demonstrate how filter() and arrange() work, you will use a dataset from previous lessons, National Prison Statistics admissions data. This data can be read in from a CSV file on your computer (described in Lesson 5) using the following code:
filter()
The filter function requires two arguments:
.datais the data frame you want to filter....is one or more conditions on which to filter the data frame.
For example, you can take the nps_admit dataset and filter to keep only the rows where the state is Minnesota.
filter(nps_admit, state_abbr == "MN")# A tibble: 225 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Minnesota MN adm_total 1284 89
2 1978 Minnesota MN adm_new_commit 840 65
3 1978 Minnesota MN adm_viol_new 122 3
4 1978 Minnesota MN adm_viol_tech 287 3
5 1978 Minnesota MN adm_oth 26 18
6 1979 Minnesota MN adm_total 1332 78
7 1979 Minnesota MN adm_new_commit 884 59
8 1979 Minnesota MN adm_viol_new 136 0
9 1979 Minnesota MN adm_viol_tech 280 8
10 1979 Minnesota MN adm_oth 32 11
# ℹ 215 more rows
Now, you can see that the dataset only contains 225 rows, and the state is always Minnesota. If you want to use the subset of the data in future analysis, you might also consider assigning this smaller dataset to a new object. See Lesson 7 for more information on object assignment.
mn_admit <- filter(nps_admit, state_abbr == "MN") Operators
You can filter a data frame on many conditions and you define these conditions using operators. Operators are symbols that represent specific logical or relational processes such as greater than, less than, or equal to. You can use these operators alone or in combination to achieve the filtered data set you need.
Here are the most common operators used to filter data in R.
| Operator | Meaning | Examples |
|---|---|---|
== |
equal to | filter(nps_admit, year == 2018) Keep rows where the year is equal to 2018 filter(nps_admit, adm_type == "adm_total") Keep rows where the admission type is total admissions |
!= |
not equal to | filter(nps_admit, year != 2018) Keep rows where the year is not 2018 filter(nps_admit, adm_type != "adm_total") Keep rows where the admission type is not total admissions |
> |
greater than | filter(nps_admit, year > 2018)Keep rows where the year is greater than 2018 (2019-2022) |
>= |
greater than or equal to | filter(nps_admit, year >= 2018) Keep rows where the year is greater than or equal to 2018 (2018-2022) |
< |
less than | filter(nps_admit, year < 2018) Keep rows where the year is less than 2018 (1978-2017) |
<= |
less than or equal to | filter(nps_admit, year <= 2018) Keep rows where the year is less than or equal to 2018 (1978-2018) |
%in% |
in, included in set | filter(nps_admit, year %in% c(2018, 2019, 2020)) Keep rows where the year is one of 2018, 2019, 2020 filter(nps_admit, adm_type %in% c("adm_viol_new", "adm_viol_tech")) Keep rows of where the admission type is new offense violations or technical violations |
! |
not, exclude rows that match condition | filter(nps_admit, !adm_type %in% c("adm_viol_new", "adm_viol_tech")) Keep rows where the admission type is not new offense violations or technical violations |
When using the “is equal to” operator, ==, you must use a double equal sign. If you use a single equals sign, you will get an error. If you try to run the code below, you will get an error message. The error message below catches the issues and asks if you meant to use the correct operator, the double equal sign.
filter(nps_admit, year = 2020)
#> Error in `filter()`:
#> ! We detected a named input.
#> ℹ This usually means that you've used `=` instead of `==`.
#> ℹ Did you mean `year == 2020`?
#> Backtrace:
#> 1. dplyr::filter(nps_admit, year = 2020)
#> 2. dplyr:::filter.data.frame(nps_admit, year = 2020)filter(nps_admit, year == 2020)# A tibble: 255 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 2020 Alabama AL adm_total 7154 1476
2 2020 Alabama AL adm_new_commit 4265 922
3 2020 Alabama AL adm_viol_new 159 16
4 2020 Alabama AL adm_viol_tech 633 44
5 2020 Alabama AL adm_oth 1744 494
6 2020 Alaska AK adm_total 1436 228
7 2020 Alaska AK adm_new_commit 1366 216
8 2020 Alaska AK adm_viol_new 0 0
9 2020 Alaska AK adm_viol_tech 70 12
10 2020 Alaska AK adm_oth 0 0
# ℹ 245 more rows
Combinations of conditions
Many times, you’ll want to use more than one condition when filtering your data. Conditions can be combined with “and” statements as well as “or” statements.
“And” statements mean that all conditions must be true for the row to be retained. To combine multiple conditions on an “and” basis, use either an ampersand, &, or a comma, ,. The examples below demonstrate keeping rows where both male (m) and female (f) counts are greater than 100.
filter(nps_admit, m > 100, f > 100)filter(nps_admit, m > 100 & f > 100)# A tibble: 4,917 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Alabama AL adm_total 2631 184
2 1978 Alabama AL adm_new_commit 2115 148
3 1979 Alabama AL adm_total 2596 223
4 1979 Alabama AL adm_new_commit 2314 178
5 1980 Alabama AL adm_total 3584 190
6 1980 Alabama AL adm_new_commit 3569 190
7 1981 Alabama AL adm_total 3773 252
8 1981 Alabama AL adm_new_commit 3755 251
9 1982 Alabama AL adm_total 4168 305
10 1982 Alabama AL adm_new_commit 3982 301
# ℹ 4,907 more rows
“Or” statements mean that either condition can be true for the row to be retained. To combine multiple conditions on an “or” basis, use a vertical line, |. The example below demonstrates keeping rows if either male (m) or female (f) counts are greater than 100. Notice that this filtering includes 3,828 more rows than the previous example.
filter(nps_admit, m > 100 | f > 100)# A tibble: 8,745 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Alabama AL adm_total 2631 184
2 1978 Alabama AL adm_new_commit 2115 148
3 1978 Alabama AL adm_viol_tech 150 5
4 1978 Alabama AL adm_oth 155 31
5 1979 Alabama AL adm_total 2596 223
6 1979 Alabama AL adm_new_commit 2314 178
7 1979 Alabama AL adm_oth 214 43
8 1980 Alabama AL adm_total 3584 190
9 1980 Alabama AL adm_new_commit 3569 190
10 1981 Alabama AL adm_total 3773 252
# ℹ 8,735 more rows
arrange()
Once you’re finished filtering your data frame, you’ll likely want to arrange or sort the rows. Arranging data in a meaningful way can make a table easier to read for the audience. Similar to filter(), the arrange() function starts with a data frame and then follows with a list of conditions for sorting.
.datais the data frame you want to sort....is one or more conditions on which to sort the data frame.
Unlike the filter() function, the arrange() function does not change how many rows are in a data frame, the rows are just ordered in a different way.
To demonstrate the arrange() function, first use the filter() function to create a data set that only includes total prison admissions counts.
total_prison_adm <- filter(nps_admit, adm_type == "adm_total")
total_prison_adm# A tibble: 2,295 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Alabama AL adm_total 2631 184
2 1979 Alabama AL adm_total 2596 223
3 1980 Alabama AL adm_total 3584 190
4 1981 Alabama AL adm_total 3773 252
5 1982 Alabama AL adm_total 4168 305
6 1983 Alabama AL adm_total 4336 326
7 1984 Alabama AL adm_total 4393 362
8 1985 Alabama AL adm_total 4050 357
9 1986 Alabama AL adm_total 3973 311
10 1987 Alabama AL adm_total 4499 344
# ℹ 2,285 more rows
Notice that the total_prison_adm dataset has 2,295 rows. Now, you can take the total_prison_adm dataset and sort it by the m column.
arrange(total_prison_adm, m)# A tibble: 2,295 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1979 North Dakota ND adm_total 144 4
2 1978 North Dakota ND adm_total 147 6
3 1980 North Dakota ND adm_total 191 4
4 1979 Hawaii HI adm_total 197 9
5 1978 Hawaii HI adm_total 198 7
6 1981 North Dakota ND adm_total 206 2
7 1978 New Hampshire NH adm_total 208 6
8 1979 New Hampshire NH adm_total 210 3
9 1980 New Hampshire NH adm_total 211 5
10 1982 Hawaii HI adm_total 215 16
# ℹ 2,285 more rows
You see that the smallest values in the m column are displayed first. By default, numeric columns are sorted in ascending order, from smallest to largest. Notice the data set still has 2,295 rows, the rows are just in a different order than before.
You can also sort by character columns. In this case, sort the nps_admit dataset by the adm_type column.
arrange(nps_admit, adm_type)# A tibble: 11,475 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Alabama AL adm_new_commit 2115 148
2 1979 Alabama AL adm_new_commit 2314 178
3 1980 Alabama AL adm_new_commit 3569 190
4 1981 Alabama AL adm_new_commit 3755 251
5 1982 Alabama AL adm_new_commit 3982 301
6 1983 Alabama AL adm_new_commit 4103 316
7 1984 Alabama AL adm_new_commit 4108 340
8 1985 Alabama AL adm_new_commit 3795 338
9 1986 Alabama AL adm_new_commit 3088 249
10 1987 Alabama AL adm_new_commit 3547 277
# ℹ 11,465 more rows
You see that all the rows with adm_new_commit are displayed first. By default, character columns are sorted in alphabetical order.
To reverse the sort order, wrap the column you want to sort by in the desc() function, which will sort numeric variables from largest to smallest, and character variables in reverse alphabetical order.
arrange(total_prison_adm, desc(m))# A tibble: 2,295 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 2008 California CA adm_total 126929 14070
2 2007 California CA adm_total 125965 13809
3 2006 California CA adm_total 124998 13699
4 1998 California CA adm_total 121235 13250
5 1997 California CA adm_total 119919 12604
6 1999 California CA adm_total 118745 12843
7 2000 California CA adm_total 117571 12609
8 2009 California CA adm_total 117286 12584
9 2005 California CA adm_total 116950 12784
10 2001 California CA adm_total 115708 11620
# ℹ 2,285 more rows
arrange(nps_admit, desc(adm_type))# A tibble: 11,475 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Alabama AL adm_viol_tech 150 5
2 1979 Alabama AL adm_viol_tech 68 2
3 1980 Alabama AL adm_viol_tech 1 0
4 1981 Alabama AL adm_viol_tech 2 0
5 1982 Alabama AL adm_viol_tech 0 0
6 1983 Alabama AL adm_viol_tech NA NA
7 1984 Alabama AL adm_viol_tech NA NA
8 1985 Alabama AL adm_viol_tech NA NA
9 1986 Alabama AL adm_viol_tech 396 19
10 1987 Alabama AL adm_viol_tech 477 28
# ℹ 11,465 more rows
Sort by multiple columns
You will usually want to arrange the data frame by multiple conditions. The arrange function evaluates conditions in order.
arrange(df, sort_var_1, sort_var_2) The data frame will first be sorted by sort_var_1. In instances where rows have the same sort_var_1 value, those rows will be sorted by sort_var_2.
Let’s say you want to arrange the data frame by the following:
- Admission type, alphabetically
- Descending year (most recent year first)
- State name, alphabetically
You can achieve this order by combining three sorting variables.
arrange(nps_admit, adm_type, desc(year), state_name)# A tibble: 11,475 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 2022 Alabama AL adm_new_commit 6116 1247
2 2022 Alaska AK adm_new_commit 1157 292
3 2022 Arizona AZ adm_new_commit 7979 1086
4 2022 Arkansas AR adm_new_commit 3484 698
5 2022 California CA adm_new_commit 22753 1588
6 2022 Colorado CO adm_new_commit 4186 678
7 2022 Connecticut CT adm_new_commit 2704 220
8 2022 Delaware DE adm_new_commit 1449 245
9 2022 District of Columbia DC adm_new_commit NA NA
10 2022 Florida FL adm_new_commit 24626 3049
# ℹ 11,465 more rows
The pipe
Filtering and arranging data requires two separate function calls (filter() and arrange()). To combine multiple functions in a single chunk of code, use the pipe, |> to link and combine function calls.
For example, you might want to create a dataset that only contains data from 2019 and later and is sorted by admission type and year (in descending order). To do this, combine filtering and arranging operations with the |>.
nps_admit |>
filter(year >= 2019) |>
arrange(adm_type, desc(year))# A tibble: 1,020 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 2022 Alabama AL adm_new_commit 6116 1247
2 2022 Alaska AK adm_new_commit 1157 292
3 2022 Arizona AZ adm_new_commit 7979 1086
4 2022 Arkansas AR adm_new_commit 3484 698
5 2022 California CA adm_new_commit 22753 1588
6 2022 Colorado CO adm_new_commit 4186 678
7 2022 Connecticut CT adm_new_commit 2704 220
8 2022 Delaware DE adm_new_commit 1449 245
9 2022 District of Columbia DC adm_new_commit NA NA
10 2022 Florida FL adm_new_commit 24626 3049
# ℹ 1,010 more rows
It can be helpful to read a chunk of code like this out loud:
- Take the
nps_admitdata frame and then… - Filter it to include rows where the year is greater than or equal to 2019 and then…
- Arrange it by
adm_typeand in descending order byyear.
When verbalizing this code, the pipe—|>— is read as “and then,” signifying the linking of operations on the starting data frame.
Don’t forget to assign
For functions that operate on data frames like filter() and arrange(), unless you assign the modified data frame to a new object, the result only prints in the console rather than modifying the original data frame. Even if you’ve executed code to filter and arrange the data frame, if you print nps_admit, you won’t see any changes.
nps_admit# A tibble: 11,475 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Alabama AL adm_total 2631 184
2 1978 Alabama AL adm_new_commit 2115 148
3 1978 Alabama AL adm_viol_new 0 0
4 1978 Alabama AL adm_viol_tech 150 5
5 1978 Alabama AL adm_oth 155 31
6 1979 Alabama AL adm_total 2596 223
7 1979 Alabama AL adm_new_commit 2314 178
8 1979 Alabama AL adm_viol_new 0 0
9 1979 Alabama AL adm_viol_tech 68 2
10 1979 Alabama AL adm_oth 214 43
# ℹ 11,465 more rows
This is because you never assigned the modified data frame to a new object. Create a new data frame called nps_admit_filtered that is the filtered and sorted data frame if you want to save and re-use the modified data frame.
nps_admit_filtered <- nps_admit |>
filter(year >= 2019) |>
arrange(adm_type, desc(year))
nps_admit_filtered# A tibble: 1,020 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 2022 Alabama AL adm_new_commit 6116 1247
2 2022 Alaska AK adm_new_commit 1157 292
3 2022 Arizona AZ adm_new_commit 7979 1086
4 2022 Arkansas AR adm_new_commit 3484 698
5 2022 California CA adm_new_commit 22753 1588
6 2022 Colorado CO adm_new_commit 4186 678
7 2022 Connecticut CT adm_new_commit 2704 220
8 2022 Delaware DE adm_new_commit 1449 245
9 2022 District of Columbia DC adm_new_commit NA NA
10 2022 Florida FL adm_new_commit 24626 3049
# ℹ 1,010 more rows