Lesson 8: Filter and Sort Rows

Introduction

Filtering data—keeping rows in a data frame that match a certain condition—is one of the most common tasks for data analysis. More often than not, the data you’re working with includes rows that are not of interest to the task at hand. Whether you’re looking at a certain time period or a specific population, using the filter() function will help you subset the data into the rows for your analysis.

Another crucial task for data analysis is sorting, or ordering the rows of a data frame in a specific way. Generally, you’ll want to sort rows based on the values of one or more columns, such as by year or group. You can sort the rows of a data frame with the arrange() function.

To demonstrate how filter() and arrange() work, you will use a dataset from previous lessons, National Prison Statistics admissions data. This data can be read in from a CSV file on your computer (described in Lesson 5) using the following code:

library(tidyverse)

nps_admit <- read_csv(path = "data/nps-admissions.csv")

filter()

The filter function requires two arguments:

  • .data is the data frame you want to filter.
  • ... is one or more conditions on which to filter the data frame.

For example, you can take the nps_admit dataset and filter to keep only the rows where the state is Minnesota.

filter(nps_admit, state_abbr == "MN")
# A tibble: 225 × 6
    year state_name state_abbr adm_type           m     f
   <dbl> <chr>      <chr>      <chr>          <dbl> <dbl>
 1  1978 Minnesota  MN         adm_total       1284    89
 2  1978 Minnesota  MN         adm_new_commit   840    65
 3  1978 Minnesota  MN         adm_viol_new     122     3
 4  1978 Minnesota  MN         adm_viol_tech    287     3
 5  1978 Minnesota  MN         adm_oth           26    18
 6  1979 Minnesota  MN         adm_total       1332    78
 7  1979 Minnesota  MN         adm_new_commit   884    59
 8  1979 Minnesota  MN         adm_viol_new     136     0
 9  1979 Minnesota  MN         adm_viol_tech    280     8
10  1979 Minnesota  MN         adm_oth           32    11
# ℹ 215 more rows

Now, you can see that the dataset only contains 225 rows, and the state is always Minnesota. If you want to use the subset of the data in future analysis, you might also consider assigning this smaller dataset to a new object. See Lesson 7 for more information on object assignment.

mn_admit <- filter(nps_admit, state_abbr == "MN") 

Operators

You can filter a data frame on many conditions and you define these conditions using operators. Operators are symbols that represent specific logical or relational processes such as greater than, less than, or equal to. You can use these operators alone or in combination to achieve the filtered data set you need.

Here are the most common operators used to filter data in R.

Operator Meaning Examples
== equal to filter(nps_admit, year == 2018)
Keep rows where the year is equal to 2018

filter(nps_admit, adm_type == "adm_total")
Keep rows where the admission type is total admissions
!= not equal to filter(nps_admit, year != 2018)
Keep rows where the year is not 2018

filter(nps_admit, adm_type != "adm_total")
Keep rows where the admission type is not total admissions
> greater than filter(nps_admit, year > 2018)
Keep rows where the year is greater than 2018 (2019-2022)
>= greater than or equal to filter(nps_admit, year >= 2018)
Keep rows where the year is greater than or equal to 2018 (2018-2022)
< less than filter(nps_admit, year < 2018)
Keep rows where the year is less than 2018 (1978-2017)
<= less than or equal to filter(nps_admit, year <= 2018)
Keep rows where the year is less than or equal to 2018 (1978-2018)
%in% in, included in set filter(nps_admit, year %in% c(2018, 2019, 2020))
Keep rows where the year is one of 2018, 2019, 2020

filter(nps_admit, adm_type %in% c("adm_viol_new", "adm_viol_tech"))
Keep rows of where the admission type is new offense violations or technical violations
! not, exclude rows that match condition filter(nps_admit, !adm_type %in% c("adm_viol_new", "adm_viol_tech"))
Keep rows where the admission type is not new offense violations or technical violations
Don’t forget the double equal sign!

When using the “is equal to” operator, ==, you must use a double equal sign. If you use a single equals sign, you will get an error. If you try to run the code below, you will get an error message. The error message below catches the issues and asks if you meant to use the correct operator, the double equal sign.

filter(nps_admit, year = 2020)

#> Error in `filter()`:
#> ! We detected a named input.
#> ℹ This usually means that you've used `=` instead of `==`.
#> ℹ Did you mean `year == 2020`?
#> Backtrace:
#>  1. dplyr::filter(nps_admit, year = 2020)
#>  2. dplyr:::filter.data.frame(nps_admit, year = 2020)
filter(nps_admit, year == 2020)
# A tibble: 255 × 6
    year state_name state_abbr adm_type           m     f
   <dbl> <chr>      <chr>      <chr>          <dbl> <dbl>
 1  2020 Alabama    AL         adm_total       7154  1476
 2  2020 Alabama    AL         adm_new_commit  4265   922
 3  2020 Alabama    AL         adm_viol_new     159    16
 4  2020 Alabama    AL         adm_viol_tech    633    44
 5  2020 Alabama    AL         adm_oth         1744   494
 6  2020 Alaska     AK         adm_total       1436   228
 7  2020 Alaska     AK         adm_new_commit  1366   216
 8  2020 Alaska     AK         adm_viol_new       0     0
 9  2020 Alaska     AK         adm_viol_tech     70    12
10  2020 Alaska     AK         adm_oth            0     0
# ℹ 245 more rows

Combinations of conditions

Many times, you’ll want to use more than one condition when filtering your data. Conditions can be combined with “and” statements as well as “or” statements.

“And” statements mean that all conditions must be true for the row to be retained. To combine multiple conditions on an “and” basis, use either an ampersand, &, or a comma, ,. The examples below demonstrate keeping rows where both male (m) and female (f) counts are greater than 100.

filter(nps_admit, m > 100, f > 100)
filter(nps_admit, m > 100 & f > 100)
# A tibble: 4,917 × 6
    year state_name state_abbr adm_type           m     f
   <dbl> <chr>      <chr>      <chr>          <dbl> <dbl>
 1  1978 Alabama    AL         adm_total       2631   184
 2  1978 Alabama    AL         adm_new_commit  2115   148
 3  1979 Alabama    AL         adm_total       2596   223
 4  1979 Alabama    AL         adm_new_commit  2314   178
 5  1980 Alabama    AL         adm_total       3584   190
 6  1980 Alabama    AL         adm_new_commit  3569   190
 7  1981 Alabama    AL         adm_total       3773   252
 8  1981 Alabama    AL         adm_new_commit  3755   251
 9  1982 Alabama    AL         adm_total       4168   305
10  1982 Alabama    AL         adm_new_commit  3982   301
# ℹ 4,907 more rows

“Or” statements mean that either condition can be true for the row to be retained. To combine multiple conditions on an “or” basis, use a vertical line, |. The example below demonstrates keeping rows if either male (m) or female (f) counts are greater than 100. Notice that this filtering includes 3,828 more rows than the previous example.

filter(nps_admit, m > 100 | f > 100)
# A tibble: 8,745 × 6
    year state_name state_abbr adm_type           m     f
   <dbl> <chr>      <chr>      <chr>          <dbl> <dbl>
 1  1978 Alabama    AL         adm_total       2631   184
 2  1978 Alabama    AL         adm_new_commit  2115   148
 3  1978 Alabama    AL         adm_viol_tech    150     5
 4  1978 Alabama    AL         adm_oth          155    31
 5  1979 Alabama    AL         adm_total       2596   223
 6  1979 Alabama    AL         adm_new_commit  2314   178
 7  1979 Alabama    AL         adm_oth          214    43
 8  1980 Alabama    AL         adm_total       3584   190
 9  1980 Alabama    AL         adm_new_commit  3569   190
10  1981 Alabama    AL         adm_total       3773   252
# ℹ 8,735 more rows

arrange()

Once you’re finished filtering your data frame, you’ll likely want to arrange or sort the rows. Arranging data in a meaningful way can make a table easier to read for the audience. Similar to filter(), the arrange() function starts with a data frame and then follows with a list of conditions for sorting.

  • .data is the data frame you want to sort.
  • ... is one or more conditions on which to sort the data frame.

Unlike the filter() function, the arrange() function does not change how many rows are in a data frame, the rows are just ordered in a different way.

To demonstrate the arrange() function, first use the filter() function to create a data set that only includes total prison admissions counts.

total_prison_adm <- filter(nps_admit, adm_type == "adm_total")
total_prison_adm
# A tibble: 2,295 × 6
    year state_name state_abbr adm_type      m     f
   <dbl> <chr>      <chr>      <chr>     <dbl> <dbl>
 1  1978 Alabama    AL         adm_total  2631   184
 2  1979 Alabama    AL         adm_total  2596   223
 3  1980 Alabama    AL         adm_total  3584   190
 4  1981 Alabama    AL         adm_total  3773   252
 5  1982 Alabama    AL         adm_total  4168   305
 6  1983 Alabama    AL         adm_total  4336   326
 7  1984 Alabama    AL         adm_total  4393   362
 8  1985 Alabama    AL         adm_total  4050   357
 9  1986 Alabama    AL         adm_total  3973   311
10  1987 Alabama    AL         adm_total  4499   344
# ℹ 2,285 more rows

Notice that the total_prison_adm dataset has 2,295 rows. Now, you can take the total_prison_adm dataset and sort it by the m column.

arrange(total_prison_adm, m)
# A tibble: 2,295 × 6
    year state_name    state_abbr adm_type      m     f
   <dbl> <chr>         <chr>      <chr>     <dbl> <dbl>
 1  1979 North Dakota  ND         adm_total   144     4
 2  1978 North Dakota  ND         adm_total   147     6
 3  1980 North Dakota  ND         adm_total   191     4
 4  1979 Hawaii        HI         adm_total   197     9
 5  1978 Hawaii        HI         adm_total   198     7
 6  1981 North Dakota  ND         adm_total   206     2
 7  1978 New Hampshire NH         adm_total   208     6
 8  1979 New Hampshire NH         adm_total   210     3
 9  1980 New Hampshire NH         adm_total   211     5
10  1982 Hawaii        HI         adm_total   215    16
# ℹ 2,285 more rows

You see that the smallest values in the m column are displayed first. By default, numeric columns are sorted in ascending order, from smallest to largest. Notice the data set still has 2,295 rows, the rows are just in a different order than before.

You can also sort by character columns. In this case, sort the nps_admit dataset by the adm_type column.

arrange(nps_admit, adm_type)
# A tibble: 11,475 × 6
    year state_name state_abbr adm_type           m     f
   <dbl> <chr>      <chr>      <chr>          <dbl> <dbl>
 1  1978 Alabama    AL         adm_new_commit  2115   148
 2  1979 Alabama    AL         adm_new_commit  2314   178
 3  1980 Alabama    AL         adm_new_commit  3569   190
 4  1981 Alabama    AL         adm_new_commit  3755   251
 5  1982 Alabama    AL         adm_new_commit  3982   301
 6  1983 Alabama    AL         adm_new_commit  4103   316
 7  1984 Alabama    AL         adm_new_commit  4108   340
 8  1985 Alabama    AL         adm_new_commit  3795   338
 9  1986 Alabama    AL         adm_new_commit  3088   249
10  1987 Alabama    AL         adm_new_commit  3547   277
# ℹ 11,465 more rows

You see that all the rows with adm_new_commit are displayed first. By default, character columns are sorted in alphabetical order.

To reverse the sort order, wrap the column you want to sort by in the desc() function, which will sort numeric variables from largest to smallest, and character variables in reverse alphabetical order.

arrange(total_prison_adm, desc(m))
# A tibble: 2,295 × 6
    year state_name state_abbr adm_type       m     f
   <dbl> <chr>      <chr>      <chr>      <dbl> <dbl>
 1  2008 California CA         adm_total 126929 14070
 2  2007 California CA         adm_total 125965 13809
 3  2006 California CA         adm_total 124998 13699
 4  1998 California CA         adm_total 121235 13250
 5  1997 California CA         adm_total 119919 12604
 6  1999 California CA         adm_total 118745 12843
 7  2000 California CA         adm_total 117571 12609
 8  2009 California CA         adm_total 117286 12584
 9  2005 California CA         adm_total 116950 12784
10  2001 California CA         adm_total 115708 11620
# ℹ 2,285 more rows
arrange(nps_admit, desc(adm_type))
# A tibble: 11,475 × 6
    year state_name state_abbr adm_type          m     f
   <dbl> <chr>      <chr>      <chr>         <dbl> <dbl>
 1  1978 Alabama    AL         adm_viol_tech   150     5
 2  1979 Alabama    AL         adm_viol_tech    68     2
 3  1980 Alabama    AL         adm_viol_tech     1     0
 4  1981 Alabama    AL         adm_viol_tech     2     0
 5  1982 Alabama    AL         adm_viol_tech     0     0
 6  1983 Alabama    AL         adm_viol_tech    NA    NA
 7  1984 Alabama    AL         adm_viol_tech    NA    NA
 8  1985 Alabama    AL         adm_viol_tech    NA    NA
 9  1986 Alabama    AL         adm_viol_tech   396    19
10  1987 Alabama    AL         adm_viol_tech   477    28
# ℹ 11,465 more rows

Sort by multiple columns

You will usually want to arrange the data frame by multiple conditions. The arrange function evaluates conditions in order.

arrange(df, sort_var_1, sort_var_2) 

The data frame will first be sorted by sort_var_1. In instances where rows have the same sort_var_1 value, those rows will be sorted by sort_var_2.

Let’s say you want to arrange the data frame by the following:

  • Admission type, alphabetically
  • Descending year (most recent year first)
  • State name, alphabetically

You can achieve this order by combining three sorting variables.

arrange(nps_admit, adm_type, desc(year), state_name)
# A tibble: 11,475 × 6
    year state_name           state_abbr adm_type           m     f
   <dbl> <chr>                <chr>      <chr>          <dbl> <dbl>
 1  2022 Alabama              AL         adm_new_commit  6116  1247
 2  2022 Alaska               AK         adm_new_commit  1157   292
 3  2022 Arizona              AZ         adm_new_commit  7979  1086
 4  2022 Arkansas             AR         adm_new_commit  3484   698
 5  2022 California           CA         adm_new_commit 22753  1588
 6  2022 Colorado             CO         adm_new_commit  4186   678
 7  2022 Connecticut          CT         adm_new_commit  2704   220
 8  2022 Delaware             DE         adm_new_commit  1449   245
 9  2022 District of Columbia DC         adm_new_commit    NA    NA
10  2022 Florida              FL         adm_new_commit 24626  3049
# ℹ 11,465 more rows

The pipe

Filtering and arranging data requires two separate function calls (filter() and arrange()). To combine multiple functions in a single chunk of code, use the pipe, |> to link and combine function calls.

For example, you might want to create a dataset that only contains data from 2019 and later and is sorted by admission type and year (in descending order). To do this, combine filtering and arranging operations with the |>.

nps_admit |> 
  filter(year >= 2019) |> 
  arrange(adm_type, desc(year))
# A tibble: 1,020 × 6
    year state_name           state_abbr adm_type           m     f
   <dbl> <chr>                <chr>      <chr>          <dbl> <dbl>
 1  2022 Alabama              AL         adm_new_commit  6116  1247
 2  2022 Alaska               AK         adm_new_commit  1157   292
 3  2022 Arizona              AZ         adm_new_commit  7979  1086
 4  2022 Arkansas             AR         adm_new_commit  3484   698
 5  2022 California           CA         adm_new_commit 22753  1588
 6  2022 Colorado             CO         adm_new_commit  4186   678
 7  2022 Connecticut          CT         adm_new_commit  2704   220
 8  2022 Delaware             DE         adm_new_commit  1449   245
 9  2022 District of Columbia DC         adm_new_commit    NA    NA
10  2022 Florida              FL         adm_new_commit 24626  3049
# ℹ 1,010 more rows

It can be helpful to read a chunk of code like this out loud:

  1. Take the nps_admit data frame and then…
  2. Filter it to include rows where the year is greater than or equal to 2019 and then…
  3. Arrange it by adm_type and in descending order by year.

When verbalizing this code, the pipe—|>— is read as “and then,” signifying the linking of operations on the starting data frame.

Don’t forget to assign

For functions that operate on data frames like filter() and arrange(), unless you assign the modified data frame to a new object, the result only prints in the console rather than modifying the original data frame. Even if you’ve executed code to filter and arrange the data frame, if you print nps_admit, you won’t see any changes.

nps_admit
# A tibble: 11,475 × 6
    year state_name state_abbr adm_type           m     f
   <dbl> <chr>      <chr>      <chr>          <dbl> <dbl>
 1  1978 Alabama    AL         adm_total       2631   184
 2  1978 Alabama    AL         adm_new_commit  2115   148
 3  1978 Alabama    AL         adm_viol_new       0     0
 4  1978 Alabama    AL         adm_viol_tech    150     5
 5  1978 Alabama    AL         adm_oth          155    31
 6  1979 Alabama    AL         adm_total       2596   223
 7  1979 Alabama    AL         adm_new_commit  2314   178
 8  1979 Alabama    AL         adm_viol_new       0     0
 9  1979 Alabama    AL         adm_viol_tech     68     2
10  1979 Alabama    AL         adm_oth          214    43
# ℹ 11,465 more rows

This is because you never assigned the modified data frame to a new object. Create a new data frame called nps_admit_filtered that is the filtered and sorted data frame if you want to save and re-use the modified data frame.

nps_admit_filtered <- nps_admit |> 
  filter(year >= 2019) |> 
  arrange(adm_type, desc(year))

nps_admit_filtered
# A tibble: 1,020 × 6
    year state_name           state_abbr adm_type           m     f
   <dbl> <chr>                <chr>      <chr>          <dbl> <dbl>
 1  2022 Alabama              AL         adm_new_commit  6116  1247
 2  2022 Alaska               AK         adm_new_commit  1157   292
 3  2022 Arizona              AZ         adm_new_commit  7979  1086
 4  2022 Arkansas             AR         adm_new_commit  3484   698
 5  2022 California           CA         adm_new_commit 22753  1588
 6  2022 Colorado             CO         adm_new_commit  4186   678
 7  2022 Connecticut          CT         adm_new_commit  2704   220
 8  2022 Delaware             DE         adm_new_commit  1449   245
 9  2022 District of Columbia DC         adm_new_commit    NA    NA
10  2022 Florida              FL         adm_new_commit 24626  3049
# ℹ 1,010 more rows

Resources