Lesson 9: Select and Rename Columns

Introduction

Choosing and renaming columns of a data frame is a common task for data analysts, especially when reporting results in a final table. The data frame may include additional columns that are not relevant for final results, or columns may need to be renamed from short-hand text that is good for code to full names that are good for readers (e.g., adm_type to Admission Type).

The select() and rename() functions allow you to restructure a dataset into the desired format. To demonstrate how select() and rename() work, you’ll use a dataset from previous lessons, National Prison Statistics admissions data. This data can be read in from a CSV file on your computer (described in Lesson 5) using the following code:

library(tidyverse)

nps_admit <- read_csv(path = "data/nps-admissions.csv")

rename()

The rename() function requires at least two arguments:

  • .data is the data frame
  • ... is one or more columns to rename in the form of new_name = orginal_name

For example, you can take the nps_admit dataset and rename the m and f columns.

rename(nps_admit, male = m, female = f)
# A tibble: 11,475 × 6
    year state_name state_abbr adm_type        male female
   <dbl> <chr>      <chr>      <chr>          <dbl>  <dbl>
 1  1978 Alabama    AL         adm_total       2631    184
 2  1978 Alabama    AL         adm_new_commit  2115    148
 3  1978 Alabama    AL         adm_viol_new       0      0
 4  1978 Alabama    AL         adm_viol_tech    150      5
 5  1978 Alabama    AL         adm_oth          155     31
 6  1979 Alabama    AL         adm_total       2596    223
 7  1979 Alabama    AL         adm_new_commit  2314    178
 8  1979 Alabama    AL         adm_viol_new       0      0
 9  1979 Alabama    AL         adm_viol_tech     68      2
10  1979 Alabama    AL         adm_oth          214     43
# ℹ 11,465 more rows

If you want to include spaces or special characters or start the column name with a number, you need to include backticks, `, around the name.

rename(nps_admit, `State Name` = state_name)
# A tibble: 11,475 × 6
    year `State Name` state_abbr adm_type           m     f
   <dbl> <chr>        <chr>      <chr>          <dbl> <dbl>
 1  1978 Alabama      AL         adm_total       2631   184
 2  1978 Alabama      AL         adm_new_commit  2115   148
 3  1978 Alabama      AL         adm_viol_new       0     0
 4  1978 Alabama      AL         adm_viol_tech    150     5
 5  1978 Alabama      AL         adm_oth          155    31
 6  1979 Alabama      AL         adm_total       2596   223
 7  1979 Alabama      AL         adm_new_commit  2314   178
 8  1979 Alabama      AL         adm_viol_new       0     0
 9  1979 Alabama      AL         adm_viol_tech     68     2
10  1979 Alabama      AL         adm_oth          214    43
# ℹ 11,465 more rows
Backticks may look like apostrophes but they are a different character

The backtick sign is usually located on the top left of a US layout keyboard, next to the 1 key.

Column names like these are considered “non-syntactic.” There are a number of rules in R for what constitutes syntactic column names. Syntactic names in R:

  • Contain only letters, numbers, and the dot . or underscore _ characters.
  • Start with a letter or dot ..
  • Are not reserved words, such as if or function or TRUE.

select()

The select() function requires at least two arguments:

  • .data is the data frame
  • ... is one or more column you want to select or remove; or conditions that specify the columns you want to select or remove

For example, you can take the nps_admit dataset and select only the year column.

select(nps_admit, year)
# A tibble: 11,475 × 1
    year
   <dbl>
 1  1978
 2  1978
 3  1978
 4  1978
 5  1978
 6  1979
 7  1979
 8  1979
 9  1979
10  1979
# ℹ 11,465 more rows

You can also remove a column by using an exclamation point, !, before the name. Notice that the dataset below includes all columns except year.

select(nps_admit, !year)
# A tibble: 11,475 × 5
   state_name state_abbr adm_type           m     f
   <chr>      <chr>      <chr>          <dbl> <dbl>
 1 Alabama    AL         adm_total       2631   184
 2 Alabama    AL         adm_new_commit  2115   148
 3 Alabama    AL         adm_viol_new       0     0
 4 Alabama    AL         adm_viol_tech    150     5
 5 Alabama    AL         adm_oth          155    31
 6 Alabama    AL         adm_total       2596   223
 7 Alabama    AL         adm_new_commit  2314   178
 8 Alabama    AL         adm_viol_new       0     0
 9 Alabama    AL         adm_viol_tech     68     2
10 Alabama    AL         adm_oth          214    43
# ℹ 11,465 more rows

The select() function can also be used to change the order of columns in a data frame. Start with the name of the data frame and then add the names of the columns in the order you want them to appear.

select(nps_admit, state_abbr, state_name, adm_type, year, f, m)
# A tibble: 11,475 × 6
   state_abbr state_name adm_type        year     f     m
   <chr>      <chr>      <chr>          <dbl> <dbl> <dbl>
 1 AL         Alabama    adm_total       1978   184  2631
 2 AL         Alabama    adm_new_commit  1978   148  2115
 3 AL         Alabama    adm_viol_new    1978     0     0
 4 AL         Alabama    adm_viol_tech   1978     5   150
 5 AL         Alabama    adm_oth         1978    31   155
 6 AL         Alabama    adm_total       1979   223  2596
 7 AL         Alabama    adm_new_commit  1979   178  2314
 8 AL         Alabama    adm_viol_new    1979     0     0
 9 AL         Alabama    adm_viol_tech   1979     2    68
10 AL         Alabama    adm_oth         1979    43   214
# ℹ 11,465 more rows

Rename columns within select ()

The select() function can also be used to rename the columns that are being selected. Similar to the rename() function, the general form is new_name = old_name. Renaming within the select function is especially useful in the last step prior to reporting because you can select the columns of interest and specify display names for those columns.

select(
  nps_admit, 
  `State Name` = state_name, 
  `State Abbrevation` = state_abbr, 
  `Admission Type` = adm_type, 
  Year = year, 
  Male = m, 
  Female = f
  )
# A tibble: 11,475 × 6
   `State Name` `State Abbrevation` `Admission Type`  Year  Male Female
   <chr>        <chr>               <chr>            <dbl> <dbl>  <dbl>
 1 Alabama      AL                  adm_total         1978  2631    184
 2 Alabama      AL                  adm_new_commit    1978  2115    148
 3 Alabama      AL                  adm_viol_new      1978     0      0
 4 Alabama      AL                  adm_viol_tech     1978   150      5
 5 Alabama      AL                  adm_oth           1978   155     31
 6 Alabama      AL                  adm_total         1979  2596    223
 7 Alabama      AL                  adm_new_commit    1979  2314    178
 8 Alabama      AL                  adm_viol_new      1979     0      0
 9 Alabama      AL                  adm_viol_tech     1979    68      2
10 Alabama      AL                  adm_oth           1979   214     43
# ℹ 11,465 more rows

Select helpers

The example dataset, nps_admit, only has 6 columns, but many datasets you are working with will have many more columns. It can be tedious to continually write out all the column names so it can be more convenient to use helpers. Select helpers are additional operators and functions that help select the columns of interest. The following sections describe select helpers you can use in R.

Select range of variables

If you want to select a group of consecutive variables, you can use the colon, :, to select all columns between the first and last column you specify. For example, select the first four columns in the nps_admit dataset by specifying year:adm_type, which will select the columns between year and adm_type.

select(nps_admit, year:adm_type)
# A tibble: 11,475 × 4
    year state_name state_abbr adm_type      
   <dbl> <chr>      <chr>      <chr>         
 1  1978 Alabama    AL         adm_total     
 2  1978 Alabama    AL         adm_new_commit
 3  1978 Alabama    AL         adm_viol_new  
 4  1978 Alabama    AL         adm_viol_tech 
 5  1978 Alabama    AL         adm_oth       
 6  1979 Alabama    AL         adm_total     
 7  1979 Alabama    AL         adm_new_commit
 8  1979 Alabama    AL         adm_viol_new  
 9  1979 Alabama    AL         adm_viol_tech 
10  1979 Alabama    AL         adm_oth       
# ℹ 11,465 more rows

To remove the first four columns, you can use an exclamation point to negate those columns and “un-select” them.

select(nps_admit, !year:adm_type)
# A tibble: 11,475 × 2
       m     f
   <dbl> <dbl>
 1  2631   184
 2  2115   148
 3     0     0
 4   150     5
 5   155    31
 6  2596   223
 7  2314   178
 8     0     0
 9    68     2
10   214    43
# ℹ 11,465 more rows

Select by matching patterns

There are additional helper functions that help select columns of interest based on the names of the columns.

  • starts_with(): column name starts with a given prefix
  • ends_with(): column name ends with a given suffix
  • contains(): column name contains a given string

To select both the state_name and state_abbr columns, you can use starts_with() to select all columns that begin with "state_".

select(nps_admit, starts_with("state_"))
# A tibble: 11,475 × 2
   state_name state_abbr
   <chr>      <chr>     
 1 Alabama    AL        
 2 Alabama    AL        
 3 Alabama    AL        
 4 Alabama    AL        
 5 Alabama    AL        
 6 Alabama    AL        
 7 Alabama    AL        
 8 Alabama    AL        
 9 Alabama    AL        
10 Alabama    AL        
# ℹ 11,465 more rows

You could also select all columns in which the column name includes the letter "a".

select(nps_admit, contains("a"))
# A tibble: 11,475 × 4
    year state_name state_abbr adm_type      
   <dbl> <chr>      <chr>      <chr>         
 1  1978 Alabama    AL         adm_total     
 2  1978 Alabama    AL         adm_new_commit
 3  1978 Alabama    AL         adm_viol_new  
 4  1978 Alabama    AL         adm_viol_tech 
 5  1978 Alabama    AL         adm_oth       
 6  1979 Alabama    AL         adm_total     
 7  1979 Alabama    AL         adm_new_commit
 8  1979 Alabama    AL         adm_viol_new  
 9  1979 Alabama    AL         adm_viol_tech 
10  1979 Alabama    AL         adm_oth       
# ℹ 11,465 more rows

You can also remove columns using these functions by adding an exclamation point in front of the helper function.

select(nps_admit, !starts_with("state_"))
# A tibble: 11,475 × 4
    year adm_type           m     f
   <dbl> <chr>          <dbl> <dbl>
 1  1978 adm_total       2631   184
 2  1978 adm_new_commit  2115   148
 3  1978 adm_viol_new       0     0
 4  1978 adm_viol_tech    150     5
 5  1978 adm_oth          155    31
 6  1979 adm_total       2596   223
 7  1979 adm_new_commit  2314   178
 8  1979 adm_viol_new       0     0
 9  1979 adm_viol_tech     68     2
10  1979 adm_oth          214    43
# ℹ 11,465 more rows

Select based on data type

One last way to select columns is to use the where() function. This function allows you to choose columns based on an expression. For instance, you can select all columns that are numeric.

select(nps_admit, where(is.numeric))
# A tibble: 11,475 × 3
    year     m     f
   <dbl> <dbl> <dbl>
 1  1978  2631   184
 2  1978  2115   148
 3  1978     0     0
 4  1978   150     5
 5  1978   155    31
 6  1979  2596   223
 7  1979  2314   178
 8  1979     0     0
 9  1979    68     2
10  1979   214    43
# ℹ 11,465 more rows

Don’t forget to assign

If you want changes to be permanent when selecting or renaming, you need to assign the updated data frame to an object. Without assignment, the changes are shown in the console but will not be retained in the original dataset.

nps_admit |> 
  rename(male = m, female = f)
# A tibble: 11,475 × 6
    year state_name state_abbr adm_type        male female
   <dbl> <chr>      <chr>      <chr>          <dbl>  <dbl>
 1  1978 Alabama    AL         adm_total       2631    184
 2  1978 Alabama    AL         adm_new_commit  2115    148
 3  1978 Alabama    AL         adm_viol_new       0      0
 4  1978 Alabama    AL         adm_viol_tech    150      5
 5  1978 Alabama    AL         adm_oth          155     31
 6  1979 Alabama    AL         adm_total       2596    223
 7  1979 Alabama    AL         adm_new_commit  2314    178
 8  1979 Alabama    AL         adm_viol_new       0      0
 9  1979 Alabama    AL         adm_viol_tech     68      2
10  1979 Alabama    AL         adm_oth          214     43
# ℹ 11,465 more rows
nps_admit
# A tibble: 11,475 × 6
    year state_name state_abbr adm_type           m     f
   <dbl> <chr>      <chr>      <chr>          <dbl> <dbl>
 1  1978 Alabama    AL         adm_total       2631   184
 2  1978 Alabama    AL         adm_new_commit  2115   148
 3  1978 Alabama    AL         adm_viol_new       0     0
 4  1978 Alabama    AL         adm_viol_tech    150     5
 5  1978 Alabama    AL         adm_oth          155    31
 6  1979 Alabama    AL         adm_total       2596   223
 7  1979 Alabama    AL         adm_new_commit  2314   178
 8  1979 Alabama    AL         adm_viol_new       0     0
 9  1979 Alabama    AL         adm_viol_tech     68     2
10  1979 Alabama    AL         adm_oth          214    43
# ℹ 11,465 more rows
nps_admit_long_names <- nps_admit |> 
  rename(male = m, female = f)

nps_admit_long_names
# A tibble: 11,475 × 6
    year state_name state_abbr adm_type        male female
   <dbl> <chr>      <chr>      <chr>          <dbl>  <dbl>
 1  1978 Alabama    AL         adm_total       2631    184
 2  1978 Alabama    AL         adm_new_commit  2115    148
 3  1978 Alabama    AL         adm_viol_new       0      0
 4  1978 Alabama    AL         adm_viol_tech    150      5
 5  1978 Alabama    AL         adm_oth          155     31
 6  1979 Alabama    AL         adm_total       2596    223
 7  1979 Alabama    AL         adm_new_commit  2314    178
 8  1979 Alabama    AL         adm_viol_new       0      0
 9  1979 Alabama    AL         adm_viol_tech     68      2
10  1979 Alabama    AL         adm_oth          214     43
# ℹ 11,465 more rows

Resources