library(tidyverse)
nps_admit <- read_csv(path = "data/nps-admissions.csv")Lesson 9: Select and Rename Columns
Introduction
Choosing and renaming columns of a data frame is a common task for data analysts, especially when reporting results in a final table. The data frame may include additional columns that are not relevant for final results, or columns may need to be renamed from short-hand text that is good for code to full names that are good for readers (e.g., adm_type to Admission Type).
The select() and rename() functions allow you to restructure a dataset into the desired format. To demonstrate how select() and rename() work, you’ll use a dataset from previous lessons, National Prison Statistics admissions data. This data can be read in from a CSV file on your computer (described in Lesson 5) using the following code:
rename()
The rename() function requires at least two arguments:
.datais the data frame...is one or more columns to rename in the form ofnew_name = orginal_name
For example, you can take the nps_admit dataset and rename the m and f columns.
rename(nps_admit, male = m, female = f)# A tibble: 11,475 × 6
year state_name state_abbr adm_type male female
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Alabama AL adm_total 2631 184
2 1978 Alabama AL adm_new_commit 2115 148
3 1978 Alabama AL adm_viol_new 0 0
4 1978 Alabama AL adm_viol_tech 150 5
5 1978 Alabama AL adm_oth 155 31
6 1979 Alabama AL adm_total 2596 223
7 1979 Alabama AL adm_new_commit 2314 178
8 1979 Alabama AL adm_viol_new 0 0
9 1979 Alabama AL adm_viol_tech 68 2
10 1979 Alabama AL adm_oth 214 43
# ℹ 11,465 more rows
If you want to include spaces or special characters or start the column name with a number, you need to include backticks, `, around the name.
rename(nps_admit, `State Name` = state_name)# A tibble: 11,475 × 6
year `State Name` state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Alabama AL adm_total 2631 184
2 1978 Alabama AL adm_new_commit 2115 148
3 1978 Alabama AL adm_viol_new 0 0
4 1978 Alabama AL adm_viol_tech 150 5
5 1978 Alabama AL adm_oth 155 31
6 1979 Alabama AL adm_total 2596 223
7 1979 Alabama AL adm_new_commit 2314 178
8 1979 Alabama AL adm_viol_new 0 0
9 1979 Alabama AL adm_viol_tech 68 2
10 1979 Alabama AL adm_oth 214 43
# ℹ 11,465 more rows
The backtick sign is usually located on the top left of a US layout keyboard, next to the 1 key.
Column names like these are considered “non-syntactic.” There are a number of rules in R for what constitutes syntactic column names. Syntactic names in R:
- Contain only letters, numbers, and the dot
.or underscore_characters. - Start with a letter or dot
.. - Are not reserved words, such as
iforfunctionorTRUE.
select()
The select() function requires at least two arguments:
.datais the data frame...is one or more column you want to select or remove; or conditions that specify the columns you want to select or remove
For example, you can take the nps_admit dataset and select only the year column.
select(nps_admit, year)# A tibble: 11,475 × 1
year
<dbl>
1 1978
2 1978
3 1978
4 1978
5 1978
6 1979
7 1979
8 1979
9 1979
10 1979
# ℹ 11,465 more rows
You can also remove a column by using an exclamation point, !, before the name. Notice that the dataset below includes all columns except year.
select(nps_admit, !year)# A tibble: 11,475 × 5
state_name state_abbr adm_type m f
<chr> <chr> <chr> <dbl> <dbl>
1 Alabama AL adm_total 2631 184
2 Alabama AL adm_new_commit 2115 148
3 Alabama AL adm_viol_new 0 0
4 Alabama AL adm_viol_tech 150 5
5 Alabama AL adm_oth 155 31
6 Alabama AL adm_total 2596 223
7 Alabama AL adm_new_commit 2314 178
8 Alabama AL adm_viol_new 0 0
9 Alabama AL adm_viol_tech 68 2
10 Alabama AL adm_oth 214 43
# ℹ 11,465 more rows
The select() function can also be used to change the order of columns in a data frame. Start with the name of the data frame and then add the names of the columns in the order you want them to appear.
select(nps_admit, state_abbr, state_name, adm_type, year, f, m)# A tibble: 11,475 × 6
state_abbr state_name adm_type year f m
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 AL Alabama adm_total 1978 184 2631
2 AL Alabama adm_new_commit 1978 148 2115
3 AL Alabama adm_viol_new 1978 0 0
4 AL Alabama adm_viol_tech 1978 5 150
5 AL Alabama adm_oth 1978 31 155
6 AL Alabama adm_total 1979 223 2596
7 AL Alabama adm_new_commit 1979 178 2314
8 AL Alabama adm_viol_new 1979 0 0
9 AL Alabama adm_viol_tech 1979 2 68
10 AL Alabama adm_oth 1979 43 214
# ℹ 11,465 more rows
Rename columns within select ()
The select() function can also be used to rename the columns that are being selected. Similar to the rename() function, the general form is new_name = old_name. Renaming within the select function is especially useful in the last step prior to reporting because you can select the columns of interest and specify display names for those columns.
select(
nps_admit,
`State Name` = state_name,
`State Abbrevation` = state_abbr,
`Admission Type` = adm_type,
Year = year,
Male = m,
Female = f
)# A tibble: 11,475 × 6
`State Name` `State Abbrevation` `Admission Type` Year Male Female
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Alabama AL adm_total 1978 2631 184
2 Alabama AL adm_new_commit 1978 2115 148
3 Alabama AL adm_viol_new 1978 0 0
4 Alabama AL adm_viol_tech 1978 150 5
5 Alabama AL adm_oth 1978 155 31
6 Alabama AL adm_total 1979 2596 223
7 Alabama AL adm_new_commit 1979 2314 178
8 Alabama AL adm_viol_new 1979 0 0
9 Alabama AL adm_viol_tech 1979 68 2
10 Alabama AL adm_oth 1979 214 43
# ℹ 11,465 more rows
Select helpers
The example dataset, nps_admit, only has 6 columns, but many datasets you are working with will have many more columns. It can be tedious to continually write out all the column names so it can be more convenient to use helpers. Select helpers are additional operators and functions that help select the columns of interest. The following sections describe select helpers you can use in R.
Select range of variables
If you want to select a group of consecutive variables, you can use the colon, :, to select all columns between the first and last column you specify. For example, select the first four columns in the nps_admit dataset by specifying year:adm_type, which will select the columns between year and adm_type.
select(nps_admit, year:adm_type)# A tibble: 11,475 × 4
year state_name state_abbr adm_type
<dbl> <chr> <chr> <chr>
1 1978 Alabama AL adm_total
2 1978 Alabama AL adm_new_commit
3 1978 Alabama AL adm_viol_new
4 1978 Alabama AL adm_viol_tech
5 1978 Alabama AL adm_oth
6 1979 Alabama AL adm_total
7 1979 Alabama AL adm_new_commit
8 1979 Alabama AL adm_viol_new
9 1979 Alabama AL adm_viol_tech
10 1979 Alabama AL adm_oth
# ℹ 11,465 more rows
To remove the first four columns, you can use an exclamation point to negate those columns and “un-select” them.
select(nps_admit, !year:adm_type)# A tibble: 11,475 × 2
m f
<dbl> <dbl>
1 2631 184
2 2115 148
3 0 0
4 150 5
5 155 31
6 2596 223
7 2314 178
8 0 0
9 68 2
10 214 43
# ℹ 11,465 more rows
Select by matching patterns
There are additional helper functions that help select columns of interest based on the names of the columns.
starts_with(): column name starts with a given prefixends_with(): column name ends with a given suffixcontains(): column name contains a given string
To select both the state_name and state_abbr columns, you can use starts_with() to select all columns that begin with "state_".
select(nps_admit, starts_with("state_"))# A tibble: 11,475 × 2
state_name state_abbr
<chr> <chr>
1 Alabama AL
2 Alabama AL
3 Alabama AL
4 Alabama AL
5 Alabama AL
6 Alabama AL
7 Alabama AL
8 Alabama AL
9 Alabama AL
10 Alabama AL
# ℹ 11,465 more rows
You could also select all columns in which the column name includes the letter "a".
select(nps_admit, contains("a"))# A tibble: 11,475 × 4
year state_name state_abbr adm_type
<dbl> <chr> <chr> <chr>
1 1978 Alabama AL adm_total
2 1978 Alabama AL adm_new_commit
3 1978 Alabama AL adm_viol_new
4 1978 Alabama AL adm_viol_tech
5 1978 Alabama AL adm_oth
6 1979 Alabama AL adm_total
7 1979 Alabama AL adm_new_commit
8 1979 Alabama AL adm_viol_new
9 1979 Alabama AL adm_viol_tech
10 1979 Alabama AL adm_oth
# ℹ 11,465 more rows
You can also remove columns using these functions by adding an exclamation point in front of the helper function.
select(nps_admit, !starts_with("state_"))# A tibble: 11,475 × 4
year adm_type m f
<dbl> <chr> <dbl> <dbl>
1 1978 adm_total 2631 184
2 1978 adm_new_commit 2115 148
3 1978 adm_viol_new 0 0
4 1978 adm_viol_tech 150 5
5 1978 adm_oth 155 31
6 1979 adm_total 2596 223
7 1979 adm_new_commit 2314 178
8 1979 adm_viol_new 0 0
9 1979 adm_viol_tech 68 2
10 1979 adm_oth 214 43
# ℹ 11,465 more rows
Select based on data type
One last way to select columns is to use the where() function. This function allows you to choose columns based on an expression. For instance, you can select all columns that are numeric.
select(nps_admit, where(is.numeric))# A tibble: 11,475 × 3
year m f
<dbl> <dbl> <dbl>
1 1978 2631 184
2 1978 2115 148
3 1978 0 0
4 1978 150 5
5 1978 155 31
6 1979 2596 223
7 1979 2314 178
8 1979 0 0
9 1979 68 2
10 1979 214 43
# ℹ 11,465 more rows
Don’t forget to assign
If you want changes to be permanent when selecting or renaming, you need to assign the updated data frame to an object. Without assignment, the changes are shown in the console but will not be retained in the original dataset.
nps_admit |>
rename(male = m, female = f)# A tibble: 11,475 × 6
year state_name state_abbr adm_type male female
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Alabama AL adm_total 2631 184
2 1978 Alabama AL adm_new_commit 2115 148
3 1978 Alabama AL adm_viol_new 0 0
4 1978 Alabama AL adm_viol_tech 150 5
5 1978 Alabama AL adm_oth 155 31
6 1979 Alabama AL adm_total 2596 223
7 1979 Alabama AL adm_new_commit 2314 178
8 1979 Alabama AL adm_viol_new 0 0
9 1979 Alabama AL adm_viol_tech 68 2
10 1979 Alabama AL adm_oth 214 43
# ℹ 11,465 more rows
nps_admit# A tibble: 11,475 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Alabama AL adm_total 2631 184
2 1978 Alabama AL adm_new_commit 2115 148
3 1978 Alabama AL adm_viol_new 0 0
4 1978 Alabama AL adm_viol_tech 150 5
5 1978 Alabama AL adm_oth 155 31
6 1979 Alabama AL adm_total 2596 223
7 1979 Alabama AL adm_new_commit 2314 178
8 1979 Alabama AL adm_viol_new 0 0
9 1979 Alabama AL adm_viol_tech 68 2
10 1979 Alabama AL adm_oth 214 43
# ℹ 11,465 more rows
nps_admit_long_names <- nps_admit |>
rename(male = m, female = f)
nps_admit_long_names# A tibble: 11,475 × 6
year state_name state_abbr adm_type male female
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Alabama AL adm_total 2631 184
2 1978 Alabama AL adm_new_commit 2115 148
3 1978 Alabama AL adm_viol_new 0 0
4 1978 Alabama AL adm_viol_tech 150 5
5 1978 Alabama AL adm_oth 155 31
6 1979 Alabama AL adm_total 2596 223
7 1979 Alabama AL adm_new_commit 2314 178
8 1979 Alabama AL adm_viol_new 0 0
9 1979 Alabama AL adm_viol_tech 68 2
10 1979 Alabama AL adm_oth 214 43
# ℹ 11,465 more rows