library(tidyverse)
nps_admit <- read_csv(path = "data/nps-admissions.csv")
nps_admitLesson 10: Create New Columns
Introduction
So far, you’ve learned how to manipulate data by filtering and sorting rows and removing columns. This lesson demonstrates how to create new columns or modify existing columns. You might need to do this if you are calculating a new variable by adding, subtracting, or otherwise altering one or more existing columns. Additionally, you might want to recode or rename some of the values of a character column so that they’re easier to interpret.
For this lesson, you’ll again work with the National Prisoner Statistics program admissions data set. Read that data in from the CSV file and create a new data frame called nps_admit with the following code:
mutate()
In the NPS admissions data, there are columns for male (m) and female (f) admissions, but no total column. To create a new column, use the mutate() function and specify the name of the new column followed by the formula or value of the column. In this case, add m and f to get the total number of admissions.
nps_admit |>
mutate(total = m + f)# A tibble: 11,475 × 7
year state_name state_abbr adm_type m f total
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1978 Alabama AL adm_total 2631 184 2815
2 1978 Alabama AL adm_new_commit 2115 148 2263
3 1978 Alabama AL adm_viol_new 0 0 0
4 1978 Alabama AL adm_viol_tech 150 5 155
5 1978 Alabama AL adm_oth 155 31 186
6 1979 Alabama AL adm_total 2596 223 2819
7 1979 Alabama AL adm_new_commit 2314 178 2492
8 1979 Alabama AL adm_viol_new 0 0 0
9 1979 Alabama AL adm_viol_tech 68 2 70
10 1979 Alabama AL adm_oth 214 43 257
# ℹ 11,465 more rows
Recode columns
Another common task in data cleaning is recoding variables. The adm_type column of the NPS dataset contains codes for admission types. If you wanted to make a table with this data for a report or otherwise share this data, it would be easier to understand if these codes were translated. First, check the possible values contained in the adm_type column with count().
nps_admit |>
count(adm_type)# A tibble: 5 × 2
adm_type n
<chr> <int>
1 adm_new_commit 2295
2 adm_oth 2295
3 adm_total 2295
4 adm_viol_new 2295
5 adm_viol_tech 2295
There are five admission types in the adm_type column. Let’s modify this column and change the admission types to more easily understood categories by using mutate().
nps_admit |>
mutate(
adm_type = case_match(
.x = adm_type,
"adm_new_commit" ~ "New Commitment",
"adm_oth" ~ "Other",
"adm_total" ~ "Total Admissions",
"adm_viol_new" ~ "New Offense Violation",
"adm_viol_tech" ~ "Technical Violation"
)
)# A tibble: 11,475 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Alabama AL Total Admissions 2631 184
2 1978 Alabama AL New Commitment 2115 148
3 1978 Alabama AL New Offense Violation 0 0
4 1978 Alabama AL Technical Violation 150 5
5 1978 Alabama AL Other 155 31
6 1979 Alabama AL Total Admissions 2596 223
7 1979 Alabama AL New Commitment 2314 178
8 1979 Alabama AL New Offense Violation 0 0
9 1979 Alabama AL Technical Violation 68 2
10 1979 Alabama AL Other 214 43
# ℹ 11,465 more rows
This is more involved than the addition you did above to create the total column. It involves using a function, case_match(), which matches existing values to new values of a character column. The arguments to case_match are .x, which is the name of the column to modify, followed by a sequence of pairs of old and new values separated by tildes (~).
The syntax of case_match() can be tricky. If you get an error, make sure your commas, tildes, and parentheses are all in the right places.
Don’t forget to assign
As with other functions that operate on data frames like filter() and select(), unless you assign the changed data frame to a new object, the result only prints in the console rather than modifying the original data frame. Even though you’ve excuted code to create a total column and modified the adm_type column, if you print the nps_admit data frame, you won’t see any changes.
nps_admit# A tibble: 11,475 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1978 Alabama AL adm_total 2631 184
2 1978 Alabama AL adm_new_commit 2115 148
3 1978 Alabama AL adm_viol_new 0 0
4 1978 Alabama AL adm_viol_tech 150 5
5 1978 Alabama AL adm_oth 155 31
6 1979 Alabama AL adm_total 2596 223
7 1979 Alabama AL adm_new_commit 2314 178
8 1979 Alabama AL adm_viol_new 0 0
9 1979 Alabama AL adm_viol_tech 68 2
10 1979 Alabama AL adm_oth 214 43
# ℹ 11,465 more rows
This is because you never assigned the altered data frame to a new object. Create a new data frame called nps_admit_recoded that has the changes you made to the original data frame.
nps_admit_recoded <- nps_admit |>
mutate(
total = m + f,
adm_type = case_match(
.x = adm_type,
"adm_total" ~ "Total Admissions",
"adm_new_commit" ~ "New Commitment",
"adm_viol_new" ~ "New Offense Violation",
"adm_viol_tech" ~ "Technical Violation",
"adm_oth" ~ "Other"
)
)Notice that you can add or modify more than one column at a time within mutate() by separating the columns with a comma. The code above first creates the new total column and then recodes the adm_type column. Print the new data frame to see the new and modified columns.
nps_admit_recoded# A tibble: 11,475 × 7
year state_name state_abbr adm_type m f total
<dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 1978 Alabama AL Total Admissions 2631 184 2815
2 1978 Alabama AL New Commitment 2115 148 2263
3 1978 Alabama AL New Offense Violation 0 0 0
4 1978 Alabama AL Technical Violation 150 5 155
5 1978 Alabama AL Other 155 31 186
6 1979 Alabama AL Total Admissions 2596 223 2819
7 1979 Alabama AL New Commitment 2314 178 2492
8 1979 Alabama AL New Offense Violation 0 0 0
9 1979 Alabama AL Technical Violation 68 2 70
10 1979 Alabama AL Other 214 43 257
# ℹ 11,465 more rows
There are many other ways to modify columns that you’ll learn about in future lessons, but generally all of these techniques are used with the same mutate() pattern to alter a data frame.