Lesson 10: Create New Columns

Introduction

So far, you’ve learned how to manipulate data by filtering and sorting rows and removing columns. This lesson demonstrates how to create new columns or modify existing columns. You might need to do this if you are calculating a new variable by adding, subtracting, or otherwise altering one or more existing columns. Additionally, you might want to recode or rename some of the values of a character column so that they’re easier to interpret.

For this lesson, you’ll again work with the National Prisoner Statistics program admissions data set. Read that data in from the CSV file and create a new data frame called nps_admit with the following code:

library(tidyverse)

nps_admit <- read_csv(path = "data/nps-admissions.csv")
nps_admit

`mutate()`

In the NPS admissions data, there are columns for male (m) and female (f) admissions, but no total column. To create a new column, use the mutate() function and specify the name of the new column followed by the formula or value of the column. In this case, add m and f to get the total number of admissions.

nps_admit |> 
  mutate(total = m + f)

# A tibble: 11,475 × 7
    year state_name state_abbr adm_type           m     f total
   <dbl> <chr>      <chr>      <chr>          <dbl> <dbl> <dbl>
 1  1978 Alabama    AL         adm_total       2631   184  2815
 2  1978 Alabama    AL         adm_new_commit  2115   148  2263
 3  1978 Alabama    AL         adm_viol_new       0     0     0
 4  1978 Alabama    AL         adm_viol_tech    150     5   155
 5  1978 Alabama    AL         adm_oth          155    31   186
 6  1979 Alabama    AL         adm_total       2596   223  2819
 7  1979 Alabama    AL         adm_new_commit  2314   178  2492
 8  1979 Alabama    AL         adm_viol_new       0     0     0
 9  1979 Alabama    AL         adm_viol_tech     68     2    70
10  1979 Alabama    AL         adm_oth          214    43   257
# ℹ 11,465 more rows

Recode columns

Another common task in data cleaning is recoding variables. The adm_type column of the NPS dataset contains codes for admission types. If you wanted to make a table with this data for a report or otherwise share this data, it would be easier to understand if these codes were translated. First, check the possible values contained in the adm_type column with count().

nps_admit |> 
  count(adm_type)

# A tibble: 5 × 2
  adm_type           n
  <chr>          <int>
1 adm_new_commit  2295
2 adm_oth         2295
3 adm_total       2295
4 adm_viol_new    2295
5 adm_viol_tech   2295

There are five admission types in the adm_type column. Let’s modify this column and change the admission types to more easily understood categories by using mutate().

nps_admit |> 
  mutate(
    adm_type = case_match(
      .x = adm_type,
      "adm_new_commit" ~ "New Commitment",
      "adm_oth"        ~ "Other",
      "adm_total"      ~ "Total Admissions",
      "adm_viol_new"   ~ "New Offense Violation",            
      "adm_viol_tech"  ~ "Technical Violation"
    )
  )

# A tibble: 11,475 × 6
    year state_name state_abbr adm_type                  m     f
   <dbl> <chr>      <chr>      <chr>                 <dbl> <dbl>
 1  1978 Alabama    AL         Total Admissions       2631   184
 2  1978 Alabama    AL         New Commitment         2115   148
 3  1978 Alabama    AL         New Offense Violation     0     0
 4  1978 Alabama    AL         Technical Violation     150     5
 5  1978 Alabama    AL         Other                   155    31
 6  1979 Alabama    AL         Total Admissions       2596   223
 7  1979 Alabama    AL         New Commitment         2314   178
 8  1979 Alabama    AL         New Offense Violation     0     0
 9  1979 Alabama    AL         Technical Violation      68     2
10  1979 Alabama    AL         Other                   214    43
# ℹ 11,465 more rows

This is more involved than the addition you did above to create the total column. It involves using a function, case_match(), which matches existing values to new values of a character column. The arguments to case_match are .x, which is the name of the column to modify, followed by a sequence of pairs of old and new values separated by tildes (~).

The syntax of case_match() can be tricky. If you get an error, make sure your commas, tildes, and parentheses are all in the right places.

Don’t forget to assign

As with other functions that operate on data frames like filter() and select(), unless you assign the changed data frame to a new object, the result only prints in the console rather than modifying the original data frame. Even though you’ve excuted code to create a total column and modified the adm_type column, if you print the nps_admit data frame, you won’t see any changes.

nps_admit

# A tibble: 11,475 × 6
    year state_name state_abbr adm_type           m     f
   <dbl> <chr>      <chr>      <chr>          <dbl> <dbl>
 1  1978 Alabama    AL         adm_total       2631   184
 2  1978 Alabama    AL         adm_new_commit  2115   148
 3  1978 Alabama    AL         adm_viol_new       0     0
 4  1978 Alabama    AL         adm_viol_tech    150     5
 5  1978 Alabama    AL         adm_oth          155    31
 6  1979 Alabama    AL         adm_total       2596   223
 7  1979 Alabama    AL         adm_new_commit  2314   178
 8  1979 Alabama    AL         adm_viol_new       0     0
 9  1979 Alabama    AL         adm_viol_tech     68     2
10  1979 Alabama    AL         adm_oth          214    43
# ℹ 11,465 more rows

This is because you never assigned the altered data frame to a new object. Create a new data frame called nps_admit_recoded that has the changes you made to the original data frame.

nps_admit_recoded <- nps_admit |> 
  mutate(
    total = m + f,
    adm_type = case_match(
      .x = adm_type,
      "adm_total"      ~ "Total Admissions",
      "adm_new_commit" ~ "New Commitment",
      "adm_viol_new"   ~ "New Offense Violation",
      "adm_viol_tech"  ~ "Technical Violation",
      "adm_oth"        ~ "Other"
    )
  )

Notice that you can add or modify more than one column at a time within mutate() by separating the columns with a comma. The code above first creates the new total column and then recodes the adm_type column. Print the new data frame to see the new and modified columns.

nps_admit_recoded

# A tibble: 11,475 × 7
    year state_name state_abbr adm_type                  m     f total
   <dbl> <chr>      <chr>      <chr>                 <dbl> <dbl> <dbl>
 1  1978 Alabama    AL         Total Admissions       2631   184  2815
 2  1978 Alabama    AL         New Commitment         2115   148  2263
 3  1978 Alabama    AL         New Offense Violation     0     0     0
 4  1978 Alabama    AL         Technical Violation     150     5   155
 5  1978 Alabama    AL         Other                   155    31   186
 6  1979 Alabama    AL         Total Admissions       2596   223  2819
 7  1979 Alabama    AL         New Commitment         2314   178  2492
 8  1979 Alabama    AL         New Offense Violation     0     0     0
 9  1979 Alabama    AL         Technical Violation      68     2    70
10  1979 Alabama    AL         Other                   214    43   257
# ℹ 11,465 more rows

There are many other ways to modify columns that you’ll learn about in future lessons, but generally all of these techniques are used with the same mutate() pattern to alter a data frame.

Resources

R for Data Science (2e), Chapter 3 Data transformation