Lesson 17: Strings

Introduction

You’ve heard a lot about character or string columns so far. This lesson goes deeper on strings and demonstrates a series of tips and tricks you can use when working with them. You’ll learn how to create a new string from existing columns, create subsets of strings, change the case of strings, filter rows based on the contents of strings, and finally convert columns between character and numeric type.

Most of the functions that you’ll use in the lesson come from the stringr R package. You don’t need to separately install or attach stringr because it’s part of the tidyverse.

You’ll use the arrest data that you worked with in the previous lesson again in this lesson. Read it in from the web with the following code.

library(tidyverse)

arrests_file_url <- "https://github.com/CSGJusticeCenter/va_data/raw/refs/heads/main/courses/intro_r/arrests.csv"
arrests <- read_csv(arrests_file_url)

Create a new string

One common string manipulation technique is creating a new string from existing columns. In this case, imagine you wanted to create a combined column with the charge and severity, so at a glance you can get a sense of the seriousness of the arrest. There is an arrest_type column, which is either F (felony), M (misdemeanor), or I (infraction), and a charge_group column, which is one of 22 charge groups. To combine these two columns into a new column, use the paste0() function and specify the order and additional characters you want in the column. In this case, create a column called charge_group_type that is a string with the pattern charge_group [arrest_type].

arrests |> 
  mutate(charge_group_type = paste0(charge_group, " [", arrest_type, "]")) |> 
  select(charge_group, arrest_type, charge_group_type)

# A tibble: 1,000 × 3
   charge_group        arrest_type charge_group_type      
   <chr>               <chr>       <chr>                  
 1 Weapon (carry/poss) F           Weapon (carry/poss) [F]
 2 Prostitution/Allied M           Prostitution/Allied [M]
 3 Other Assaults      M           Other Assaults [M]     
 4 Aggravated Assault  F           Aggravated Assault [F] 
 5 Vehicle Theft       F           Vehicle Theft [F]      
 6 Narcotic Drug Laws  F           Narcotic Drug Laws [F] 
 7 Rape                F           Rape [F]               
 8 Vehicle Theft       F           Vehicle Theft [F]      
 9 Aggravated Assault  F           Aggravated Assault [F] 
10 Homicide            F           Homicide [F]           
# ℹ 990 more rows

Notice that in paste0(), you must enclose any additional characters in quotes (""), including spaces, and you refer to the existing columns by their names alone.

Subset a string

Instead of combining multiple strings, you might want to subset or extract certain portions of a string. One way to do this is by using the position of the string. Consider the arrest_time column, which contains strings in the hh mm format, where the first two digits are the hour, then a space, and the last two digits are the minute. To create a new arrest_hour column, subset the first two characters of the arrest_time column. To create an arrest_minute column, subset the fourth and fifth characters of arrest_time.

arrests |> 
  mutate(
    arrest_hour = str_sub(arrest_time, start = 1, end = 2),
    arrest_minute = str_sub(arrest_time, start = 4, end = 5)
  ) |> 
  select(arrest_time, arrest_hour, arrest_minute)

# A tibble: 1,000 × 3
   arrest_time arrest_hour arrest_minute
   <chr>       <chr>       <chr>        
 1 14 10       14          10           
 2 20 30       20          30           
 3 10 00       10          00           
 4 17 00       17          00           
 5 00 30       00          30           
 6 20 30       20          30           
 7 22 00       22          00           
 8 03 15       03          15           
 9 23 50       23          50           
10 00 47       00          47           
# ℹ 990 more rows

Note the start and end parameters of str_sub(), which specify that all characters between the start and end position should be retained.

If the column you want to extract from is not formatted in a consistent way so that you can use positions, as in the example above, you can use the str_extract() function in combination with regular expressions to extract matched patterns from a string. This is a more advanced topic and will not be covered in this course.

Change the case of a string

There are a few convenience functions that help you change the case of a string. This is something you might want to do when cleaning up data for presentation or visualization. str_to_lower() changes a string to lower case, and str_to_sentence() changes a string to sentence case (first letter capitalized).

arrests |> 
  mutate(
    arrest_type_lower = str_to_lower(arrest_type),
    charge_group_sentence = str_to_sentence(charge_group)
    ) |> 
  select(arrest_type, arrest_type_lower, charge_group, charge_group_sentence)

# A tibble: 1,000 × 4
   arrest_type arrest_type_lower charge_group        charge_group_sentence
   <chr>       <chr>             <chr>               <chr>                
 1 F           f                 Weapon (carry/poss) Weapon (carry/poss)  
 2 M           m                 Prostitution/Allied Prostitution/allied  
 3 M           m                 Other Assaults      Other assaults       
 4 F           f                 Aggravated Assault  Aggravated assault   
 5 F           f                 Vehicle Theft       Vehicle theft        
 6 F           f                 Narcotic Drug Laws  Narcotic drug laws   
 7 F           f                 Rape                Rape                 
 8 F           f                 Vehicle Theft       Vehicle theft        
 9 F           f                 Aggravated Assault  Aggravated assault   
10 F           f                 Homicide            Homicide             
# ℹ 990 more rows

str_to_upper() and str_to_title() work the same way and change strings to upper and title case, respectively.

Filter using string matching

In Lesson 8, you learned how to filter rows of a data frame based on values in columns. For example, you can keep all arrests for vehicle thefts by filtering for rows where charge_group is equal to "Vehicle Theft".

arrests |> 
  filter(charge_group == "Vehicle Theft") |> 
  select(arrest_id, arrest_date, charge_group)

# A tibble: 75 × 3
   arrest_id arrest_date charge_group 
       <dbl> <chr>       <chr>        
 1         5 11/30/2021  Vehicle Theft
 2         8 4/27/2021   Vehicle Theft
 3        43 4/17/2021   Vehicle Theft
 4        57 9/9/2021    Vehicle Theft
 5        65 10/8/2020   Vehicle Theft
 6        86 7/7/2024    Vehicle Theft
 7        87 7/8/2021    Vehicle Theft
 8        94 3/24/2024   Vehicle Theft
 9       130 3/10/2021   Vehicle Theft
10       144 4/7/2020    Vehicle Theft
# ℹ 65 more rows

But you can also filter on partial or non-exact matches. For instance, you can keep all assault arrests with the following code.

arrests |> 
  filter(str_detect(charge_group, "Assault")) |> 
  select(arrest_id, arrest_date, charge_group)

# A tibble: 291 × 3
   arrest_id arrest_date charge_group      
       <dbl> <chr>       <chr>             
 1         3 5/1/2023    Other Assaults    
 2         4 10/19/2021  Aggravated Assault
 3         9 1/14/2020   Aggravated Assault
 4        11 3/9/2020    Aggravated Assault
 5        12 5/5/2022    Aggravated Assault
 6        20 7/30/2022   Aggravated Assault
 7        21 10/12/2024  Aggravated Assault
 8        22 4/27/2020   Aggravated Assault
 9        25 3/24/2020   Other Assaults    
10        26 6/29/2022   Aggravated Assault
# ℹ 281 more rows

Notice that there are rows where the charge_group is Other Assault and other rows where it is Aggravated Assault. By using str_detect(charge_group, "Assault"), you are finding rows where the word "Assault" appears anywhere in the string.

A variation on this type of string matching is finding strings that start (or end) with a certain set of characters. For instance, keep all rows where the charge_group column starts with the letters Di.

arrests |> 
  filter(str_starts(charge_group, "Di")) |> 
  select(arrest_id, arrest_date, charge_group)

# A tibble: 4 × 3
  arrest_id arrest_date charge_group        
      <dbl> <chr>       <chr>               
1       228 8/2/2020    Disorderly Conduct  
2       274 3/6/2021    Disturbing the Peace
3       369 4/29/2023   Disorderly Conduct  
4       699 12/8/2021   Disorderly Conduct

Alternately, keep all rows where the last letter of charge_group is a y.

arrests |> 
  filter(str_ends(charge_group, "y")) |> 
  select(arrest_id, arrest_date, charge_group)

# A tibble: 125 × 3
   arrest_id arrest_date charge_group
       <dbl> <chr>       <chr>       
 1        14 5/18/2021   Robbery     
 2        17 9/12/2024   Larceny     
 3        39 8/10/2022   Burglary    
 4        41 1/30/2021   Robbery     
 5        44 5/4/2021    Robbery     
 6        46 6/3/2020    Burglary    
 7        61 5/13/2020   Robbery     
 8        67 4/14/2021   Larceny     
 9        73 7/29/2022   Robbery     
10        76 7/14/2024   Robbery     
# ℹ 115 more rows

It’s important to remember that these functions are case-sensitive, so if you look for strings that end with a capital Y, you won’t return any rows.

arrests |> 
  filter(str_ends(charge_group, "Y")) |> 
  select(arrest_id, arrest_date, charge_group)

# A tibble: 0 × 3
# ℹ 3 variables: arrest_id <dbl>, arrest_date <chr>, charge_group <chr>

Convert between character and numeric columns

Lastly, you can convert character columns to numeric and numeric columns to characters. In the last lesson, you learned how to convert character columns into a specific kind of numeric column, a date. But you can also convert a character into a generic number. First, use the subset code from above to create a new arrest_hour column.

arrests_with_hour <- arrests |> 
  mutate(arrest_hour = str_sub(arrest_time, start = 1, end = 2)) |> 
  select(arrest_time, arrest_hour)

arrests_with_hour

# A tibble: 1,000 × 2
   arrest_time arrest_hour
   <chr>       <chr>      
 1 14 10       14         
 2 20 30       20         
 3 10 00       10         
 4 17 00       17         
 5 00 30       00         
 6 20 30       20         
 7 22 00       22         
 8 03 15       03         
 9 23 50       23         
10 00 47       00         
# ℹ 990 more rows

Notice that the new arrest_hour column is a character, not a number. To convert that column to a number, use the as.numeric() function.

arrests_with_hour |> 
  mutate(arrest_hour_num = as.numeric(arrest_hour))

# A tibble: 1,000 × 3
   arrest_time arrest_hour arrest_hour_num
   <chr>       <chr>                 <dbl>
 1 14 10       14                       14
 2 20 30       20                       20
 3 10 00       10                       10
 4 17 00       17                       17
 5 00 30       00                        0
 6 20 30       20                       20
 7 22 00       22                       22
 8 03 15       03                        3
 9 23 50       23                       23
10 00 47       00                        0
# ℹ 990 more rows

Now, arrest_hour_num is stored as a numeric (double) variable, and you can use that column for math and other operations that require numeric data. This is important because, if for some reason, you wanted to calculate the average of the arrest hour column, it won’t work for the arrest hour that is stored as a character, but it will work for the arrest hour that is stored as a number.

arrests_with_hour |> 
  mutate(arrest_hour_num = as.numeric(arrest_hour)) |> 
  summarize(
    mean_chr = mean(arrest_hour),
    mean_num = mean(arrest_hour_num)
    )

Warning: There was 1 warning in `summarize()`.
ℹ In argument: `mean_chr = mean(arrest_hour)`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA

# A tibble: 1 × 2
  mean_chr mean_num
     <dbl>    <dbl>
1       NA     13.5

Notice the warning that is displayed indicating that the argument is not numeric or logical and so NA is returned. NA represents a missing value in R; you’ll learn about missing values in the next lesson.

You can also convert numeric columns to characters. Currently, the arrest_id column is stored as a number, but if you want to represent that as a character instead, you can use as.character() to convert it.

arrests |> 
  mutate(arrest_id_chr = as.character(arrest_id)) |> 
  select(arrest_id, arrest_id_chr)

# A tibble: 1,000 × 2
   arrest_id arrest_id_chr
       <dbl> <chr>        
 1         1 1            
 2         2 2            
 3         3 3            
 4         4 4            
 5         5 5            
 6         6 6            
 7         7 7            
 8         8 8            
 9         9 9            
10        10 10           
# ℹ 990 more rows

Knowing how to manipulate strings is an essential part of data cleaning, and this lesson showed you tips and tricks for working with character variables in R.

Resources

R for Data Science (2e), Chapter 14 Strings: https://r4ds.hadley.nz/strings.html
R for Data Science (2e), Chapter 15 Regular expressions: https://r4ds.hadley.nz/regexps
stringr R package website: https://stringr.tidyverse.org/