Lesson 17: Strings
Introduction
You’ve heard a lot about character (string) columns so far. This lesson goes deeper on strings and demonstrates a series of tips and tricks you can use when working with them. You’ll learn how to create a new string from existing columns, create subsets of strings, change the case of strings, filter rows based on the contents of strings, and finally convert columns between character and numeric type.
Most of the functions that you’ll use in this lesson come from the stringr R package. You don’t need to install or attach stringr separately because it’s part of the tidyverse.
You’ll again use the arrest data that you worked with in the previous lesson. Read it in from the web with the following code.
library(tidyverse)
arrests_file_url <- "https://github.com/CSGJusticeCenter/va_data/raw/refs/heads/main/courses/intro_r/arrests.csv"
arrests <- read_csv(arrests_file_url)
Create a new string
One common string manipulation technique is creating a new string from existing columns. In this case, imagine you wanted to create a combined column with the charge and severity, so at a glance you can get a sense of the seriousness of the arrest. There is an arrest_type column, which is either F (felony), M (misdemeanor), or I (infraction), and a charge_group column, which is one of 22 charge groups. To combine these two columns into a new column, use the paste0() function and specify the order and additional characters you want in the column. In this case, create a column called charge_group_type that is a string with the pattern charge_group [arrest_type].
arrests |>
  mutate(charge_group_type = paste0(charge_group, " [", arrest_type, "]")) |>
  select(charge_group, arrest_type, charge_group_type)
# A tibble: 1,000 × 3
charge_group arrest_type charge_group_type
<chr> <chr> <chr>
1 Weapon (carry/poss) F Weapon (carry/poss) [F]
2 Prostitution/Allied M Prostitution/Allied [M]
3 Other Assaults M Other Assaults [M]
4 Aggravated Assault F Aggravated Assault [F]
5 Vehicle Theft F Vehicle Theft [F]
6 Narcotic Drug Laws F Narcotic Drug Laws [F]
7 Rape F Rape [F]
8 Vehicle Theft F Vehicle Theft [F]
9 Aggravated Assault F Aggravated Assault [F]
10 Homicide F Homicide [F]
# ℹ 990 more rows
Notice that in paste0(), you must enclose any additional characters in quotes (""), including spaces, and you refer to the existing columns by their names alone.
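To see what paste0() does outside of a data frame, here is a minimal sketch on two short vectors. The values are made up for illustration and are not read from the arrests file. paste() is included only for contrast, because it inserts a space between the pieces by default.

```r
# Illustrative vectors (hypothetical values, not read from the arrests data)
charge_group <- c("Vehicle Theft", "Robbery")
arrest_type <- c("F", "F")

# paste0() joins the pieces with nothing in between, so the space and
# the brackets must be supplied as quoted strings
combined <- paste0(charge_group, " [", arrest_type, "]")
print(combined)

# paste() puts a single space between the pieces unless sep says otherwise
spaced <- paste(charge_group, arrest_type)
print(spaced)
```

Inside mutate(), the same call works on whole columns because paste0() is vectorized: it builds one combined string per row.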
Subset a string
Instead of combining multiple strings, you might want to subset or extract certain portions of a string. One way to do this is by the position of the characters within the string. Consider the arrest_time column, which contains strings in the hh mm format, where the first two digits are the hour, then a space, and the last two digits are the minute. To create a new arrest_hour column, subset the first two characters of the arrest_time column. To create an arrest_minute column, subset the fourth and fifth characters of arrest_time.
arrests |>
  mutate(
    arrest_hour = str_sub(arrest_time, start = 1, end = 2),
    arrest_minute = str_sub(arrest_time, start = 4, end = 5)
  ) |>
  select(arrest_time, arrest_hour, arrest_minute)
# A tibble: 1,000 × 3
   arrest_time arrest_hour arrest_minute
   <chr>       <chr>       <chr>
 1 14 10       14          10
 2 20 30       20          30
 3 10 00       10          00
 4 17 00       17          00
 5 00 30       00          30
 6 20 30       20          30
 7 22 00       22          00
 8 03 15       03          15
 9 23 50       23          50
10 00 47       00          47
# ℹ 990 more rows
Note the start and end arguments of str_sub(), which give the positions of the first and last characters to keep. Both positions are inclusive, so start = 1 and end = 2 keeps the first two characters.
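As a small sketch of how the positions behave, the following runs str_sub() on two made-up time strings rather than on the arrests data. It also shows negative positions, which count from the end of the string; that goes slightly beyond what the lesson uses and is included only as an illustration.

```r
library(stringr)

# Illustrative times in the same "hh mm" layout (hypothetical values)
arrest_time <- c("14 10", "03 15")

# Positions 1 and 2 hold the hour; both ends are included
hours <- str_sub(arrest_time, start = 1, end = 2)

# Negative positions count from the end, so -2 to -1 is the last two characters
minutes <- str_sub(arrest_time, start = -2, end = -1)

print(hours)
print(minutes)
```

Counting from the end can be convenient when the part you want always sits at the end of the string, even if the beginning varies in length.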
If the column you want to extract from is not formatted in a consistent way so that you can use positions, as in the example above, you can use the str_extract() function in combination with regular expressions to extract matched patterns from a string. This is a more advanced topic and will not be covered in this course.
Change the case of a string
There are a few convenience functions that help you change the case of a string. This is something you might want to do when cleaning up data for presentation or visualization. str_to_lower() changes a string to lower case, and str_to_sentence() changes a string to sentence case (first letter capitalized).
arrests |>
  mutate(
    arrest_type_lower = str_to_lower(arrest_type),
    charge_group_sentence = str_to_sentence(charge_group)
  ) |>
  select(arrest_type, arrest_type_lower, charge_group, charge_group_sentence)
# A tibble: 1,000 × 4
arrest_type arrest_type_lower charge_group charge_group_sentence
<chr> <chr> <chr> <chr>
1 F f Weapon (carry/poss) Weapon (carry/poss)
2 M m Prostitution/Allied Prostitution/allied
3 M m Other Assaults Other assaults
4 F f Aggravated Assault Aggravated assault
5 F f Vehicle Theft Vehicle theft
6 F f Narcotic Drug Laws Narcotic drug laws
7 F f Rape Rape
8 F f Vehicle Theft Vehicle theft
9 F f Aggravated Assault Aggravated assault
10 F f Homicide Homicide
# ℹ 990 more rows
str_to_upper() and str_to_title() work the same way and change strings to upper and title case, respectively.
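For comparison, here is a minimal sketch of the other two case functions applied to a single string. The input strings are illustrative and typed in by hand, not pulled from the arrests data.

```r
library(stringr)

# Upper case: every letter is capitalized
upper <- str_to_upper("Narcotic Drug Laws")

# Title case: the first letter of each word is capitalized
title <- str_to_title("narcotic drug laws")

print(upper)
print(title)
```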
Filter using string matching
In Lesson 8, you learned how to filter rows of a data frame based on values in columns. For example, you can keep all arrests for vehicle thefts by filtering for rows where charge_group is equal to "Vehicle Theft".
arrests |>
  filter(charge_group == "Vehicle Theft") |>
  select(arrest_id, arrest_date, charge_group)
# A tibble: 75 × 3
arrest_id arrest_date charge_group
<dbl> <chr> <chr>
1 5 11/30/2021 Vehicle Theft
2 8 4/27/2021 Vehicle Theft
3 43 4/17/2021 Vehicle Theft
4 57 9/9/2021 Vehicle Theft
5 65 10/8/2020 Vehicle Theft
6 86 7/7/2024 Vehicle Theft
7 87 7/8/2021 Vehicle Theft
8 94 3/24/2024 Vehicle Theft
9 130 3/10/2021 Vehicle Theft
10 144 4/7/2020 Vehicle Theft
# ℹ 65 more rows
But you can also filter on partial or non-exact matches. For instance, you can keep all assault arrests with the following code.
arrests |>
  filter(str_detect(charge_group, "Assault")) |>
  select(arrest_id, arrest_date, charge_group)
# A tibble: 291 × 3
arrest_id arrest_date charge_group
<dbl> <chr> <chr>
1 3 5/1/2023 Other Assaults
2 4 10/19/2021 Aggravated Assault
3 9 1/14/2020 Aggravated Assault
4 11 3/9/2020 Aggravated Assault
5 12 5/5/2022 Aggravated Assault
6 20 7/30/2022 Aggravated Assault
7 21 10/12/2024 Aggravated Assault
8 22 4/27/2020 Aggravated Assault
9 25 3/24/2020 Other Assaults
10 26 6/29/2022 Aggravated Assault
# ℹ 281 more rows
Notice that the result includes rows where charge_group is Other Assaults and other rows where it is Aggravated Assault. By using str_detect(charge_group, "Assault"), you are finding rows where the text "Assault" appears anywhere in the string.
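It can help to see what str_detect() returns on its own: one TRUE or FALSE per string, which is exactly what filter() needs. The sketch below uses a short hand-typed vector in place of the charge_group column.

```r
library(stringr)

# Illustrative charge groups (hypothetical vector, not the full column)
charge_group <- c("Other Assaults", "Aggravated Assault", "Robbery")

# One logical value per element: TRUE where "Assault" appears anywhere
is_assault <- str_detect(charge_group, "Assault")
print(is_assault)

# Keeping only the TRUE elements mirrors what filter() does with rows
kept <- charge_group[is_assault]
print(kept)
```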
A variation on this type of string matching is finding strings that start (or end) with a certain set of characters. For instance, keep all rows where the charge_group column starts with the letters Di.
arrests |>
  filter(str_starts(charge_group, "Di")) |>
  select(arrest_id, arrest_date, charge_group)
# A tibble: 4 × 3
arrest_id arrest_date charge_group
<dbl> <chr> <chr>
1 228 8/2/2020 Disorderly Conduct
2 274 3/6/2021 Disturbing the Peace
3 369 4/29/2023 Disorderly Conduct
4 699 12/8/2021 Disorderly Conduct
Alternatively, keep all rows where the last letter of charge_group is a y.
arrests |>
  filter(str_ends(charge_group, "y")) |>
  select(arrest_id, arrest_date, charge_group)
# A tibble: 125 × 3
arrest_id arrest_date charge_group
<dbl> <chr> <chr>
1 14 5/18/2021 Robbery
2 17 9/12/2024 Larceny
3 39 8/10/2022 Burglary
4 41 1/30/2021 Robbery
5 44 5/4/2021 Robbery
6 46 6/3/2020 Burglary
7 61 5/13/2020 Robbery
8 67 4/14/2021 Larceny
9 73 7/29/2022 Robbery
10 76 7/14/2024 Robbery
# ℹ 115 more rows
It’s important to remember that these functions are case-sensitive, so if you look for strings that end with a capital Y, the filter returns no rows.
arrests |>
  filter(str_ends(charge_group, "Y")) |>
  select(arrest_id, arrest_date, charge_group)
# A tibble: 0 × 3
# ℹ 3 variables: arrest_id <dbl>, arrest_date <chr>, charge_group <chr>
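One possible way to match regardless of case is to convert the strings to a single case before testing them, using str_to_lower() from earlier in this lesson. This is a sketch of that idea on a made-up vector with mixed capitalization; the arrests data itself may not contain such values.

```r
library(stringr)

# Illustrative values with inconsistent capitalization (hypothetical)
charge_group <- c("Robbery", "LARCENY", "Homicide")

# Case-sensitive test: only a lowercase final "y" matches
exact_case <- str_ends(charge_group, "y")

# Lowercase everything first, then test, so "Y" and "y" both match
any_case <- str_ends(str_to_lower(charge_group), "y")

print(exact_case)
print(any_case)
```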
Convert between character and numeric columns
Lastly, you can convert character columns to numeric and numeric columns to character. In the last lesson, you learned how to convert character columns into a specific kind of numeric column, a date. But you can also convert a character column into an ordinary number. First, use the subset code from above to create a new arrest_hour column.
arrests_with_hour <- arrests |>
  mutate(arrest_hour = str_sub(arrest_time, start = 1, end = 2)) |>
  select(arrest_time, arrest_hour)
arrests_with_hour
# A tibble: 1,000 × 2
   arrest_time arrest_hour
   <chr>       <chr>
 1 14 10       14
 2 20 30       20
 3 10 00       10
 4 17 00       17
 5 00 30       00
 6 20 30       20
 7 22 00       22
 8 03 15       03
 9 23 50       23
10 00 47       00
# ℹ 990 more rows
Notice that the new arrest_hour column is a character, not a number. To convert that column to a number, use the as.numeric() function.
arrests_with_hour |>
  mutate(arrest_hour_num = as.numeric(arrest_hour))
# A tibble: 1,000 × 3
   arrest_time arrest_hour arrest_hour_num
   <chr>       <chr>                 <dbl>
 1 14 10       14                       14
 2 20 30       20                       20
 3 10 00       10                       10
 4 17 00       17                       17
 5 00 30       00                        0
 6 20 30       20                       20
 7 22 00       22                       22
 8 03 15       03                        3
 9 23 50       23                       23
10 00 47       00                        0
# ℹ 990 more rows
Now, arrest_hour_num is stored as a numeric (double) variable, and you can use that column for math and other operations that require numeric data. This matters because if you wanted to calculate the average arrest hour, mean() would not work on the column stored as a character, but it would work on the column stored as a number.
arrests_with_hour |>
  mutate(arrest_hour_num = as.numeric(arrest_hour)) |>
  summarize(
    mean_chr = mean(arrest_hour),
    mean_num = mean(arrest_hour_num)
  )
Warning: There was 1 warning in `summarize()`.
ℹ In argument: `mean_chr = mean(arrest_hour)`.
Caused by warning in `mean.default()`:
! argument is not numeric or logical: returning NA
# A tibble: 1 × 2
mean_chr mean_num
<dbl> <dbl>
1 NA 13.5
Notice the warning that is displayed indicating that the argument is not numeric or logical and so NA is returned. NA represents a missing value in R; you’ll learn about missing values in the next lesson.
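The conversion can also be tried on a plain vector. The sketch below uses three made-up hour strings and one deliberately non-numeric string ("noon", a hypothetical value) to show that text that does not look like a number is converted to NA. suppressWarnings() is used here only to keep the example output quiet.

```r
# Illustrative hour strings (hypothetical values)
arrest_hour <- c("14", "03", "00")

# Leading zeros are dropped once the text becomes a number
arrest_hour_num <- as.numeric(arrest_hour)
print(arrest_hour_num)

# The numeric version can be averaged
mean_hour <- mean(arrest_hour_num)
print(mean_hour)

# Text that cannot be read as a number becomes NA (normally with a warning)
not_a_number <- suppressWarnings(as.numeric("noon"))
print(not_a_number)
```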
You can also convert numeric columns to characters. Currently, the arrest_id column is stored as a number, but if you want to represent that as a character instead, you can use as.character() to convert it.
arrests |>
  mutate(arrest_id_chr = as.character(arrest_id)) |>
  select(arrest_id, arrest_id_chr)
# A tibble: 1,000 × 2
arrest_id arrest_id_chr
<dbl> <chr>
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
# ℹ 990 more rows
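As a brief sketch of why the character version can be useful, the following converts a few made-up ID numbers to text and then combines them with a label. The "Arrest #" prefix is a hypothetical example, not something taken from the arrests data.

```r
# Illustrative ID numbers (hypothetical values)
arrest_id <- c(1, 2, 10)

# Convert the numbers to text
arrest_id_chr <- as.character(arrest_id)
print(arrest_id_chr)

# Confirm the new vector is stored as character
print(is.character(arrest_id_chr))

# Once stored as text, the IDs can be combined with other strings
labels <- paste0("Arrest #", arrest_id_chr)
print(labels)
```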
Knowing how to manipulate strings is an essential part of data cleaning, and this lesson showed you tips and tricks for working with character variables in R.
Resources
- R for Data Science (2e), Chapter 14 Strings: https://r4ds.hadley.nz/strings.html
- R for Data Science (2e), Chapter 15 Regular expressions: https://r4ds.hadley.nz/regexps
- stringr R package website: https://stringr.tidyverse.org/