Lesson 16: Dates and Times

Introduction

The data that you’ve worked with in this course so far has been either numbers or text. This lesson focuses on two special kinds of numeric data: dates and times. While dates and times appear relatively straightforward, there are all sorts of complexities that must be considered when programming or doing math with dates and times. For example, how many days are in a month? Or a year? How about time zones and daylight-saving time? This lesson won’t get into all these topics, but it’s important to be aware of the many possible edge cases that can impact date and time data.

Most of the functions that you’ll use in this lesson come from the lubridate R package. You don’t need to separately install or attach lubridate because it is part of the tidyverse.

Chapter 17 of R for Data Science discusses more of these complexities.

The NPS data that you’ve been working with in this course so far doesn’t have dates and times, so for this lesson, you’ll use arrest data. This is a modified and simplified version of data about arrests released publicly by a police department. Instead of saving this data locally and reading it in as you’ve done with the NPS data, read this data directly into your R session from a CSV file saved on the web.

library(tidyverse)

arrests_file_url <- "https://github.com/CSGJusticeCenter/va_data/raw/refs/heads/main/courses/intro_r/arrests.csv"
arrests <- read_csv(arrests_file_url)

This is a convenience feature of read_csv() that allows you to provide the URL of the CSV file, and read_csv() will download that file to a temporary location on your computer, read it in to R, and then delete the temp file. This can be especially useful with a file that changes frequently, so that you can read in the most recent version rather than manually downloading and saving the file from the web.

Now that you have the arrests data in your R session, do some quick exploration.

arrests

# A tibble: 1,000 × 9
   arrest_id arrest_date arrest_time booking_year booking_month booking_day
       <dbl> <chr>       <chr>              <dbl> <chr>               <dbl>
 1         1 2/4/2020    14 10               2020 February                4
 2         2 6/28/2022   20 30               2022 June                   28
 3         3 5/1/2023    10 00               2023 May                     1
 4         4 10/19/2021  17 00               2021 October                19
 5         5 11/30/2021  00 30               2021 November               30
 6         6 1/8/2022    20 30               2022 January                 8
 7         7 5/28/2022   22 00               2022 May                    29
 8         8 4/27/2021   03 15               2021 April                  27
 9         9 1/14/2020   23 50               2020 January                15
10        10 3/29/2021   00 47               2021 March                  29
# ℹ 990 more rows
# ℹ 3 more variables: booking_time <chr>, arrest_type <chr>, charge_group <chr>

How many rows and columns are there? What data types are those columns stored as? How many arrest types are there?

arrests |> 
  count(arrest_type)

# A tibble: 3 × 2
  arrest_type     n
  <chr>       <int>
1 F             624
2 I               1
3 M             375

What are the most frequent charge groups?

arrests |> 
  count(charge_group, sort = TRUE)

# A tibble: 22 × 2
   charge_group                       n
   <chr>                          <int>
 1 Aggravated Assault               175
 2 Miscellaneous Other Violations   152
 3 Other Assaults                   116
 4 Narcotic Drug Laws                85
 5 Driving Under Influence           83
 6 Weapon (carry/poss)               77
 7 Vehicle Theft                     75
 8 Larceny                           45
 9 Robbery                           39
10 Burglary                          33
# ℹ 12 more rows

Dates

You might have noticed that the arrest_date column is stored as a character or string variable (chr). While it looks like a date in the data preview, it doesn’t act like a date. Try to sort by the arrest date column to see what’s going on:

arrests |> 
  arrange(arrest_date)

# A tibble: 1,000 × 9
   arrest_id arrest_date arrest_time booking_year booking_month booking_day
       <dbl> <chr>       <chr>              <dbl> <chr>               <dbl>
 1       891 1/1/2022    03 55               2022 January                 1
 2       166 1/1/2023    02 45               2023 January                 1
 3       403 1/1/2024    01 30               2024 January                 1
 4        83 1/10/2020   11 45               2020 January                10
 5       800 1/10/2022   08 00               2022 January                10
 6       770 1/11/2020   19 30               2020 January                11
 7       797 1/11/2020   23 40               2020 January                12
 8       293 1/11/2021   07 30               2021 January                11
 9       377 1/13/2021   17 00               2021 January                13
10       315 1/13/2022   11 22               2022 January                13
# ℹ 990 more rows
# ℹ 3 more variables: booking_time <chr>, arrest_type <chr>, charge_group <chr>

If this column sorted correctly as a date, all the 2020 arrests would be before any arrests in 2021 or later. To get these dates to act like dates, you need to convert the arrest_date column from a character variable into a date variable. Use the mdy() function to convert the character to a date and specify that the data in the arrest_date column is stored in the month/day/year.

arrests |> 
  mutate(arrest_date = mdy(arrest_date)) |> 
  arrange(arrest_date)

# A tibble: 1,000 × 9
   arrest_id arrest_date arrest_time booking_year booking_month booking_day
       <dbl> <date>      <chr>              <dbl> <chr>               <dbl>
 1       366 2020-01-02  11 45               2020 January                 2
 2       647 2020-01-03  07 40               2020 January                 3
 3       239 2020-01-04  22 40               2020 January                 5
 4       586 2020-01-04  01 10               2020 January                 4
 5       717 2020-01-04  18 00               2020 January                 4
 6       990 2020-01-04  14 45               2020 January                 4
 7       878 2020-01-05  09 00               2020 January                 5
 8       116 2020-01-06  22 10               2020 January                 7
 9       820 2020-01-08  13 45               2020 January                 8
10       978 2020-01-08  02 00               2020 January                 8
# ℹ 990 more rows
# ℹ 3 more variables: booking_time <chr>, arrest_type <chr>, charge_group <chr>

Now, you can see arrest_date is a date column type and the data frame is now sorted in correct chronological order.

Dates from multiple columns

For the booking date, there are three separate columns: booking_year, booking_month, and booking_day. Year and day are numbers, and the month is written out in text. To create a booking_date column from these three columns, you can use a process similar to how you made arrest_date with a couple of small additions. First, use the paste() function to combine booking_year, booking_month, and booking_day into a single string and then use ymd() to create the date. Use ymd() (year, month, day) instead of mdy() (month, day, year) because in this case, the year comes first in the character you’re converting to a date.

arrests |> 
  mutate(booking_date = ymd(paste(booking_year, booking_month, booking_day))) |> 
  select(starts_with("booking"))

# A tibble: 1,000 × 5
   booking_year booking_month booking_day booking_time booking_date
          <dbl> <chr>               <dbl> <chr>        <date>      
 1         2020 February                4 16 44        2020-02-04  
 2         2022 June                   28 22 36        2022-06-28  
 3         2023 May                     1 10 39        2023-05-01  
 4         2021 October                19 21 27        2021-10-19  
 5         2021 November               30 02 37        2021-11-30  
 6         2022 January                 8 22 16        2022-01-08  
 7         2022 May                    29 09 47        2022-05-29  
 8         2021 April                  27 06 10        2021-04-27  
 9         2020 January                15 00 53        2020-01-15  
10         2021 March                  29 04 48        2021-03-29  
# ℹ 990 more rows

Generally, mdy(), ymd(), and dmy() are pretty good at parsing whatever input you throw at them. But if you’re having trouble converting text to dates with these functions, you may need to adjust or manipulate the input string.

Create a new data frame with arrest and booking dates converted to date columns and remove the booking year, month, and day columns.

arrests_date_clean <- arrests |> 
  mutate(
    arrest_date = mdy(arrest_date),
    booking_date = ymd(paste(booking_year, booking_month, booking_day))
    ) |> 
  select(-c(booking_year, booking_month, booking_day))

Extract components of dates

One useful thing you can do with a date column is extract components from the date. For instance, if you want to create new arrest_year and arrest_month columns, you can do that with the year() and month() functions.

arrests_date_clean |> 
  mutate(
    arrest_year = year(arrest_date),
    arrest_month = month(arrest_date)
    ) |> 
  select(starts_with("arrest"))

# A tibble: 1,000 × 6
   arrest_id arrest_date arrest_time arrest_type arrest_year arrest_month
       <dbl> <date>      <chr>       <chr>             <dbl>        <dbl>
 1         1 2020-02-04  14 10       F                  2020            2
 2         2 2022-06-28  20 30       M                  2022            6
 3         3 2023-05-01  10 00       M                  2023            5
 4         4 2021-10-19  17 00       F                  2021           10
 5         5 2021-11-30  00 30       F                  2021           11
 6         6 2022-01-08  20 30       F                  2022            1
 7         7 2022-05-28  22 00       F                  2022            5
 8         8 2021-04-27  03 15       F                  2021            4
 9         9 2020-01-14  23 50       F                  2020            1
10        10 2021-03-29  00 47       F                  2021            3
# ℹ 990 more rows

You can also extract the month as a name, rather than a number, by setting label = TRUE.

arrests_date_clean |> 
  mutate(
    arrest_year = year(arrest_date),
    arrest_month = month(arrest_date, label = TRUE, abbr = FALSE)
    ) |> 
  select(starts_with("arrest"))

# A tibble: 1,000 × 6
   arrest_id arrest_date arrest_time arrest_type arrest_year arrest_month
       <dbl> <date>      <chr>       <chr>             <dbl> <ord>       
 1         1 2020-02-04  14 10       F                  2020 February    
 2         2 2022-06-28  20 30       M                  2022 June        
 3         3 2023-05-01  10 00       M                  2023 May         
 4         4 2021-10-19  17 00       F                  2021 October     
 5         5 2021-11-30  00 30       F                  2021 November    
 6         6 2022-01-08  20 30       F                  2022 January     
 7         7 2022-05-28  22 00       F                  2022 May         
 8         8 2021-04-27  03 15       F                  2021 April       
 9         9 2020-01-14  23 50       F                  2020 January     
10        10 2021-03-29  00 47       F                  2021 March       
# ℹ 990 more rows

Date-times

The arrest dataset also contains columns for arrest time and booking time. In addition to the date data type, R also can store date-times, and there are functions that help create date-times similar to the date helpers. In this case, the time column we have represents hours and minutes (no seconds), so you can use ymd_hm() (year, month, day, hour, minute) to create the date-time columns.

arrests_date_clean |> 
  mutate(
    arrest_date_time = ymd_hm(paste(arrest_date, arrest_time)),
    booking_date_time = ymd_hm(paste(booking_date, booking_time)),
    ) |> 
  select(arrest_date, arrest_time, arrest_date_time,
         booking_date, booking_time, booking_date_time)

# A tibble: 1,000 × 6
   arrest_date arrest_time arrest_date_time    booking_date booking_time
   <date>      <chr>       <dttm>              <date>       <chr>       
 1 2020-02-04  14 10       2020-02-04 14:10:00 2020-02-04   16 44       
 2 2022-06-28  20 30       2022-06-28 20:30:00 2022-06-28   22 36       
 3 2023-05-01  10 00       2023-05-01 10:00:00 2023-05-01   10 39       
 4 2021-10-19  17 00       2021-10-19 17:00:00 2021-10-19   21 27       
 5 2021-11-30  00 30       2021-11-30 00:30:00 2021-11-30   02 37       
 6 2022-01-08  20 30       2022-01-08 20:30:00 2022-01-08   22 16       
 7 2022-05-28  22 00       2022-05-28 22:00:00 2022-05-29   09 47       
 8 2021-04-27  03 15       2021-04-27 03:15:00 2021-04-27   06 10       
 9 2020-01-14  23 50       2020-01-14 23:50:00 2020-01-15   00 53       
10 2021-03-29  00 47       2021-03-29 00:47:00 2021-03-29   04 48       
# ℹ 990 more rows
# ℹ 1 more variable: booking_date_time <dttm>

To make this work, use paste() again to combine the existing arrest_date and arrest_time columns and then convert that string to date-time with ymd_hm().

arrests_date_time_clean <- arrests_date_clean |> 
  mutate(
    arrest_date_time = ymd_hm(paste(arrest_date, arrest_time)),
    booking_date_time = ymd_hm(paste(booking_date, booking_time))
    ) |> 
  select(-c(arrest_time, booking_time))

Math with dates and times

Now that you have the dates and times in a structured format, you can use those columns to perform calculations. A common calculation is determining the number of days between two dates. Simple subtraction works to find the number of days from arrest to booking.

arrests_date_time_clean |> 
  mutate(arrest_to_booking_day = booking_date - arrest_date) |> 
  select(arrest_date, booking_date, arrest_to_booking_day)

# A tibble: 1,000 × 3
   arrest_date booking_date arrest_to_booking_day
   <date>      <date>       <drtn>               
 1 2020-02-04  2020-02-04   0 days               
 2 2022-06-28  2022-06-28   0 days               
 3 2023-05-01  2023-05-01   0 days               
 4 2021-10-19  2021-10-19   0 days               
 5 2021-11-30  2021-11-30   0 days               
 6 2022-01-08  2022-01-08   0 days               
 7 2022-05-28  2022-05-29   1 days               
 8 2021-04-27  2021-04-27   0 days               
 9 2020-01-14  2020-01-15   1 days               
10 2021-03-29  2021-03-29   0 days               
# ℹ 990 more rows

Most bookings take place on the same day as an arrest, so a more interesting calculation to make is the amount of time between arrest and booking. Subtract arrest_date_time from booking_date_time to calculate this.

arrests_date_time_clean |> 
  mutate(arrest_to_booking_min = booking_date_time - arrest_date_time) |> 
  select(arrest_date_time, booking_date_time, arrest_to_booking_min)

# A tibble: 1,000 × 3
   arrest_date_time    booking_date_time   arrest_to_booking_min
   <dttm>              <dttm>              <drtn>               
 1 2020-02-04 14:10:00 2020-02-04 16:44:00 154 mins             
 2 2022-06-28 20:30:00 2022-06-28 22:36:00 126 mins             
 3 2023-05-01 10:00:00 2023-05-01 10:39:00  39 mins             
 4 2021-10-19 17:00:00 2021-10-19 21:27:00 267 mins             
 5 2021-11-30 00:30:00 2021-11-30 02:37:00 127 mins             
 6 2022-01-08 20:30:00 2022-01-08 22:16:00 106 mins             
 7 2022-05-28 22:00:00 2022-05-29 09:47:00 707 mins             
 8 2021-04-27 03:15:00 2021-04-27 06:10:00 175 mins             
 9 2020-01-14 23:50:00 2020-01-15 00:53:00  63 mins             
10 2021-03-29 00:47:00 2021-03-29 04:48:00 241 mins             
# ℹ 990 more rows

There seems to be more variation in the number of minutes between arrest and booking than there was in the number of days. Create a dataset with this new variable so that you can do some deeper exploration.

arrests_calc <- arrests_date_time_clean |> 
  mutate(arrest_to_booking_min = booking_date_time - arrest_date_time)

As usual, the first thing to do is look at your data! A good way to visualize the time from arrest to booking is with a histogram, which will show the distribution of time in the dataset.

arrests_calc |>
  ggplot(aes(arrest_to_booking_min)) +
  geom_histogram(binwidth = 15, color = "white")

The distribution is skewed right, and the most common amount of time between arrest and booking is around three hours. But, it’s probably the case that there is more or less time between arrest and booking for certain types of arrests. There is an arrest_type column in the dataset, so you can use group_by() + summarize() to calculate the mean and median for each arrest type.

arrests_calc |> 
  group_by(arrest_type) |> 
  summarize(
    mean = mean(arrest_to_booking_min),
    median = median(arrest_to_booking_min)
  )

# A tibble: 3 × 3
  arrest_type mean          median  
  <chr>       <drtn>        <drtn>  
1 F           200.4808 mins 170 mins
2 I           129.0000 mins 129 mins
3 M           166.4533 mins 148 mins

It looks like on average, there is slightly more time between arrest and booking for felony arrests than for misdemeanors or infractions. Another factor that may impact time to booking is the offense type. Look at the same summary statistics, but this time group by charge_group. Also, arrange the summarized dataset by mean so that you can see the charges with the most time between arrest and booking.

arrests_calc |> 
  group_by(charge_group) |> 
  summarize(
    mean = mean(arrest_to_booking_min),
    median = median(arrest_to_booking_min),
    ) |> 
  arrange(desc(mean))

# A tibble: 22 × 3
   charge_group            mean          median    
   <chr>                   <drtn>        <drtn>    
 1 Rape                    364.1250 mins 347.5 mins
 2 Homicide                335.1000 mins 349.0 mins
 3 Receive Stolen Property 249.1250 mins 190.0 mins
 4 Fraud/Embezzlement      212.8000 mins 194.0 mins
 5 Aggravated Assault      205.1714 mins 170.0 mins
 6 Weapon (carry/poss)     198.7013 mins 162.0 mins
 7 Against Family/Child    198.5000 mins 164.0 mins
 8 Narcotic Drug Laws      194.0471 mins 165.0 mins
 9 Robbery                 193.2308 mins 172.0 mins
10 Prostitution/Allied     188.4375 mins 162.5 mins
# ℹ 12 more rows

Rape and homicide are the offenses with the most time to booking, which makes sense because these are likely the most complex cases and may require additional processing or casework.

Resources

R for Data Science (2e), Chapter 17 Dates and times: https://r4ds.hadley.nz/datetimes
lubridate R package website: https://lubridate.tidyverse.org/