Lesson 18: Missing Values

Introduction

You’ve gotten a long way into this course without having to deal with something you probably run into every day: missing data! Because the examples in this course use cleaned-up versions of publicly available data, you’ve been able to focus on learning R functions and operations for data wrangling and visualization without worrying too much about the quality of the data. But data quality and availability are probably central to your real-life data work. This lesson will discuss how R handles missing values, how to check for missing data, and how to analyze data that has missing values.

For this lesson, you’ll need the NPS admissions file. Read it into R with the following code:

library(tidyverse)

nps_admit <- read_csv("data/nps-admissions.csv")

Check for missing values

When working with a new dataset, it’s important to look for missing data. A simple way to see how much data is missing is to use filter() to only keep rows where a certain column is missing. Use is.na() in combination with filter() to identify these rows.

nps_admit |> 
  filter(is.na(m))
# A tibble: 351 × 6
    year state_name state_abbr adm_type           m     f
   <dbl> <chr>      <chr>      <chr>          <dbl> <dbl>
 1  1983 Alabama    AL         adm_viol_tech     NA    NA
 2  1984 Alabama    AL         adm_viol_tech     NA    NA
 3  1985 Alabama    AL         adm_viol_tech     NA    NA
 4  1999 Alabama    AL         adm_new_commit    NA    NA
 5  1999 Alabama    AL         adm_viol_new      NA    NA
 6  1999 Alabama    AL         adm_viol_tech     NA    NA
 7  1999 Alabama    AL         adm_oth           NA    NA
 8  1983 Alaska     AK         adm_viol_new      NA    NA
 9  1994 Alaska     AK         adm_total         NA    NA
10  1994 Alaska     AK         adm_new_commit    NA    NA
# ℹ 341 more rows

In R, a value of NA means that data for that cell is missing. The original dataset has 11,475 rows, and the dataset filtered to include rows where m is missing has 351 rows.
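If you just want a count rather than the rows themselves, you can sum the logical vector that is.na() returns. This should match the 351 rows found above (roughly 3% of the dataset); the column names n_missing_m and pct_missing_m are just illustrative choices.

nps_admit |> 
  summarize(
    n_missing_m = sum(is.na(m)),     # number of rows where m is NA
    pct_missing_m = mean(is.na(m))   # share of rows where m is NA
  )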

You can also filter for rows with missing values in the f column.

nps_admit |> 
  filter(is.na(f))
# A tibble: 364 × 6
    year state_name state_abbr adm_type           m     f
   <dbl> <chr>      <chr>      <chr>          <dbl> <dbl>
 1  1983 Alabama    AL         adm_viol_tech     NA    NA
 2  1984 Alabama    AL         adm_viol_tech     NA    NA
 3  1985 Alabama    AL         adm_viol_tech     NA    NA
 4  1999 Alabama    AL         adm_new_commit    NA    NA
 5  1999 Alabama    AL         adm_viol_new      NA    NA
 6  1999 Alabama    AL         adm_viol_tech     NA    NA
 7  1999 Alabama    AL         adm_oth           NA    NA
 8  2003 Alabama    AL         adm_new_commit  9524    NA
 9  1983 Alaska     AK         adm_viol_new      NA    NA
10  1994 Alaska     AK         adm_total         NA    NA
# ℹ 354 more rows

You can see there is at least one row where m is not missing but f is. Just like other filtering conditions, you can combine multiple is.na() checks with & to find all rows where m and f are both missing.

nps_admit |> 
  filter(is.na(m) & is.na(f))
# A tibble: 343 × 6
    year state_name state_abbr adm_type           m     f
   <dbl> <chr>      <chr>      <chr>          <dbl> <dbl>
 1  1983 Alabama    AL         adm_viol_tech     NA    NA
 2  1984 Alabama    AL         adm_viol_tech     NA    NA
 3  1985 Alabama    AL         adm_viol_tech     NA    NA
 4  1999 Alabama    AL         adm_new_commit    NA    NA
 5  1999 Alabama    AL         adm_viol_new      NA    NA
 6  1999 Alabama    AL         adm_viol_tech     NA    NA
 7  1999 Alabama    AL         adm_oth           NA    NA
 8  1983 Alaska     AK         adm_viol_new      NA    NA
 9  1994 Alaska     AK         adm_total         NA    NA
10  1994 Alaska     AK         adm_new_commit    NA    NA
# ℹ 333 more rows
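You can also flip one of the conditions to isolate rows where only one of the columns is missing. For example, the following should return the rows where f is missing but m is not (364 minus 343, or 21 rows, based on the counts above):

nps_admit |> 
  filter(!is.na(m) & is.na(f))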

Packages for missing value detection

Using filter() and is.na() is a good, quick exploratory check to get a sense of how much data is missing. You can also take a more systematic approach to assessing the completeness of your data, and there are a number of R packages that help with this task. One such package is skimr. skimr is not part of the tidyverse, so the first time you use it, you’ll need to install it.

install.packages("skimr")

The skim() function returns a summary of a data frame that includes the amount of missing data as well as summary statistics.

library(skimr)
skim(nps_admit)
── Data Summary ────────────────────────
                           Values   
Name                       nps_admit
Number of rows             11475    
Number of columns          6        
_______________________             
Column type frequency:              
  character                3        
  numeric                  3        
________________________            
Group variables            None     

── Variable type: character ────────────────────────────────────────────────────
  skim_variable n_missing complete_rate min max empty n_unique whitespace
1 state_name            0             1   4  20     0       51          0
2 state_abbr            0             1   2   2     0       51          0
3 adm_type              0             1   7  14     0        5          0

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable n_missing complete_rate  mean     sd   p0  p25  p50  p75   p100
1 year                  0         1     2000    13.0 1978 1989 2000 2011   2022
2 m                   351         0.969 3505. 8246.     0  143  770 3448 126929
3 f                   364         0.968  394.  964.     0   10   71  358  15146
  hist 
1 ▇▇▇▇▇
2 ▇▁▁▁▁
3 ▇▁▁▁▁

From this summary, you can see that the m column has 351 missing observations, the f column has 364 missing observations, and the 4 other columns have no missing observations.
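If you just want those counts without the full skim report, you can also compute them with dplyr by applying is.na() to every column (a quick sketch using across()):

nps_admit |> 
  summarize(across(everything(), ~ sum(is.na(.x))))   # count NA values in each column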

A more in-depth option for dataset checking is the pointblank package. This package is not included in the tidyverse either, so the first step is to install it.

install.packages("pointblank")

There are a huge number of useful features for data validation in pointblank, but the simple scan_data() function returns a neat interactive exploration of a data frame.

library(pointblank)
scan_data(nps_admit)
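Beyond scan_data(), pointblank’s main workflow is to build an agent with one or more validation steps and then interrogate the data. A rough sketch of a check that m and f contain no missing values might look like this; the exact way columns are specified (for example, vars() versus tidyselect) can differ between package versions, so treat this as a starting point rather than a definitive recipe.

agent <- nps_admit |> 
  create_agent() |> 
  col_vals_not_null(columns = vars(m, f)) |>   # validation step: flag any NA values in m or f
  interrogate()                                # run the validation against the data

agent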

Missing values in calculations

Once you know how much missing data you’re dealing with, you can then decide how to proceed. As an example, calculate the total number of admissions by state and year for both men and women.

nps_admit_state_total <- nps_admit |> 
  filter(adm_type != "adm_total") |> 
  group_by(year, state_name) |> 
  summarize(m = sum(m), f = sum(f)) |> 
  ungroup()

nps_admit_state_total
# A tibble: 2,295 × 4
    year state_name               m     f
   <dbl> <chr>                <dbl> <dbl>
 1  1978 Alabama               2420   184
 2  1978 Alaska                 267    10
 3  1978 Arizona               1752   134
 4  1978 Arkansas              1854   107
 5  1978 California           11572   725
 6  1978 Colorado              1420    64
 7  1978 Connecticut           1948   143
 8  1978 Delaware               635    32
 9  1978 District of Columbia  3440   185
10  1978 Florida               8010   409
# ℹ 2,285 more rows

Then, check to see if there are any rows with missing values in the m column.

nps_admit_state_total |> 
  filter(is.na(m))
# A tibble: 224 × 4
    year state_name         m     f
   <dbl> <chr>          <dbl> <dbl>
 1  1983 Alabama           NA    NA
 2  1983 Alaska            NA    NA
 3  1983 Arkansas          NA    NA
 4  1983 Georgia           NA    NA
 5  1983 Indiana           NA    NA
 6  1983 Massachusetts     NA    NA
 7  1983 Montana           NA    21
 8  1983 New Hampshire     NA    NA
 9  1983 North Carolina    NA    NA
10  1983 Tennessee         NA    NA
# ℹ 214 more rows

For some of the groups, the group_by() and summarize() pattern led to NA. This happens when you try to sum a series of values in which one (or more) of the values is NA. Try it with this simple example:

sum(c(1, 2, NA))
[1] NA

NA is returned because R can’t know the total when one of the values is unknown. This may seem confusing or annoying, but it’s actually a useful feature! If R treated the NA value as 0 and returned 3, you might never realize there is a missing value in your data. To deal with this, you have to explicitly tell sum() to remove the NA values by setting na.rm = TRUE.

sum(c(1, 2, NA), na.rm = TRUE)
[1] 3

Now, the NA value is removed and 1 is added to 2 for a total of 3. Note that the NA is not set to 0; it is removed from the calculation completely. This distinction matters when calculating statistics like the mean or median, where treating a missing value as 0 would change the result.
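Here’s a quick illustration with made-up numbers: dropping the NA gives the mean of the two observed values, while replacing it with 0 would pull the mean down.

mean(c(1, 2, NA), na.rm = TRUE)   # NA dropped: average of 1 and 2
[1] 1.5
mean(c(1, 2, 0))                  # missing value replaced with 0: average of three values
[1] 1

Going back to the admissions by state example, try the group_by() and summarize() again, but set na.rm = TRUE in each of the sum() operations.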

nps_admit |> 
  filter(adm_type != "adm_total") |> 
  group_by(year, state_name) |> 
  summarize(m = sum(m, na.rm = TRUE), f = sum(f, na.rm = TRUE)) |> 
  ungroup() |> 
  filter(is.na(m))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 0 × 4
# ℹ 4 variables: year <dbl>, state_name <chr>, m <dbl>, f <dbl>

This time, no rows are returned when you filter for missing values in the m column, and all NA values have been ignored when summing the rows.
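One caveat to keep in mind: when every value in a group is NA, sum() with na.rm = TRUE returns 0 rather than NA, so a group with no data at all will look like a group with zero admissions.

sum(c(NA, NA), na.rm = TRUE)
[1] 0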

Many functions in R have an na.rm argument that forces you to be explicit about how NA values are handled. The mean() function works the same way as sum() in this regard. Try finding the average number of total admissions by state:

nps_admit |> 
  filter(adm_type == "adm_total") |> 
  group_by(state_name) |> 
  summarize(mean_m = mean(m), mean_f = mean(f))
# A tibble: 51 × 3
   state_name           mean_m mean_f
   <chr>                 <dbl>  <dbl>
 1 Alabama               7639.  1006 
 2 Alaska                  NA     NA 
 3 Arizona               8492.  1003.
 4 Arkansas              5242.   659.
 5 California           70076.  6788.
 6 Colorado              5785.   749.
 7 Connecticut           4662.   410.
 8 Delaware              1530.   224.
 9 District of Columbia    NA     NA 
10 Florida              26122.  2851.
# ℹ 41 more rows

This means that for Alaska and the District of Columbia, at least one of the years has a value of NA, so the mean can’t be computed. Set na.rm = TRUE to drop the NA values from the mean calculation.

nps_admit |> 
  filter(adm_type == "adm_total") |> 
  group_by(state_name) |> 
  summarize(mean_m = mean(m, na.rm = TRUE), mean_f = mean(f, na.rm = TRUE))
# A tibble: 51 × 3
   state_name           mean_m mean_f
   <chr>                 <dbl>  <dbl>
 1 Alabama               7639.  1006 
 2 Alaska                1771.   271.
 3 Arizona               8492.  1003.
 4 Arkansas              5242.   659.
 5 California           70076.  6788.
 6 Colorado              5785.   749.
 7 Connecticut           4662.   410.
 8 Delaware              1530.   224.
 9 District of Columbia  4620.   251.
10 Florida              26122.  2851.
# ℹ 41 more rows

Depending on the number of missing values and what calculations you’re trying to make, it may not always be appropriate to drop or ignore missing data. But, it’s essential that you are aware of the extent of missing values in your datasets and actively choose how to deal with the missing data.
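If you do decide to drop missing values, one way to stay aware of them is to report the number of NA values alongside the summary itself. A minimal sketch for the state-level means from above (the column name n_missing_m is just an illustrative choice):

nps_admit |> 
  filter(adm_type == "adm_total") |> 
  group_by(state_name) |> 
  summarize(
    mean_m = mean(m, na.rm = TRUE),   # mean of the observed values only
    n_missing_m = sum(is.na(m))       # number of years dropped from the mean
  )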

Resources