library(tidyverse)
nps_admit <- read_csv("data/nps-admissions.csv")Lesson 18: Missing Values
Introduction
You’ve gotten a long way into this course without having to deal with something you probably run into every day: missing data! Because the examples in this course use cleaned-up versions of publicly available data, you’ve been able to focus on learning R functions and operations for data wrangling and visualization without worrying too much about the quality of the data. But data quality and availability are probably central to your real-life data work. This lesson will discuss how R handles missing values, how to check for missing data, and how to analyze data that has missing values.
For this lesson, you’ll need the the NPS admissions file. Read it into R with the following code:
Check for missing values
When working with a new dataset, it’s important to look for missing data. A simple way to see how much data is missing is to use filter() to only keep rows where a certain column is missing. Use is.na() in combination with filter() to identify these rows.
nps_admit |>
filter(is.na(m))# A tibble: 351 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1983 Alabama AL adm_viol_tech NA NA
2 1984 Alabama AL adm_viol_tech NA NA
3 1985 Alabama AL adm_viol_tech NA NA
4 1999 Alabama AL adm_new_commit NA NA
5 1999 Alabama AL adm_viol_new NA NA
6 1999 Alabama AL adm_viol_tech NA NA
7 1999 Alabama AL adm_oth NA NA
8 1983 Alaska AK adm_viol_new NA NA
9 1994 Alaska AK adm_total NA NA
10 1994 Alaska AK adm_new_commit NA NA
# ℹ 341 more rows
In R, a value of NA means that data for that cell is missing. The original dataset has 11,475 rows, and the dataset filtered to include rows where m is missing has 351 rows.
You can also filter for rows with missing values in the f column.
nps_admit |>
filter(is.na(f))# A tibble: 364 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1983 Alabama AL adm_viol_tech NA NA
2 1984 Alabama AL adm_viol_tech NA NA
3 1985 Alabama AL adm_viol_tech NA NA
4 1999 Alabama AL adm_new_commit NA NA
5 1999 Alabama AL adm_viol_new NA NA
6 1999 Alabama AL adm_viol_tech NA NA
7 1999 Alabama AL adm_oth NA NA
8 2003 Alabama AL adm_new_commit 9524 NA
9 1983 Alaska AK adm_viol_new NA NA
10 1994 Alaska AK adm_total NA NA
# ℹ 354 more rows
You can see there is at least one row where m is not missing, but f is missing. Similar to other filtering conditions, you can combine is.na() to find all rows where m and f are both missing.
nps_admit |>
filter(is.na(m) & is.na(f))# A tibble: 343 × 6
year state_name state_abbr adm_type m f
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 1983 Alabama AL adm_viol_tech NA NA
2 1984 Alabama AL adm_viol_tech NA NA
3 1985 Alabama AL adm_viol_tech NA NA
4 1999 Alabama AL adm_new_commit NA NA
5 1999 Alabama AL adm_viol_new NA NA
6 1999 Alabama AL adm_viol_tech NA NA
7 1999 Alabama AL adm_oth NA NA
8 1983 Alaska AK adm_viol_new NA NA
9 1994 Alaska AK adm_total NA NA
10 1994 Alaska AK adm_new_commit NA NA
# ℹ 333 more rows
Packages for missing value detection
Using filter() and is.na() are good quick exploratory checks to get a sense of how much data is missing. You can also use a more systematic approach to assess the completeness of your data. There are a number of R packages that help with this task. One package is skimr. skimr is not part of the tidyverse, so the first time you use it, you’ll need to install it.
install.packages("skimr")The skim() function returns a summary of a data frame that includes the amount of missing data as well as summary statistics.
library(skimr)Warning: package 'skimr' was built under R version 4.4.3
skim(nps_admit)── Data Summary ────────────────────────
Values
Name nps_admit
Number of rows 11475
Number of columns 6
_______________________
Column type frequency:
character 3
numeric 3
________________________
Group variables None
── Variable type: character ────────────────────────────────────────────────────
skim_variable n_missing complete_rate min max empty n_unique whitespace
1 state_name 0 1 4 20 0 51 0
2 state_abbr 0 1 2 2 0 51 0
3 adm_type 0 1 7 14 0 5 0
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
1 year 0 1 2000 13.0 1978 1989 2000 2011 2022
2 m 351 0.969 3505. 8246. 0 143 770 3448 126929
3 f 364 0.968 394. 964. 0 10 71 358 15146
hist
1 ▇▇▇▇▇
2 ▇▁▁▁▁
3 ▇▁▁▁▁
From this summary, you can see that the m column has 351 missing observations, the f column has 364 missing observations, and the 4 other columns have no missing observations.
A more in-depth option for dataset checking is the pointblank package. This package is not included in the tidyverse either, so the first step is to install it.
install.packages("pointblank")There are a huge number of useful features for data validation in pointblank, but the simple scan_data() function returns a neat interactive exploration of a data frame.
library(pointblank)
scan_data(nps_admit)Missing values in calculations
Once you know how much missing data you’re dealing with, you can then decide how to proceed. As an example, calculate the total number of admissions by state and year for both men and women.
nps_admit_state_total <- nps_admit |>
filter(adm_type != "adm_total") |>
group_by(year, state_name) |>
summarize(m = sum(m), f = sum(f)) |>
ungroup()
nps_admit_state_total# A tibble: 2,295 × 4
year state_name m f
<dbl> <chr> <dbl> <dbl>
1 1978 Alabama 2420 184
2 1978 Alaska 267 10
3 1978 Arizona 1752 134
4 1978 Arkansas 1854 107
5 1978 California 11572 725
6 1978 Colorado 1420 64
7 1978 Connecticut 1948 143
8 1978 Delaware 635 32
9 1978 District of Columbia 3440 185
10 1978 Florida 8010 409
# ℹ 2,285 more rows
Then, check to see if there are any rows with missing values in the m column.
nps_admit_state_total |>
filter(is.na(m))# A tibble: 224 × 4
year state_name m f
<dbl> <chr> <dbl> <dbl>
1 1983 Alabama NA NA
2 1983 Alaska NA NA
3 1983 Arkansas NA NA
4 1983 Georgia NA NA
5 1983 Indiana NA NA
6 1983 Massachusetts NA NA
7 1983 Montana NA 21
8 1983 New Hampshire NA NA
9 1983 North Carolina NA NA
10 1983 Tennessee NA NA
# ℹ 214 more rows
For some of the groups, the group_by() and summarize() pattern led to NA. This happens when you try to sum a series of values in which one (or more) of the values is NA. Try it with this simple example:
sum(c(1, 2, NA))[1] NA
NA is returned because R doesn’t know how to add NA to a number. This may seem confusing or annoying, but actually it’s a useful feature! If R treated the NA value as 0 and returned 3, you might not be aware that there is a missing value in your data. To deal with this, you have to explicitly tell sum() to ignore or remove the NA value by setting na.rm = TRUE in the sum function.
sum(c(1, 2, NA), na.rm = TRUE)[1] 3
Now, the NA value is removed and 1 is added to 2 for a total of 3. Note that the NA is not set to 0, but rather is removed from the calculation completely. This will help you when calculating things like mean or median in which including a 0 would change the result. Going back to the admissions by state example, try the group_by() and summarize() again, but set na.rm = TRUE in each of the sum() operations.
nps_admit |>
filter(adm_type != "adm_total") |>
group_by(year, state_name) |>
summarize(m = sum(m, na.rm = TRUE), f = sum(f, na.rm = TRUE)) |>
ungroup() |>
filter(is.na(m))`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 0 × 4
# ℹ 4 variables: year <dbl>, state_name <chr>, m <dbl>, f <dbl>
This time, no rows are returned when you filter for missing values in the m column, and all NA values have been ignored when summing the rows.
Many functions in R have an na.rm argument that forces you to be explicit in how NA values are handled. The mean function works the same as sum in this regard. Try finding the average number of total admissions by state:
nps_admit |>
filter(adm_type == "adm_total") |>
group_by(state_name) |>
summarize(mean_m = mean(m), mean_f = mean(f))# A tibble: 51 × 3
state_name mean_m mean_f
<chr> <dbl> <dbl>
1 Alabama 7639. 1006
2 Alaska NA NA
3 Arizona 8492. 1003.
4 Arkansas 5242. 659.
5 California 70076. 6788.
6 Colorado 5785. 749.
7 Connecticut 4662. 410.
8 Delaware 1530. 224.
9 District of Columbia NA NA
10 Florida 26122. 2851.
# ℹ 41 more rows
This means that for Alaska and the District of Columbia, at least one of the years as a value of NA and the mean can’t be computed. Set na.rm = TRUE to drop the NA value from the mean calculation.
nps_admit |>
filter(adm_type == "adm_total") |>
group_by(state_name) |>
summarize(mean_m = mean(m, na.rm = TRUE), mean_f = mean(f, na.rm = TRUE))# A tibble: 51 × 3
state_name mean_m mean_f
<chr> <dbl> <dbl>
1 Alabama 7639. 1006
2 Alaska 1771. 271.
3 Arizona 8492. 1003.
4 Arkansas 5242. 659.
5 California 70076. 6788.
6 Colorado 5785. 749.
7 Connecticut 4662. 410.
8 Delaware 1530. 224.
9 District of Columbia 4620. 251.
10 Florida 26122. 2851.
# ℹ 41 more rows
Depending on the number of missing values and what calculations you’re trying to make, it may not always be appropriate to drop or ignore missing data. But, it’s essential that you are aware of the extent of missing values in your datasets and actively choose how to deal with the missing data.
Resources
- R for Data Science (2e), Chapter 18 Missing values: https://r4ds.hadley.nz/missing-values
- skimr R package website: https://docs.ropensci.org/skimr
- pointblank R package website: https://rstudio.github.io/pointblank