How to fill missing values with a categorical variable in R?

October 16, 2023

I have a large data set that I’m working with, and unfortunately, the labelling of observations is very inconsistent. As a result, I’m trying to clean the data as much as possible so its easier to work with.

One problem I’m dealing with is missing values for a categorical variable. I have provided a basic example using the code below to illustrate the point. I’m wondering if there is an easy/efficient way to group by name and "fill" those NAs in the pos column with the value that appears first and/or most frequently. For example, I’d like the NA value for Name A to be replaced/filled with DEF. That way, if I do any subsequent group_by(pos) using dplyr I won’t have two separate observations for each name. I understand I could use a simple case_when(), but this could be very tedious when working with a very large data set with many, many unique names. Thanks.

name <- rep(paste("Name", LETTERS[1:3]), each = 5)
pos <- c(c(rep("DEF", 4), NA), rep("MID", 5), c(rep("FWD", 4), NA))

d <- data.frame(name, pos)

name  pos
1  Name A  DEF
2  Name A  DEF
3  Name A  DEF
4  Name A  DEF
5  Name A <NA>
6  Name B  MID
7  Name B  MID
8  Name B  MID
9  Name B  MID
10 Name B  MID
11 Name C  FWD
12 Name C  FWD
13 Name C  FWD
14 Name C  FWD
15 Name C <NA>

>Solution :

Two potential options:

library(tidyverse)

name <- rep(paste("Name", LETTERS[1:3]), each = 5)
pos <- c(c(rep("DEF", 4), NA), rep("MID", 5), c(rep("FWD", 4), NA))

d <- data.frame(name, pos)

# take the first value if not NA, else take the most frequent
d %>%
  group_by(name) %>%
  mutate(pos = ifelse(!is.na(first(pos)), first(pos), names(which.max(table(pos))))) %>%
  ungroup()
#> # A tibble: 15 × 2
#>    name   pos  
#>    <chr>  <chr>
#>  1 Name A DEF  
#>  2 Name A DEF  
#>  3 Name A DEF  
#>  4 Name A DEF  
#>  5 Name A DEF  
#>  6 Name B MID  
#>  7 Name B MID  
#>  8 Name B MID  
#>  9 Name B MID  
#> 10 Name B MID  
#> 11 Name C FWD  
#> 12 Name C FWD  
#> 13 Name C FWD  
#> 14 Name C FWD  
#> 15 Name C FWD

# ALternatively, take the pos value from the row before the NA
d %>% group_by(name) %>% mutate(pos = vctrs::vec_fill_missing(pos, direction = "down"))
#> # A tibble: 15 × 2
#> # Groups:   name [3]
#>    name   pos  
#>    <chr>  <chr>
#>  1 Name A DEF  
#>  2 Name A DEF  
#>  3 Name A DEF  
#>  4 Name A DEF  
#>  5 Name A DEF  
#>  6 Name B MID  
#>  7 Name B MID  
#>  8 Name B MID  
#>  9 Name B MID  
#> 10 Name B MID  
#> 11 Name C FWD  
#> 12 Name C FWD  
#> 13 Name C FWD  
#> 14 Name C FWD  
#> 15 Name C FWD

^{Created on 2023-10-16 with reprex v2.0.2}