I have a large data set that I’m working with, and unfortunately, the labelling of observations is very inconsistent. As a result, I’m trying to clean the data as much as possible so its easier to work with.
One problem I’m dealing with is missing values for a categorical variable. I have provided a basic example using the code below to illustrate the point. I’m wondering if there is an easy/efficient way to group by name and "fill" those NAs in the pos column with the value that appears first and/or most frequently. For example, I’d like the NA value for Name A to be replaced/filled with DEF. That way, if I do any subsequent group_by(pos) using dplyr I won’t have two separate observations for each name. I understand I could use a simple case_when(), but this could be very tedious when working with a very large data set with many, many unique names. Thanks.
name <- rep(paste("Name", LETTERS[1:3]), each = 5)
pos <- c(c(rep("DEF", 4), NA), rep("MID", 5), c(rep("FWD", 4), NA))
d <- data.frame(name, pos)
name pos
1 Name A DEF
2 Name A DEF
3 Name A DEF
4 Name A DEF
5 Name A <NA>
6 Name B MID
7 Name B MID
8 Name B MID
9 Name B MID
10 Name B MID
11 Name C FWD
12 Name C FWD
13 Name C FWD
14 Name C FWD
15 Name C <NA>
>Solution :
Two potential options:
library(tidyverse)
name <- rep(paste("Name", LETTERS[1:3]), each = 5)
pos <- c(c(rep("DEF", 4), NA), rep("MID", 5), c(rep("FWD", 4), NA))
d <- data.frame(name, pos)
# take the first value if not NA, else take the most frequent
d %>%
group_by(name) %>%
mutate(pos = ifelse(!is.na(first(pos)), first(pos), names(which.max(table(pos))))) %>%
ungroup()
#> # A tibble: 15 × 2
#> name pos
#> <chr> <chr>
#> 1 Name A DEF
#> 2 Name A DEF
#> 3 Name A DEF
#> 4 Name A DEF
#> 5 Name A DEF
#> 6 Name B MID
#> 7 Name B MID
#> 8 Name B MID
#> 9 Name B MID
#> 10 Name B MID
#> 11 Name C FWD
#> 12 Name C FWD
#> 13 Name C FWD
#> 14 Name C FWD
#> 15 Name C FWD
# ALternatively, take the pos value from the row before the NA
d %>% group_by(name) %>% mutate(pos = vctrs::vec_fill_missing(pos, direction = "down"))
#> # A tibble: 15 × 2
#> # Groups: name [3]
#> name pos
#> <chr> <chr>
#> 1 Name A DEF
#> 2 Name A DEF
#> 3 Name A DEF
#> 4 Name A DEF
#> 5 Name A DEF
#> 6 Name B MID
#> 7 Name B MID
#> 8 Name B MID
#> 9 Name B MID
#> 10 Name B MID
#> 11 Name C FWD
#> 12 Name C FWD
#> 13 Name C FWD
#> 14 Name C FWD
#> 15 Name C FWD
Created on 2023-10-16 with reprex v2.0.2