Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Replace NA with random weighted value by group

I want to replace the NAs in a variable with a weighted random sample of the values within a given group. Rough sample data:

 A tibble: 16 × 3
   letter number code 
   <chr>   <int> <chr>
 1 a           1 w1   
 2 a           1 w1   
 3 a           1 w2   
 4 a           1 NA   
 5 a           2 x1   
 6 a           2 x2   
 7 a           2 x2   
 8 a           2 NA   
 9 b           1 y1   
10 b           1 y2   
11 b           1 NA   
12 b           2 z1   
13 b           2 z2   
14 b           2 z3   
15 b           2 z4   
16 b           2 NA   

For each letter & number group, the NA would be replaced based on the proportions that the given codes exist already in that group. The group where letter = a and number = 1, has the values w1 and w2. For each NA code in that group, it would be replaced by sampling those two options, weighted based on frequency they already appear in that group.

Given the outcome would be a random weighted value, there isn’t an exact expected outcome, but in principle this is the idea:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

# A tibble: 16 × 3
   letter number code                        
   <chr>   <int> <chr>                       
 1 a           1 w1                          
 2 a           1 w1                          
 3 a           1 w2                          
 4 a           1 2/3 chance w1; 1/3 chance w2
 5 a           2 x1                          
 6 a           2 x2                          
 7 a           2 x2                          
 8 a           2 1/3 chance x1; 2/3 chance x2
 9 b           1 y1                          
10 b           1 y2                          
11 b           1 50% chance either y1 or y2  
12 b           2 z1                          
13 b           2 z2                          
14 b           2 z3                          
15 b           2 z4                          
16 b           2 25% chance z1, z2, z3, z4 

I can do this manually, entering the options and proportions for each group:

df |> 
  mutate(
    replaced =
      case_when(
        letter == "a" & number == 1 ~ replace_na(code, sample(c("w1", "w2"), 1, replace = TRUE, prob = c(2/3, 1/3))),
        letter == "a" & number == 2 ~ replace_na(code, sample(c("x1", "x2"), 1, replace = TRUE, prob = c(1/3, 2/3))),
        letter == "b" & number == 1 ~ replace_na(code, sample(c("y1", "y2"), 1, replace = TRUE,)),
        letter == "b" & number == 2 ~ replace_na(code, sample(c("z1", "z2", "z3", "z4"), 1, replace = TRUE))
      )
  )

But my actual data is too large to make that a viable option. Is there a way to streamline this so the possible values and proportions per group are determined automatically? No reason it needs to be a case_when mutate function. I’ve wondered if splitting into a list of dfs for each group, mapping, and then recombining might be a better approach for this.

structure(list(letter = c("a", "a", "a", "a", "a", "a", "a", 
"a", "b", "b", "b", "b", "b", "b", "b", "b"), number = c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), 
code = c("w1", "w1", "w2", NA, "x1", "x2", "x2", NA, "y1", 
"y2", NA, "z1", "z2", "z3", "z4", NA)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -16L))

>Solution :

We can use sample on the non-missing entries by group:

library(dplyr)
df |>
  group_by(letter, number) |>
  mutate(code = coalesce(code, sample(x = na.omit(code), size = n(), replace = TRUE))) |>
  ungroup()
# # A tibble: 16 × 3
#    letter number code 
#    <chr>   <int> <chr>
#  1 a           1 w1   
#  2 a           1 w1   
#  3 a           1 w2   
#  4 a           1 w2   
#  5 a           2 x1   
#  6 a           2 x2   
#  7 a           2 x2   
#  8 a           2 x2   
#  ...

Empirically, we can check that sample doesn’t do unique() on the vector to sample from, meaning this should work just fine:

set.seed(47)
sample(c('a', 'a', 'a', 'b'), size = 100, replace = TRUE) |> table()
#  a  b 
# 79 21 
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading