Home Replace NA with random weighted value by group

Questions

Replace NA with random weighted value by group

April 25, 2023

I want to replace the NAs in a variable with a weighted random sample of the values within a given group. Rough sample data:

 A tibble: 16 × 3
   letter number code 
   <chr>   <int> <chr>
 1 a           1 w1   
 2 a           1 w1   
 3 a           1 w2   
 4 a           1 NA   
 5 a           2 x1   
 6 a           2 x2   
 7 a           2 x2   
 8 a           2 NA   
 9 b           1 y1   
10 b           1 y2   
11 b           1 NA   
12 b           2 z1   
13 b           2 z2   
14 b           2 z3   
15 b           2 z4   
16 b           2 NA

For each letter & number group, the NA would be replaced based on the proportions that the given codes exist already in that group. The group where letter = a and number = 1, has the values w1 and w2. For each NA code in that group, it would be replaced by sampling those two options, weighted based on frequency they already appear in that group.

Given the outcome would be a random weighted value, there isn’t an exact expected outcome, but in principle this is the idea:

# A tibble: 16 × 3
   letter number code                        
   <chr>   <int> <chr>                       
 1 a           1 w1                          
 2 a           1 w1                          
 3 a           1 w2                          
 4 a           1 2/3 chance w1; 1/3 chance w2
 5 a           2 x1                          
 6 a           2 x2                          
 7 a           2 x2                          
 8 a           2 1/3 chance x1; 2/3 chance x2
 9 b           1 y1                          
10 b           1 y2                          
11 b           1 50% chance either y1 or y2  
12 b           2 z1                          
13 b           2 z2                          
14 b           2 z3                          
15 b           2 z4                          
16 b           2 25% chance z1, z2, z3, z4

I can do this manually, entering the options and proportions for each group:

df |> 
  mutate(
    replaced =
      case_when(
        letter == "a" & number == 1 ~ replace_na(code, sample(c("w1", "w2"), 1, replace = TRUE, prob = c(2/3, 1/3))),
        letter == "a" & number == 2 ~ replace_na(code, sample(c("x1", "x2"), 1, replace = TRUE, prob = c(1/3, 2/3))),
        letter == "b" & number == 1 ~ replace_na(code, sample(c("y1", "y2"), 1, replace = TRUE,)),
        letter == "b" & number == 2 ~ replace_na(code, sample(c("z1", "z2", "z3", "z4"), 1, replace = TRUE))
      )
  )

But my actual data is too large to make that a viable option. Is there a way to streamline this so the possible values and proportions per group are determined automatically? No reason it needs to be a case_when mutate function. I’ve wondered if splitting into a list of dfs for each group, mapping, and then recombining might be a better approach for this.

structure(list(letter = c("a", "a", "a", "a", "a", "a", "a", 
"a", "b", "b", "b", "b", "b", "b", "b", "b"), number = c(1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), 
code = c("w1", "w1", "w2", NA, "x1", "x2", "x2", NA, "y1", 
"y2", NA, "z1", "z2", "z3", "z4", NA)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -16L))

>Solution :

We can use sample on the non-missing entries by group:

library(dplyr)
df |>
  group_by(letter, number) |>
  mutate(code = coalesce(code, sample(x = na.omit(code), size = n(), replace = TRUE))) |>
  ungroup()
# # A tibble: 16 × 3
#    letter number code 
#    <chr>   <int> <chr>
#  1 a           1 w1   
#  2 a           1 w1   
#  3 a           1 w2   
#  4 a           1 w2   
#  5 a           2 x1   
#  6 a           2 x2   
#  7 a           2 x2   
#  8 a           2 x2   
#  ...

Empirically, we can check that sample doesn’t do unique() on the vector to sample from, meaning this should work just fine:

set.seed(47)
sample(c('a', 'a', 'a', 'b'), size = 100, replace = TRUE) |> table()
#  a  b 
# 79 21

random

byMR

Published April 25, 2023

Add a comment

Extract sentence and value from string in R

byMR

April 25, 2023

Questions

How to do things after an observable gets its data in Angular

byMR

April 25, 2023

Questions

How to put every value from the column into the formula? (And make new column with results)

byMR

April 25, 2023

Questions

How to construct a find statement using multiple bash script arguments

byMR

April 25, 2023

Questions

append spliter row after rows which contains 'REL' in action column

byMR

April 25, 2023

Questions

Return all keys with their array items in a SELECT on a JSON field in MYSQL 8

byMR

April 25, 2023

Replace NA with random weighted value by group

MEDevel.com: Open-source for Healthcare and Education

>Solution :

Like this:

Leave a ReplyCancel reply

Read more

Extract sentence and value from string in R

How to do things after an observable gets its data in Angular

How to put every value from the column into the formula? (And make new column with results)

How to construct a find statement using multiple bash script arguments

append spliter row after rows which contains 'REL' in action column

Return all keys with their array items in a SELECT on a JSON field in MYSQL 8

Keep Up to Date with the Most Important News

Replace NA with random weighted value by group

MEDevel.com: Open-source for Healthcare and Education

>Solution :

Share this:

Like this:

Leave a ReplyCancel reply

Keep Up to Date with the Most Important News

Read more

Extract sentence and value from string in R

How to do things after an observable gets its data in Angular

How to put every value from the column into the formula? (And make new column with results)

How to construct a find statement using multiple bash script arguments

append spliter row after rows which contains 'REL' in action column

Return all keys with their array items in a SELECT on a JSON field in MYSQL 8

Discover more from Dev solutions