Fill groups that have only zeros in them with NA

November 13, 2023

I obtained some old data which contain groups that hold only 0s. I want to clean this dataset to use it for teaching. While some 0s are naturally occurring in my data, it is not possible for a group to contain only 0s. That leads me to the conclusion that observations in such a group can safely be labelled with NAs.

Goal

I wish that 0 should be replaced with NA only if the whole group is full of 0.

Reprex

library(dplyr)
df <- tibble(key_1 = rep(1:2, each = 4),
                 key_2 = rep(letters[1:2], each = 2, times = 2),
                 value = c(0, 0, 0, -1, 1, 2, 1, 0))
df
#> Output
# A tibble: 8 × 3
# key_1 key_2 value
# <int> <chr> <dbl>
#     1 a         0
#     1 a         0
#     1 b         0
#     1 b        -1
#     2 a         1
#     2 a         2
#     2 b         1
#     2 b         0

This data represents a situation where multiple key variables uniquely identify each group.

In group 1a, everything is 0.
In group 2b, there is one 0.

I expect 1a to be filled with NAs while e.g. 2b and other groups are preserved. I expect the solution to work within a dplyr pipeline, which is not mandatory, but it would be nice.

Desired Output

# key_1 key_2 value
# <int> <chr> <dbl>
#     1 a        NA
#     1 a        NA
#     1 b         0
#     1 b        -1
#     2 a         1
#     2 a         2
#     2 b         1
#     2 b         0

Attempts so far

I tried using dplyr::case_when(), and to set the normal values of the value column as a fallback. This throws the following error:

df |> group_by(key_1, key_2) |>
  mutate(value = case_when(sum(value != 0) == 0 ~ NA,
                           .default = value))

#> Error:
# ! `.default` must have size 1, not size 2.

If I don’t specify a default, everything is NA.

Providing some dummy value to default shows that the condition sum(value != 0) == 0 is working correctly together with group_by.

df |> group_by(key_1, key_2) |>
  mutate(value = case_when(sum(value != 0) == 0 ~ NA,
                           .default = "default"))
#> Output
# A tibble: 8 × 3
# Groups:   key_1, key_2 [4]
# key_1 key_2 value
# <int> <chr> <chr>
#     1 a     NA
#     1 a     NA
#     1 b     default
#     1 b     default
#     2 a     default
#     2 a     default
#     2 b     default
#     2 b     default

>Solution :

You need to make sure you get an NA for each row. (case_when is a little stricter about recycling than base R. Here, your test value is length-1 for the group, and the result you specify, NA, is length-1, but the result you want has the same length as the number of rows in the group.)

df |>
  mutate(
    value = case_when(all(value == 0) ~ rep(NA, n()), .default = value),
    .by = c(key_1, key_2)
  )
# # A tibble: 8 × 3
#   key_1 key_2 value
#   <int> <chr> <dbl>
# 1     1 a        NA
# 2     1 a        NA
# 3     1 b         0
# 4     1 b        -1
# 5     2 a         1
# 6     2 a         2
# 7     2 b         1
# 8     2 b         0