I obtained some old data which contain groups that hold only 0s. I want to clean this dataset to use it for teaching. While some 0s are naturally occurring in my data, it is not possible for a group to contain only 0s. That leads me to the conclusion that observations in such a group can safely be labelled with NAs.
Goal
I wish that 0 should be replaced with NA only if the whole group is full of 0.
Reprex
library(dplyr)
df <- tibble(key_1 = rep(1:2, each = 4),
key_2 = rep(letters[1:2], each = 2, times = 2),
value = c(0, 0, 0, -1, 1, 2, 1, 0))
df
#> Output
# A tibble: 8 × 3
# key_1 key_2 value
# <int> <chr> <dbl>
# 1 a 0
# 1 a 0
# 1 b 0
# 1 b -1
# 2 a 1
# 2 a 2
# 2 b 1
# 2 b 0
This data represents a situation where multiple key variables uniquely identify each group.
- In group
1a, everything is0. - In group
2b, there is one0.
I expect 1a to be filled with NAs while e.g. 2b and other groups are preserved. I expect the solution to work within a dplyr pipeline, which is not mandatory, but it would be nice.
Desired Output
# key_1 key_2 value
# <int> <chr> <dbl>
# 1 a NA
# 1 a NA
# 1 b 0
# 1 b -1
# 2 a 1
# 2 a 2
# 2 b 1
# 2 b 0
Attempts so far
I tried using dplyr::case_when(), and to set the normal values of the value column as a fallback. This throws the following error:
df |> group_by(key_1, key_2) |>
mutate(value = case_when(sum(value != 0) == 0 ~ NA,
.default = value))
#> Error:
# ! `.default` must have size 1, not size 2.
If I don’t specify a default, everything is NA.
Providing some dummy value to default shows that the condition sum(value != 0) == 0 is working correctly together with group_by.
df |> group_by(key_1, key_2) |>
mutate(value = case_when(sum(value != 0) == 0 ~ NA,
.default = "default"))
#> Output
# A tibble: 8 × 3
# Groups: key_1, key_2 [4]
# key_1 key_2 value
# <int> <chr> <chr>
# 1 a NA
# 1 a NA
# 1 b default
# 1 b default
# 2 a default
# 2 a default
# 2 b default
# 2 b default
>Solution :
You need to make sure you get an NA for each row. (case_when is a little stricter about recycling than base R. Here, your test value is length-1 for the group, and the result you specify, NA, is length-1, but the result you want has the same length as the number of rows in the group.)
df |>
mutate(
value = case_when(all(value == 0) ~ rep(NA, n()), .default = value),
.by = c(key_1, key_2)
)
# # A tibble: 8 × 3
# key_1 key_2 value
# <int> <chr> <dbl>
# 1 1 a NA
# 2 1 a NA
# 3 1 b 0
# 4 1 b -1
# 5 2 a 1
# 6 2 a 2
# 7 2 b 1
# 8 2 b 0