Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Fill groups that have only zeros in them with NA

I obtained some old data which contain groups that hold only 0s. I want to clean this dataset to use it for teaching. While some 0s are naturally occurring in my data, it is not possible for a group to contain only 0s. That leads me to the conclusion that observations in such a group can safely be labelled with NAs.

Goal

I wish that 0 should be replaced with NA only if the whole group is full of 0.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Reprex

library(dplyr)
df <- tibble(key_1 = rep(1:2, each = 4),
                 key_2 = rep(letters[1:2], each = 2, times = 2),
                 value = c(0, 0, 0, -1, 1, 2, 1, 0))
df
#> Output
# A tibble: 8 × 3
# key_1 key_2 value
# <int> <chr> <dbl>
#     1 a         0
#     1 a         0
#     1 b         0
#     1 b        -1
#     2 a         1
#     2 a         2
#     2 b         1
#     2 b         0

This data represents a situation where multiple key variables uniquely identify each group.

  • In group 1a, everything is 0.
  • In group 2b, there is one 0.

I expect 1a to be filled with NAs while e.g. 2b and other groups are preserved. I expect the solution to work within a dplyr pipeline, which is not mandatory, but it would be nice.

Desired Output

# key_1 key_2 value
# <int> <chr> <dbl>
#     1 a        NA
#     1 a        NA
#     1 b         0
#     1 b        -1
#     2 a         1
#     2 a         2
#     2 b         1
#     2 b         0

Attempts so far

I tried using dplyr::case_when(), and to set the normal values of the value column as a fallback. This throws the following error:

df |> group_by(key_1, key_2) |>
  mutate(value = case_when(sum(value != 0) == 0 ~ NA,
                           .default = value))

#> Error:
# ! `.default` must have size 1, not size 2.

If I don’t specify a default, everything is NA.

Providing some dummy value to default shows that the condition sum(value != 0) == 0 is working correctly together with group_by.

df |> group_by(key_1, key_2) |>
  mutate(value = case_when(sum(value != 0) == 0 ~ NA,
                           .default = "default"))
#> Output
# A tibble: 8 × 3
# Groups:   key_1, key_2 [4]
# key_1 key_2 value
# <int> <chr> <chr>
#     1 a     NA
#     1 a     NA
#     1 b     default
#     1 b     default
#     2 a     default
#     2 a     default
#     2 b     default
#     2 b     default

>Solution :

You need to make sure you get an NA for each row. (case_when is a little stricter about recycling than base R. Here, your test value is length-1 for the group, and the result you specify, NA, is length-1, but the result you want has the same length as the number of rows in the group.)

df |>
  mutate(
    value = case_when(all(value == 0) ~ rep(NA, n()), .default = value),
    .by = c(key_1, key_2)
  )
# # A tibble: 8 × 3
#   key_1 key_2 value
#   <int> <chr> <dbl>
# 1     1 a        NA
# 2     1 a        NA
# 3     1 b         0
# 4     1 b        -1
# 5     2 a         1
# 6     2 a         2
# 7     2 b         1
# 8     2 b         0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading