I have a dataset with multiple rows per id for some ids not all like this
Id Date Col2 Col3 Col4
1 2011-09-08 Red NA NA
2 2010-07-13 Blue NA NA
3 1997-04-23 Yellow NA NA
3 1997-04-23 Black NA NA
3 1997-04-23 Green NA Green
4 2020-01-27 Red NA NA
4 2020-01-27 Blue NA NA
5 1999-08-03 Red NA NA
5 1999-08-03 Blue NA Yellow
For ID with multiple rows I like to retain only rows where values in Col2
is equal to values in Col4
. If the ID has multiple rows and the values in Col2
is NOT equal to values in Col4
then filter is not applicable. Expecting a final dataset like this below
Id Date Col2 Col3 Col4
1 2011-09-08 Red NA NA
2 2010-07-13 Blue NA NA
3 1997-04-23 Green NA Green
4 2020-01-27 Red NA NA
4 2020-01-27 Blue NA NA
5 1999-08-03 Red NA NA
5 1999-08-03 Blue NA Yellow
What I did so far is this
df_final <- df1 %>%
group_by(ID)
filter(Col2 == Col4, negate(is.na))
but this is not working , any suggestions regarding this is much appreciated.
>Solution :
Since dplyr::filter
joins multiple statements using &
, that’s not what we want, and not sure what you’re trying to do with negate(is.na)
anyway. However, we can do a comparison to see if Col2 == Col4
and if not, check for the entire group that in no rows that occurs. When doing the comparison, I’ve just replaced NA
values in Col4
with ""
so the comparison returns T
and F
, but you can also add in a more complex check for if either are NA
.
I also created the column to filter on separately to avoid comparing between the columns twice in a single filter
call.
df1 %>%
mutate(check = Col2 == replace_na(Col4, "")) %>%
group_by(Id) %>%
filter(check | !any(check))
#> # A tibble: 7 × 6
#> # Groups: Id [5]
#> Id Date Col2 Col3 Col4 check
#> <dbl> <date> <chr> <lgl> <chr> <lgl>
#> 1 1 2011-09-08 Red NA <NA> FALSE
#> 2 2 2010-07-13 Blue NA <NA> FALSE
#> 3 3 1997-04-23 Green NA Green TRUE
#> 4 4 2020-01-27 Red NA <NA> FALSE
#> 5 4 2020-01-27 Blue NA <NA> FALSE
#> 6 5 1999-08-03 Red NA <NA> FALSE
#> 7 5 1999-08-03 Blue NA Yellow FALSE
Data
structure(list(Id = c(1, 2, 3, 3, 3, 4, 4, 5, 5), Date = structure(c(15225,
14803, 9974, 9974, 9974, 18288, 18288, 10806, 10806), class = "Date"),
Col2 = c("Red", "Blue", "Yellow", "Black", "Green", "Red",
"Blue", "Red", "Blue"), Col3 = c(NA, NA, NA, NA, NA, NA,
NA, NA, NA), Col4 = c(NA, NA, NA, NA, "Green", NA, NA, NA,
"Yellow")), problems = structure(list(row = 7L, col = NA_character_,
expected = "5 columns", actual = "6 columns", file = "literal data"), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(Id = structure(list(), class = c("collector_double",
"collector")), Date = structure(list(format = ""), class = c("collector_date",
"collector")), Col2 = structure(list(), class = c("collector_character",
"collector")), Col3 = structure(list(), class = c("collector_logical",
"collector")), Col4 = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))