r filter based on two columns and not missing

Advertisements

I have a dataset with multiple rows per id for some ids not all like this

    Id    Date       Col2   Col3    Col4
    1     2011-09-08 Red    NA      NA
    2     2010-07-13 Blue   NA      NA
    3     1997-04-23 Yellow NA      NA
    3     1997-04-23 Black  NA      NA
    3     1997-04-23 Green  NA      Green
    4     2020-01-27 Red    NA      NA
    4     2020-01-27 Blue   NA      NA 
    5     1999-08-03 Red    NA      NA
    5     1999-08-03 Blue   NA      Yellow

For ID with multiple rows I like to retain only rows where values in Col2 is equal to values in Col4. If the ID has multiple rows and the values in Col2 is NOT equal to values in Col4 then filter is not applicable. Expecting a final dataset like this below

   Id    Date       Col2   Col3    Col4
    1     2011-09-08 Red    NA      NA
    2     2010-07-13 Blue   NA      NA
    
    3     1997-04-23 Green  NA      Green
    
    4     2020-01-27 Red    NA      NA
    4     2020-01-27 Blue   NA      NA 
    5     1999-08-03 Red    NA      NA
    5     1999-08-03 Blue   NA      Yellow

What I did so far is this

    df_final <- df1 %>% 
                group_by(ID)
                filter(Col2 == Col4, negate(is.na))

but this is not working , any suggestions regarding this is much appreciated.

>Solution :

Since dplyr::filter joins multiple statements using &, that’s not what we want, and not sure what you’re trying to do with negate(is.na) anyway. However, we can do a comparison to see if Col2 == Col4 and if not, check for the entire group that in no rows that occurs. When doing the comparison, I’ve just replaced NA values in Col4 with "" so the comparison returns T and F, but you can also add in a more complex check for if either are NA.

I also created the column to filter on separately to avoid comparing between the columns twice in a single filter call.

df1 %>% 
  mutate(check = Col2 == replace_na(Col4, "")) %>%
  group_by(Id) %>%
  filter(check | !any(check))
#> # A tibble: 7 × 6
#> # Groups:   Id [5]
#>      Id Date       Col2  Col3  Col4   check
#>   <dbl> <date>     <chr> <lgl> <chr>  <lgl>
#> 1     1 2011-09-08 Red   NA    <NA>   FALSE
#> 2     2 2010-07-13 Blue  NA    <NA>   FALSE
#> 3     3 1997-04-23 Green NA    Green  TRUE 
#> 4     4 2020-01-27 Red   NA    <NA>   FALSE
#> 5     4 2020-01-27 Blue  NA    <NA>   FALSE
#> 6     5 1999-08-03 Red   NA    <NA>   FALSE
#> 7     5 1999-08-03 Blue  NA    Yellow FALSE

Data

structure(list(Id = c(1, 2, 3, 3, 3, 4, 4, 5, 5), Date = structure(c(15225, 
14803, 9974, 9974, 9974, 18288, 18288, 10806, 10806), class = "Date"), 
    Col2 = c("Red", "Blue", "Yellow", "Black", "Green", "Red", 
    "Blue", "Red", "Blue"), Col3 = c(NA, NA, NA, NA, NA, NA, 
    NA, NA, NA), Col4 = c(NA, NA, NA, NA, "Green", NA, NA, NA, 
    "Yellow")), problems = structure(list(row = 7L, col = NA_character_, 
    expected = "5 columns", actual = "6 columns", file = "literal data"), row.names = c(NA, 
-1L), class = c("tbl_df", "tbl", "data.frame")), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
    cols = list(Id = structure(list(), class = c("collector_double", 
    "collector")), Date = structure(list(format = ""), class = c("collector_date", 
    "collector")), Col2 = structure(list(), class = c("collector_character", 
    "collector")), Col3 = structure(list(), class = c("collector_logical", 
    "collector")), Col4 = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1L), class = "col_spec"))

Leave a ReplyCancel reply