Remove distinct values based on condition

I have a dataset from which I am trying to remove duplicate values, but I need to retain the rows where a condition is met. It looks like this:

col1 col2
a    NA
a    1
b    1
c    1
d    1
d    2

If I just run the normal distinct functions, I retain only the first row of each set of duplicates:

col1 col2
a    NA
b    1
c    1
d    1

But I need to retain:

col1 col2
a    1
b    1
c    1
d    1

I have tried

df <- df %>% 
  group_by(col1) %>%
  top_n(1, col2)

But on a larger dataset this seems to remove extra rows that are not duplicates in col1. It appears to apply its own condition on col2 and removes rows beyond the duplicates.

In my real data, col1 holds serial numbers and col2 holds dates. I am trying to drop the NAs from col2 while keeping, for each serial number, the row with the max date when there are two date values (an older date and a newer one).

Solution:

We could group, arrange, and slice. arrange() sorts NA values last, so slice(1) keeps the smallest non-NA value in each group:

library(dplyr)

df %>% 
  group_by(col1) %>% 
  arrange(col2, .by_group = TRUE) %>% 
  slice(1)
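
Since the real data calls for the max date per serial number, here is a minimal sketch of both directions on the example data. The object names df, first_per_group and latest_per_group are only illustrative, and the integers stand in for dates. It relies on arrange() placing NA last even when wrapped in desc().

```r
library(dplyr)

# Example data from the question (integers stand in for dates)
df <- tibble(
  col1 = c("a", "a", "b", "c", "d", "d"),
  col2 = c(NA, 1L, 1L, 1L, 1L, 2L)
)

# Smallest non-NA value per group: matches the desired output in the question
first_per_group <- df %>%
  group_by(col1) %>%
  arrange(col2, .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()

# Largest non-NA value per group: the "max date" case.
# arrange() still sorts NA last when the column is wrapped in desc().
latest_per_group <- df %>%
  group_by(col1) %>%
  arrange(desc(col2), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()
```

With this data the second variant keeps d 2 instead of d 1. A group whose only value is NA keeps its NA row in both variants; add filter(!is.na(col2)) if those rows should be dropped.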

For this example only, add_count gives the same result. It counts how often each col2 value occurs and keeps the rows whose value occurs more than once:

library(dplyr)
df %>%
  add_count(col2) %>% 
  filter(n!=1) %>%
  select(-n)

  col1   col2
  <chr> <int>
1 a         1
2 b         1
3 c         1
4 d         1
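
One possible explanation for the extra rows that top_n removed in the larger dataset: as far as I can tell, top_n ranks with min_rank(), which returns NA for NA values, and the filter then drops those rows. A serial number whose only date is NA would disappear entirely. The sketch below assumes a hypothetical extra row e with only NA to show this, and also shows why the add_count variant does not generalise.

```r
library(dplyr)

# Question data plus a hypothetical group "e" whose only value is NA
df2 <- tibble(
  col1 = c("a", "a", "b", "c", "d", "d", "e"),
  col2 = c(NA, 1L, 1L, 1L, 1L, 2L, NA)
)

# top_n: the NA-only group "e" is dropped completely
with_top_n <- df2 %>%
  group_by(col1) %>%
  top_n(1, col2) %>%
  ungroup()

# arrange + slice: every col1 value keeps exactly one row, "e" keeps its NA
with_slice <- df2 %>%
  group_by(col1) %>%
  arrange(col2, .by_group = TRUE) %>%
  slice(1) %>%
  ungroup()

# add_count: NA now occurs twice in col2, so both NA rows survive the filter
with_add_count <- df2 %>%
  add_count(col2) %>%
  filter(n != 1) %>%
  select(-n)
```

Here with_top_n has no row for e, with_slice has one row per col1 value, and with_add_count still contains a twice, so the add_count approach only works when the unwanted values happen to be unique in col2.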