Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to identify and remove duplicates with multiple condition in R

I would like to identify and remove duplicates from my dataset. The issue is that I have two different criteria. First, the main criteria that I am using is the most recent time. However, not all duplicates have dates. On the cases dates are missing, I will have to check one by one to see which row I am going to keep. So, this is what I need:

Suppose I have the following data:

df <- data.frame(id = c("a", "a", "b","b", "c","c"),
                 date = c("2023-05-16", "2023-08-21", "2023-06-05", NA, "2023-05-07", "2023-05-20"))

library(lubridate)

df$date <- ymd(df$date)
class(df$date)

[1] "Date"

df

  id       date
1  a 2023-05-16
2  a 2023-08-21
3  b 2023-06-05
4  b       <NA>
5  c 2023-05-07
6  c 2023-05-20

First, I need to identify all duplicates that contain NA to compare the rows and see which row I will keep. I need a resulting dataframe similar to this, it contains the duplicate and the unique value:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

  id       date
1  b 2023-06-05
2  b       <NA>

Notice that I have both rows so I can compare them. I am keeping the examples simple, but I am not sure on my actual data frame how many duplicates per id I have. Next, I will be left with the remaining dataframe:

  id       date
1  a 2023-05-16
2  a 2023-08-21
3  c 2023-05-07
4  c 2023-05-20

I need to keep only the most recent of the values, and I will end with something like this:

  id       date
1  a 2023-08-21
2  c 2023-05-20

Let me know if something isn’t clear and you need more clarification.

>Solution :

Part 1: Identify all duplicates that contain NA and keep them in a separate df for review:

#Ids of NA dates:
ids_na <- df |> filter(is.na(date)) |> pull(id)

#Dfs of duplicates with NA:
is_na <- df |> filter(id %in% ids_na)

is_na

# id       date
# b 2023-06-05
# b       <NA>

Part 2: keep most recent entry of duplicated values

# Remove dupes with NAs from df
clean_df <- df |> filter(!(id %in% ids_na))
# Get the most recent by grouping 
unique_entries <- clean_df |> group_by(id) |>  summarise(date = max(date))
unique_entries

# A tibble: 2 × 2
# id    date      
# <chr> <date>    
#   a     2023-08-21
#   c     2023-05-20
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading