How to identify and remove duplicates with multiple condition in R

October 31, 2023

I would like to identify and remove duplicates from my dataset. The issue is that I have two different criteria. First, the main criteria that I am using is the most recent time. However, not all duplicates have dates. On the cases dates are missing, I will have to check one by one to see which row I am going to keep. So, this is what I need:

Suppose I have the following data:

df <- data.frame(id = c("a", "a", "b","b", "c","c"),
                 date = c("2023-05-16", "2023-08-21", "2023-06-05", NA, "2023-05-07", "2023-05-20"))

library(lubridate)

df$date <- ymd(df$date)
class(df$date)

[1] "Date"

df

  id       date
1  a 2023-05-16
2  a 2023-08-21
3  b 2023-06-05
4  b       <NA>
5  c 2023-05-07
6  c 2023-05-20

First, I need to identify all duplicates that contain NA to compare the rows and see which row I will keep. I need a resulting dataframe similar to this, it contains the duplicate and the unique value:

  id       date
1  b 2023-06-05
2  b       <NA>

Notice that I have both rows so I can compare them. I am keeping the examples simple, but I am not sure on my actual data frame how many duplicates per id I have. Next, I will be left with the remaining dataframe:

  id       date
1  a 2023-05-16
2  a 2023-08-21
3  c 2023-05-07
4  c 2023-05-20

I need to keep only the most recent of the values, and I will end with something like this:

  id       date
1  a 2023-08-21
2  c 2023-05-20

Let me know if something isn’t clear and you need more clarification.

>Solution :

Part 1: Identify all duplicates that contain NA and keep them in a separate df for review:

#Ids of NA dates:
ids_na <- df |> filter(is.na(date)) |> pull(id)

#Dfs of duplicates with NA:
is_na <- df |> filter(id %in% ids_na)

is_na

# id       date
# b 2023-06-05
# b       <NA>

Part 2: keep most recent entry of duplicated values

# Remove dupes with NAs from df
clean_df <- df |> filter(!(id %in% ids_na))
# Get the most recent by grouping 
unique_entries <- clean_df |> group_by(id) |>  summarise(date = max(date))
unique_entries

# A tibble: 2 × 2
# id    date      
# <chr> <date>    
#   a     2023-08-21
#   c     2023-05-20