I would like to identify and remove duplicates from my dataset. The issue is that I have two different criteria. First, the main criterion I am using is the most recent date. However, not all duplicates have dates. In the cases where dates are missing, I will have to check row by row to decide which one to keep. So, this is what I need:
Suppose I have the following data:
df <- data.frame(id = c("a", "a", "b","b", "c","c"),
date = c("2023-05-16", "2023-08-21", "2023-06-05", NA, "2023-05-07", "2023-05-20"))
library(lubridate)
df$date <- ymd(df$date)
class(df$date)
[1] "Date"
df
id date
1 a 2023-05-16
2 a 2023-08-21
3 b 2023-06-05
4 b <NA>
5 c 2023-05-07
6 c 2023-05-20
First, I need to identify all duplicates that contain NA so I can compare the rows and decide which one to keep. I need a resulting dataframe similar to this, containing both the NA row and the dated row:
id date
1 b 2023-06-05
2 b <NA>
Notice that I have both rows so I can compare them. I am keeping the example simple, but I am not sure how many duplicates per id my actual data frame has. Next, I will be left with the remaining dataframe:
id date
1 a 2023-05-16
2 a 2023-08-21
3 c 2023-05-07
4 c 2023-05-20
I need to keep only the most recent of the values, and I will end up with something like this:
id date
1 a 2023-08-21
2 c 2023-05-20
Let me know if something isn’t clear and you need more clarification.
>Solution :
Part 1: Identify all duplicates that contain NA and keep them in a separate df for review:
library(dplyr)

# ids with NA dates:
ids_na <- df |> filter(is.na(date)) |> pull(id)

# All rows belonging to those ids (the duplicates with NA):
is_na <- df |> filter(id %in% ids_na)
is_na
#   id       date
# 1  b 2023-06-05
# 2  b       <NA>
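If dplyr is not available, the same review step can be done in base R. A minimal sketch, rebuilding the question's example data with `as.Date()` instead of `lubridate::ymd()`:

```r
# Example data (as in the question), using base R only
df <- data.frame(id = c("a", "a", "b", "b", "c", "c"),
                 date = as.Date(c("2023-05-16", "2023-08-21", "2023-06-05",
                                  NA, "2023-05-07", "2023-05-20")))

# ids with at least one missing date
ids_na <- unique(df$id[is.na(df$date)])

# all rows belonging to those ids, set aside for manual review
is_na <- df[df$id %in% ids_na, ]
```

This relies only on logical subsetting, so it works on any R installation.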
Part 2: keep the most recent entry of the duplicated values:
# Remove the ids with NAs from df
clean_df <- df |> filter(!(id %in% ids_na))

# Keep the most recent date per id by grouping
unique_entries <- clean_df |> group_by(id) |> summarise(date = max(date))
unique_entries
# A tibble: 2 × 2
#   id    date
#   <chr> <date>
# 1 a     2023-08-21
# 2 c     2023-05-20
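The two parts can also be combined into one pipeline. A sketch of an alternative approach: `slice_max()` keeps the whole winning row (useful if your real data has more columns than just id and date), and `anti_join()` drops the ids that were set aside for review.

```r
library(dplyr)
library(lubridate)

# Example data from the question
df <- data.frame(id = c("a", "a", "b", "b", "c", "c"),
                 date = ymd(c("2023-05-16", "2023-08-21", "2023-06-05",
                              NA, "2023-05-07", "2023-05-20")))

# Rows whose id has at least one NA date, set aside for manual review
needs_review <- df |>
  group_by(id) |>
  filter(any(is.na(date))) |>
  ungroup()

# Most recent row per remaining id; with_ties = FALSE guarantees one row per id
final <- df |>
  anti_join(needs_review, by = "id") |>
  group_by(id) |>
  slice_max(date, n = 1, with_ties = FALSE) |>
  ungroup()
```

Note that `with_ties = FALSE` picks an arbitrary row among exact date ties, so drop it if you want to see ties and resolve them yourself.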