Get the latest time within the same date by group in R

July 12, 2022

I have a dataset

structure(list(country = c("US", "US", "US", "US", "US", "US",
                                        "UK", "UK"), ID = c(11, 11, 11, 11, 11, 11,
                                                                    12, 12), entry = c("2017-04-25 09:01:25", "2017-04-21 09:00:24", "2017-02-19 08:09:57", "2017-01-24 16:55:56", "2016-06-27 14:00:46",
                                                                                                    "2016-06-27 13:55:28", "2016-08-30 10:35:29", "2016-06-28 12:35:39"
                                                                    ), out = c("2017-04-21 09:41:00", "2017-04-21 09:41:00",
                                                                                           "2017-04-21 09:41:00", "2017-04-21 09:41:00", "2016-07-29 11:36:22",
                                                                                           "2016-07-29 11:36:12", NA, NA
                                                                    )), class = "data.frame", row.names = c(NA, -8L))

And the output is something like this

     country ID               entry                 out
1         US 11 2017-04-25 09:01:25 2017-04-21 09:41:00
2         US 11 2017-04-21 09:00:24 2017-04-21 09:41:00
3         US 11 2017-02-19 08:09:57 2017-04-21 09:41:00
4         US 11 2017-01-24 16:55:56 2017-04-21 09:41:00
5         US 11 2016-06-27 14:00:46 2016-07-29 11:36:22
6         US 11 2016-06-27 13:55:28 2016-07-29 11:36:12
7         UK 12 2016-08-30 10:35:29                <NA>
8         UK 12 2016-06-28 12:35:39                <NA>

What I would like to do is to remove duplicate dates for the same group of IDs and if there’s a same date but with a different h:m:s value, then take the latest date of it, so it would look something like this:

  country    ID entry               out                 unique_out     
   <chr>   <dbl> <chr>               <chr>               <chr>              
 1 US         11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
 2 US         11 2017-04-21 09:00:24 2017-04-21 09:41:00 NA                 
 3 US         11 2017-02-19 08:09:57 2017-04-21 09:41:00 NA                 
 4 US         11 2017-01-24 16:55:56 2017-04-21 09:41:00 NA                 
 5 US         11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
 6 US         11 2016-06-27 13:55:28 2016-07-29 11:36:12 NA
 7 UK         12 2016-08-30 10:35:29 NA                  NA                 
 8 UK         12 2016-06-28 12:35:39 NA                  NA

I have used this data %>% group_by(ID) %>% mutate(unique_out=replace(out, duplicated(out), NA))
to filter out the duplicates but cannot figure out how to keep the latest time of the same day within a group of ID.
Many thanks if you could help me with this!

>Solution :

Try this:

dat %>%
  mutate(
    across(c(entry, out), as.POSIXct),
    out_date = as.Date(out)
  ) %>%
  group_by(ID, out_date) %>%
  mutate(unique_out = if_else(row_number() %in% which.max(out), out, out[NA])) %>%
  ungroup()
# # A tibble: 8 x 6
#   country    ID entry               out                 out_date   unique_out         
#   <chr>   <dbl> <dttm>              <dttm>              <date>     <dttm>             
# 1 US         11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 2017-04-21 09:41:00
# 2 US         11 2017-04-21 09:00:24 2017-04-21 09:41:00 2017-04-21 NA                 
# 3 US         11 2017-02-19 08:09:57 2017-04-21 09:41:00 2017-04-21 NA                 
# 4 US         11 2017-01-24 16:55:56 2017-04-21 09:41:00 2017-04-21 NA                 
# 5 US         11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 2016-07-29 11:36:22
# 6 US         11 2016-06-27 13:55:28 2016-07-29 11:36:12 2016-07-29 NA                 
# 7 UK         12 2016-08-30 10:35:29 NA                  NA         NA                 
# 8 UK         12 2016-06-28 12:35:39 NA                  NA         NA

Walk-through:

in order to use which.max later, we need number-like timestamps, so we first convert to POSIXt timestamps; this step can be omitted if you need to keep your timestamps as strings, though you’ll need a few more steps (starting with out == max(out) instead of which.max, but then you need to de-duplicate it);
in order to get the latest time per day, it will help to group on the date (as well as ID), so we generate a date-only variant of out;
group by both date and ID, find which is the first of the max(out) by way of which.max