I have a dataset
structure(list(country = c("US", "US", "US", "US", "US", "US",
"UK", "UK"), ID = c(11, 11, 11, 11, 11, 11,
12, 12), entry = c("2017-04-25 09:01:25", "2017-04-21 09:00:24", "2017-02-19 08:09:57", "2017-01-24 16:55:56", "2016-06-27 14:00:46",
"2016-06-27 13:55:28", "2016-08-30 10:35:29", "2016-06-28 12:35:39"
), out = c("2017-04-21 09:41:00", "2017-04-21 09:41:00",
"2017-04-21 09:41:00", "2017-04-21 09:41:00", "2016-07-29 11:36:22",
"2016-07-29 11:36:12", NA, NA
)), class = "data.frame", row.names = c(NA, -8L))
And the output is something like this
country ID entry out
1 US 11 2017-04-25 09:01:25 2017-04-21 09:41:00
2 US 11 2017-04-21 09:00:24 2017-04-21 09:41:00
3 US 11 2017-02-19 08:09:57 2017-04-21 09:41:00
4 US 11 2017-01-24 16:55:56 2017-04-21 09:41:00
5 US 11 2016-06-27 14:00:46 2016-07-29 11:36:22
6 US 11 2016-06-27 13:55:28 2016-07-29 11:36:12
7 UK 12 2016-08-30 10:35:29 <NA>
8 UK 12 2016-06-28 12:35:39 <NA>
What I would like to do is to remove duplicate dates for the same group of IDs and if there’s a same date but with a different h:m:s value, then take the latest date of it, so it would look something like this:
country ID entry out unique_out
<chr> <dbl> <chr> <chr> <chr>
1 US 11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 09:41:00
2 US 11 2017-04-21 09:00:24 2017-04-21 09:41:00 NA
3 US 11 2017-02-19 08:09:57 2017-04-21 09:41:00 NA
4 US 11 2017-01-24 16:55:56 2017-04-21 09:41:00 NA
5 US 11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 11:36:22
6 US 11 2016-06-27 13:55:28 2016-07-29 11:36:12 NA
7 UK 12 2016-08-30 10:35:29 NA NA
8 UK 12 2016-06-28 12:35:39 NA NA
I have used this data %>% group_by(ID) %>% mutate(unique_out=replace(out, duplicated(out), NA))
to filter out the duplicates but cannot figure out how to keep the latest time of the same day within a group of ID.
Many thanks if you could help me with this!
>Solution :
Try this:
dat %>%
mutate(
across(c(entry, out), as.POSIXct),
out_date = as.Date(out)
) %>%
group_by(ID, out_date) %>%
mutate(unique_out = if_else(row_number() %in% which.max(out), out, out[NA])) %>%
ungroup()
# # A tibble: 8 x 6
# country ID entry out out_date unique_out
# <chr> <dbl> <dttm> <dttm> <date> <dttm>
# 1 US 11 2017-04-25 09:01:25 2017-04-21 09:41:00 2017-04-21 2017-04-21 09:41:00
# 2 US 11 2017-04-21 09:00:24 2017-04-21 09:41:00 2017-04-21 NA
# 3 US 11 2017-02-19 08:09:57 2017-04-21 09:41:00 2017-04-21 NA
# 4 US 11 2017-01-24 16:55:56 2017-04-21 09:41:00 2017-04-21 NA
# 5 US 11 2016-06-27 14:00:46 2016-07-29 11:36:22 2016-07-29 2016-07-29 11:36:22
# 6 US 11 2016-06-27 13:55:28 2016-07-29 11:36:12 2016-07-29 NA
# 7 UK 12 2016-08-30 10:35:29 NA NA NA
# 8 UK 12 2016-06-28 12:35:39 NA NA NA
Walk-through:
- in order to use
which.maxlater, we need number-like timestamps, so we first convert toPOSIXttimestamps; this step can be omitted if you need to keep your timestamps as strings, though you’ll need a few more steps (starting without == max(out)instead ofwhich.max, but then you need to de-duplicate it); - in order to get the latest time per day, it will help to group on the date (as well as
ID), so we generate a date-only variant ofout; - group by both date and ID, find which is the first of the
max(out)by way ofwhich.max