I have data with columns "ID" and "value" in which ID might be repeated. I would like to find all rows which have duplicate IDs and just keep the one with the higher value.
mydf <- data.frame(ID = c(1,2,2,3,4), value = c(5, 8, 20, 18,15))
I am working with dplyr. So far I can find the duplicates:
find_dup <- function(dataset, var) {
  dataset %>%
    group_by({{ var }}) %>%
    filter(n() > 1) %>%
    ungroup() %>%
    arrange({{ var }})
}
find_dup(mydf, ID)
But I am having trouble with the next step: I am not sure how to "point to" the larger value within each group of duplicates. I am hoping to stay with a tidyverse solution for now if possible. Any thoughts welcome, thanks!
Solution:
Rather than specifically identifying and removing duplicates, you can group_by ID and use slice_max to keep the row with the top value in each group.
library(dplyr)
mydf <- data.frame(ID = c(1, 2, 2, 3, 4), value = c(5, 8, 20, 18, 15))
mydf %>%
  group_by(ID) %>%
  slice_max(value, n = 1) %>%
  ungroup()
#> # A tibble: 4 x 2
#> ID value
#> <dbl> <dbl>
#> 1 1 5
#> 2 2 20
#> 3 3 18
#> 4 4 15
Created on 2023-08-07 with reprex v2.0.2
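Since the question already wraps its logic in a function using the {{ }} embrace operator, the same pattern works for the keep-the-max step. The sketch below defines a hypothetical helper, keep_max (a name assumed here, not from the original), and passes with_ties = FALSE because slice_max keeps all tied rows by default, which would retain duplicates if two rows shared the maximum value:

```r
library(dplyr)

# Hypothetical helper: keep the row with the largest `val` for each `id`.
# with_ties = FALSE guarantees exactly one row per group, even when
# several rows tie for the maximum.
keep_max <- function(dataset, id, val) {
  dataset %>%
    group_by({{ id }}) %>%
    slice_max({{ val }}, n = 1, with_ties = FALSE) %>%
    ungroup()
}

mydf <- data.frame(ID = c(1, 2, 2, 3, 4), value = c(5, 8, 20, 18, 15))
result <- keep_max(mydf, ID, value)
result
```

For ID 2, only the row with value 20 survives; the other IDs are untouched, so the result has four rows.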