So I have a data frame such as this, containing species names, IDs, markers, and size of DNA sequences:
Species | ID| marker| size
-----------------------------------------------------
Tilapia guineensis | 1| co1 | 400
Tilapia guineensis | 1| co2 | 300
Tilapia guineensis | 1| co2 | 700
Tilapia guineensis | 1| co2 | 900
Tilapia zillii | 2| co5 | 600
Tilapia zillii | 2| coi8| 200
Tilapia zillii | 2| coi8| 500
Eutrigla gurnardus | 5| co1 | 100
Eutrigla gurnardus | 5| co2 | 200
Sprattus sprattus | 6| co3 | 300
Sprattus sprattus | 6| co4 | 400
Sardinia pichardus | 7| co1 | 800
Sardinia pichardus | 7| co2 | 800
I would like to keep only one row for each ID, but I want to keep the row that has the highest value in the size column, regardless of the species and marker columns. If the rows have equal ID and equal size, I would like to just keep one randomly.
My output would be:
Species | ID| marker| size
-----------------------------------------------------
Tilapia guineensis | 1| co2 | 900
Tilapia zillii | 2| co5 | 600
Eutrigla gurnardus | 5| co2 | 200
Sprattus sprattus | 6| co4 | 400
Sardinia pichardus | 7| co2 | 800
>Solution :
A possible solution:
library(dplyr)
df %>%
group_by(ID) %>%
slice_max(size)
#> # A tibble: 4 × 4
#> # Groups: ID [4]
#> Species ID marker size
#> <chr> <int> <chr> <int>
#> 1 Tilapia guineensis 1 co2 900
#> 2 Tilapia zillii 2 co5 600
#> 3 Eutrigla gurnardus 5 co2 200
#> 4 Sprattus sprattus 6 co4 400