How to delete entire rows from a dataframe based on the date the data was collected?

Let’s say I have this example dataframe (but a lot bigger)

df = data.frame(ID_number = c(111,111,111,22,22,33,33),
                date = c('2021-06-14','2021-06-12','2021-03-11',
                         '2021-05-20','2021-05-14',
                         '2018-04-20','2017-03-14'),
                answers = 1:7,
                sex = c('F','M','F','M','M','M','F') )

The output

  ID_number       date  answers sex
1       111 2021-06-14       1   F
2       111 2021-06-12       2   M
3       111 2021-03-11       3   F
4        22 2021-05-20       4   M
5        22 2021-05-14       5   M
6        33 2018-04-20       6   M
7        33 2017-03-14       7   F

we can see that there are 7 different members, but the one who created the dataframe has made a mistake and assigned the same ID_number to members 1,2 and 3. The same ID_number to members 4 and 5 and so on …

In the dataframe there is the data of the collection of the data of each member and I wish to only keep the member that has the earliest date. The resulted dataframe would look like this

  ID_number       date   answers sex
1       111 2021-03-11       3   F
2        22 2021-05-14       5   M
3        33 2017-03-14       7   F

Appreciate the help.

>Solution :

With slice_min:

library(dplyr)
df %>% 
  group_by(ID_number) %>% 
  slice_min(date)

In the dev. version, you can use inline grouping with .by:

devtools::install_github("tidyverse/dplyr")
df %>% 
  slice_min(date, .by = ID_number)

Leave a Reply