I have a df like this one:
| ID | FOLLOW-UP | RECURRENCE | RECURRENCE DATE |
|---|---|---|---|
| 15 | 1 | no | |
| 15 | 2 | yes | 2003-05-15 |
| 16 | 1 | no | |
| 29 | 1 | yes | 2002-06-10 |
| 29 | 2 | yes | 2002-12-16 |
| 34 | 1 | yes | 2009-11-15 |
| 34 | 2 | no | |
| 34 | 3 | yes | 2012-05-03 |
| 36 | 1 | no | |
| 36 | 2 | no |
With in the first column the patient identifier (ID), in the 2nd the follow-up number, in the 3rd column the occurrence of a recurrence (yes/no) and in the last column the date of recurrence (if recurrence = yes).
There are therefore repeated lines for several patients, if they have several follow-ups.
I’d like to obtain a df with only 1 line per patient.
- If the patient has had several recurrences, I’d like to keep the line with the oldest recurrence (date of first recurrence).
- If the patient has had several follow-ups with recurrence and without recurrence, I’d like to keep the line with the (oldest) recurrence.
- If the patient has several follow-ups without recurrence, I’d like to keep follow-up 1.
In the end, I’d like to get this df here:
| ID | FOLLOW_UP | RECURRENCE | RECURRENCE_DATE |
|---|---|---|---|
| 15 | 2 | yes | 2003-05-15 |
| 16 | 1 | no | |
| 29 | 1 | yes | 2002-06-10 |
| 34 | 1 | yes | 2009-11-15 |
| 36 | 1 | no |
Here’s a reproducible example :
df <- data.frame(ID = c("15", "15", "16","29", "29", "34","34", "34", "36","36"),
FOLLOW_UP = c("1", "2", "1","1", "2", "1","2", "3", "1","2"),
RECURRENCE = c("no", "yes", "no","yes", "yes", "yes","no", "yes", "no","no"),
RECURRENCE_DATE = c("", "2003-05-15", "","2002-06-10", "2002-12-16", "2009-11-15","", "2012-05-03", "","")
)
I thought of using the “group by” and “slice” functions but I don’t know how to apply them with conditions.
Thanks in advance
>Solution :
something like this:
df$RECURRENCE_DATE <- as.Date(df$RECURRENCE_DATE, format = "%Y-%m-%d")
df_filtered <- df %>%
group_by(ID) %>%
arrange(desc(RECURRENCE), RECURRENCE_DATE, FOLLOW_UP) %>%
slice_head(n = 1) %>%
ungroup()