I have a dataset that looks like this:
   Study_ID       Stage
1       100 Early Stage
2       100      Stable
3       200      Stable
4       300 Early Stage
5       400 Early Stage
6       400      Stable
7       500 Early Stage
8       500      Stable
9       600      Stable
10      700 Early Stage
For any Study_ID that appears more than once, I would like to keep only the row where the patient is ‘Stable’ and drop the row where the patient is ‘Early Stage’. Study IDs that appear only once should be kept as they are, whatever their stage.
My desired output would look something like this:
  Study_ID       Stage
1      100      Stable
2      200      Stable
3      300 Early Stage
4      400      Stable
5      500      Stable
6      600      Stable
7      700 Early Stage
How can I go about doing this?
Reproducible data:
data <- data.frame(Study_ID = c("100", "100", "200", "300", "400", "400", "500", "500", "600", "700"), Stage = c("Early Stage", "Stable", "Stable", "Early Stage", "Early Stage", "Stable", "Early Stage", "Stable", "Stable", "Early Stage"))
>Solution :
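You can use dplyr's filter() together with duplicated(). The condition keeps a row if it is the last occurrence of its Study_ID, or if its Stage is not "Early Stage". This works for your data because, within each duplicated Study_ID, the "Early Stage" row comes before the "Stable" row: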
data <- data.frame(Study_ID = c("100", "100", "200", "300", "400", "400", "500", "500", "600", "700"), Stage = c("Early Stage", "Stable", "Stable", "Early Stage", "Early Stage", "Stable", "Early Stage", "Stable", "Stable", "Early Stage"))
library(dplyr)
filter(data, !duplicated(Study_ID, fromLast = TRUE) | Stage != "Early Stage")
#>   Study_ID       Stage
#> 1      100      Stable
#> 2      200      Stable
#> 3      300 Early Stage
#> 4      400      Stable
#> 5      500      Stable
#> 6      600      Stable
#> 7      700 Early Stage
Created on 2022-06-30 by the reprex package (v2.0.1)
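If the rows are not guaranteed to be ordered with "Early Stage" before "Stable" within each Study_ID, a grouped filter may be more robust, since it does not depend on row order. The following is only a sketch: keep_stable() is a hypothetical helper name introduced here, and it assumes every duplicated Study_ID has exactly one "Stable" row (a duplicated ID with no "Stable" row would be dropped entirely, and several "Stable" rows would all be kept).

```r
library(dplyr)

data <- data.frame(
  Study_ID = c("100", "100", "200", "300", "400", "400", "500", "500", "600", "700"),
  Stage = c("Early Stage", "Stable", "Stable", "Early Stage", "Early Stage",
            "Stable", "Early Stage", "Stable", "Stable", "Early Stage")
)

# keep_stable() is a hypothetical helper name used only for this sketch.
keep_stable <- function(df) {
  df %>%
    group_by(Study_ID) %>%
    # Keep a row if its Study_ID occurs only once, or if it is a "Stable" row
    filter(n() == 1 | Stage == "Stable") %>%
    ungroup()
}

result <- keep_stable(data)

# The same rows are kept when the input order is reversed
result_rev <- keep_stable(data[nrow(data):1, ])
```

Here result holds the same seven rows as the output above, returned as a tibble, and result_rev holds those rows in reverse order.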