I have repeat rows of ethnicity data of random people with dates on when the ethnicity was assigned. I would like to assign each person only one ethnicity category (white, asian, black, other, mixed) based on these conditions: (1) if the person has multiple ethnicity, then the most common one (highest ethnicity_n) will be assigned. (2) if the patient has multiple ethnicity in equal numbers (e.g. 1 mixed, 1 Asian, 1 Other) then the most recent one will be assigned. I organized my data in a way so that I have a count of each ethnicity per patient and I wrote code to make the date of ethnicity in descending order. However, when I run code to get the first row of my organized table I end up with random ethnicity assigned for each person.
| person | ethnicity | ethnicity_n | ethnicity_date |
|---|---|---|---|
| 1 | white | 4 | 04/09/2019 |
| 1 | white | 4 | 04/09/2018 |
| 1 | white | 4 | 04/09/2017 |
| 1 | white | 4 | 04/09/2016 |
| 1 | other | 1 | 04/09/2015 |
| 2 | asian | 1 | 04/09/2019 |
| 2 | other | 1 | 04/09/2018 |
| 2 | mixed | 1 | 04/09/2017 |
| 3 | black | 2 | 04/09/2016 |
| 3 | black | 2 | 04/09/2015 |
I used this code to make the table above
df %>%
group_by(person,ethnicity_n,ethnicity_date) %>%
arrange(person,ethnicity_n,desc(ethnicity_date))
I want the final table to look like this
| person | ethnicity | ethnicity_n | ethnicity_date |
|---|---|---|---|
| 1 | white | 4 | 04/09/2019 |
| 2 | asian | 1 | 04/09/2019 |
| 3 | black | 2 | 04/09/2016 |
I tried all of these codes to get the second table but each time the ethnicity is not meeting the conditions that i want
df %>%
group_by(person,ethnicity_n,ethnicity_date) %>%
arrange(person,ethnicity_n,desc(ethnicity_date)) %>%
slice(1L)
df %>%
group_by(person) %>%
arrange(person,ethnicity_n,desc(ethnicity_date)) %>%
slice(1L)
df %>%
group_by(person,ethnicity_n,ethnicity_date) %>%
arrange(person,ethnicity_n,desc(ethnicity_date)) %>%
filter(row_number()==1)
df %>%
group_by(person) %>%
arrange(person,ethnicity_n,desc(ethnicity_date)) %>%
filter(row_number()==1)
Data:
df <- structure(list(person = c(
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L
), ethnicity = c(
"white", "white", "white", "white", "other",
"asian", "other", "mixed", "black", "black"
), ethnicity_n = c(
4L,
4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L
), ethnicity_date = c(
"04/09/2019",
"04/09/2018", "04/09/2017", "04/09/2016", "04/09/2015", "04/09/2019",
"04/09/2018", "04/09/2017", "04/09/2016", "04/09/2015"
)), class = "data.frame", row.names = c(
NA,
-10L
))
>Solution :
Your main problem was that your ethnicity_date was a character vector, rather than a Date.
I have assumed it’s in month-day-year format but if it’s day-month-year format you can just change format = "%m/%d/%Y" to format = "%d/%m/%Y".
Note that as we are using max() functions for ethnicity_date and ethnicity_n, it is not necessary to arrange() the data in order first. You will need to supply na.rm=TRUE to the max() function if you have any NA values in either column.
df |>
mutate(
ethnicity_date = as.Date(
ethnicity_date,
format = "%m/%d/%Y"
)
) |>
group_by(person) |>
filter(
ethnicity_n == max(ethnicity_n)
) |>
filter(
ethnicity_date == max(ethnicity_date)
) |>
slice(1L) # in case there are still ties
# # A tibble: 3 x 4
# # Groups: person [3]
# person ethnicity ethnicity_n ethnicity_date
# <int> <chr> <int> <date>
# 1 1 white 4 2019-04-09
# 2 2 asian 1 2019-04-09
# 3 3 black 2 2016-04-09
I kept your slice(1L) at the end in case there is an individual with more than one row with the same ethnicity_n and ethnicity_date, but you can remove it if you want to keep both rows in that case.