subsetting the first row of repeat data based on multiple conditions in R

October 4, 2022

I have repeat rows of ethnicity data of random people with dates on when the ethnicity was assigned. I would like to assign each person only one ethnicity category (white, asian, black, other, mixed) based on these conditions: (1) if the person has multiple ethnicity, then the most common one (highest ethnicity_n) will be assigned. (2) if the patient has multiple ethnicity in equal numbers (e.g. 1 mixed, 1 Asian, 1 Other) then the most recent one will be assigned. I organized my data in a way so that I have a count of each ethnicity per patient and I wrote code to make the date of ethnicity in descending order. However, when I run code to get the first row of my organized table I end up with random ethnicity assigned for each person.

person	ethnicity	ethnicity_n	ethnicity_date
1	white	4	04/09/2019
1	white	4	04/09/2018
1	white	4	04/09/2017
1	white	4	04/09/2016
1	other	1	04/09/2015
2	asian	1	04/09/2019
2	other	1	04/09/2018
2	mixed	1	04/09/2017
3	black	2	04/09/2016
3	black	2	04/09/2015

I used this code to make the table above

df %>%
  group_by(person,ethnicity_n,ethnicity_date) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date))

I want the final table to look like this

person	ethnicity	ethnicity_n	ethnicity_date
1	white	4	04/09/2019
2	asian	1	04/09/2019
3	black	2	04/09/2016

I tried all of these codes to get the second table but each time the ethnicity is not meeting the conditions that i want

df %>%
  group_by(person,ethnicity_n,ethnicity_date) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date)) %>% 
  slice(1L)

df %>%
  group_by(person) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date)) %>% 
  slice(1L)

df %>%
  group_by(person,ethnicity_n,ethnicity_date) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date)) %>% 
  filter(row_number()==1)

df %>%
  group_by(person) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date)) %>% 
  filter(row_number()==1)

Data:

df <- structure(list(person = c(
    1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L,
    3L
), ethnicity = c(
    "white", "white", "white", "white", "other",
    "asian", "other", "mixed", "black", "black"
), ethnicity_n = c(
    4L,
    4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L
), ethnicity_date = c(
    "04/09/2019",
    "04/09/2018", "04/09/2017", "04/09/2016", "04/09/2015", "04/09/2019",
    "04/09/2018", "04/09/2017", "04/09/2016", "04/09/2015"
)), class = "data.frame", row.names = c(
    NA,
    -10L
))

>Solution :

Your main problem was that your ethnicity_date was a character vector, rather than a Date.

I have assumed it’s in month-day-year format but if it’s day-month-year format you can just change format = "%m/%d/%Y" to format = "%d/%m/%Y".

Note that as we are using max() functions for ethnicity_date and ethnicity_n, it is not necessary to arrange() the data in order first. You will need to supply na.rm=TRUE to the max() function if you have any NA values in either column.

df |>
    mutate(
        ethnicity_date = as.Date(
            ethnicity_date,
            format = "%m/%d/%Y"
        )
    ) |>
    group_by(person) |>
    filter(
        ethnicity_n == max(ethnicity_n)
    ) |>
    filter(
        ethnicity_date == max(ethnicity_date)
    ) |>
    slice(1L) # in case there are still ties

# # A tibble: 3 x 4
# # Groups:   person [3]
#   person ethnicity ethnicity_n ethnicity_date
#    <int> <chr>           <int> <date>
# 1      1 white               4 2019-04-09
# 2      2 asian               1 2019-04-09
# 3      3 black               2 2016-04-09

I kept your slice(1L) at the end in case there is an individual with more than one row with the same ethnicity_n and ethnicity_date, but you can remove it if you want to keep both rows in that case.