Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

subsetting the first row of repeat data based on multiple conditions in R

I have repeat rows of ethnicity data of random people with dates on when the ethnicity was assigned. I would like to assign each person only one ethnicity category (white, asian, black, other, mixed) based on these conditions: (1) if the person has multiple ethnicity, then the most common one (highest ethnicity_n) will be assigned. (2) if the patient has multiple ethnicity in equal numbers (e.g. 1 mixed, 1 Asian, 1 Other) then the most recent one will be assigned. I organized my data in a way so that I have a count of each ethnicity per patient and I wrote code to make the date of ethnicity in descending order. However, when I run code to get the first row of my organized table I end up with random ethnicity assigned for each person.

person ethnicity ethnicity_n ethnicity_date
1 white 4 04/09/2019
1 white 4 04/09/2018
1 white 4 04/09/2017
1 white 4 04/09/2016
1 other 1 04/09/2015
2 asian 1 04/09/2019
2 other 1 04/09/2018
2 mixed 1 04/09/2017
3 black 2 04/09/2016
3 black 2 04/09/2015

I used this code to make the table above

df %>%
  group_by(person,ethnicity_n,ethnicity_date) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date)) 

I want the final table to look like this

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

person ethnicity ethnicity_n ethnicity_date
1 white 4 04/09/2019
2 asian 1 04/09/2019
3 black 2 04/09/2016

I tried all of these codes to get the second table but each time the ethnicity is not meeting the conditions that i want

df %>%
  group_by(person,ethnicity_n,ethnicity_date) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date)) %>% 
  slice(1L)

df %>%
  group_by(person) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date)) %>% 
  slice(1L)

df %>%
  group_by(person,ethnicity_n,ethnicity_date) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date)) %>% 
  filter(row_number()==1)

df %>%
  group_by(person) %>%
  arrange(person,ethnicity_n,desc(ethnicity_date)) %>% 
  filter(row_number()==1)

Data:

df <- structure(list(person = c(
    1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L,
    3L
), ethnicity = c(
    "white", "white", "white", "white", "other",
    "asian", "other", "mixed", "black", "black"
), ethnicity_n = c(
    4L,
    4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L
), ethnicity_date = c(
    "04/09/2019",
    "04/09/2018", "04/09/2017", "04/09/2016", "04/09/2015", "04/09/2019",
    "04/09/2018", "04/09/2017", "04/09/2016", "04/09/2015"
)), class = "data.frame", row.names = c(
    NA,
    -10L
))

>Solution :

Your main problem was that your ethnicity_date was a character vector, rather than a Date.

I have assumed it’s in month-day-year format but if it’s day-month-year format you can just change format = "%m/%d/%Y" to format = "%d/%m/%Y".

Note that as we are using max() functions for ethnicity_date and ethnicity_n, it is not necessary to arrange() the data in order first. You will need to supply na.rm=TRUE to the max() function if you have any NA values in either column.

df |>
    mutate(
        ethnicity_date = as.Date(
            ethnicity_date,
            format = "%m/%d/%Y"
        )
    ) |>
    group_by(person) |>
    filter(
        ethnicity_n == max(ethnicity_n)
    ) |>
    filter(
        ethnicity_date == max(ethnicity_date)
    ) |>
    slice(1L) # in case there are still ties

# # A tibble: 3 x 4
# # Groups:   person [3]
#   person ethnicity ethnicity_n ethnicity_date
#    <int> <chr>           <int> <date>
# 1      1 white               4 2019-04-09
# 2      2 asian               1 2019-04-09
# 3      3 black               2 2016-04-09 

I kept your slice(1L) at the end in case there is an individual with more than one row with the same ethnicity_n and ethnicity_date, but you can remove it if you want to keep both rows in that case.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading