Home Merge rows in 1 df by ID and create new columns of differing data in R (in very large df)

Questions

Merge rows in 1 df by ID and create new columns of differing data in R (in very large df)

August 19, 2022

I have some duplicate IDs in my df, but I want only 1 row per ID. I cannot use unique() or distinct() because then some data would be erased, as the rows are not identical.

Please see this example:

# The style of df I have 
df <- data.frame(IDs=c(1,1,2,3,4,4,4,5),
                 Intervention=c("Progesterone", "Stitch", NA, "Stitch", "Progesterone", "Stitch", "Pessary", "Progesterone"),
                 Other_data1= c(22,22,32,44,24,24,24,NA), 
                 Other_data2=c("a","a","b","c","d","d","d","e"))
df
#   IDs Intervention Other_data1 Other_data2
# 1   1 Progesterone          22           a
# 2   1       Stitch          22           a
# 3   2         <NA>          32           b
# 4   3       Stitch          44           c
# 5   4 Progesterone          24           d
# 6   4       Stitch          24           d
# 7   4      Pessary          24           d
# 8   5 Progesterone          NA           e

So if I used unique() I would lose the full information in the df$Intervention column.

Please could someone let me know how I can get the df into this format:

# The style of df I want
df_I_want <- data.frame(IDs=c(1,2,3,4,5),
                        Progesterone=c("Yes", NA, "No", "Yes", "Yes"),
                        Stitch=c("Yes", NA, "Yes", "Yes", "No"),
                        Pessary=c("No", NA, "No", "Yes", "No"),
                        Other_data1= c(22,32,44,24,NA), 
                        Other_data2=c("a","b","c","d","e"))
df_I_want
#   IDs Progesterone Stitch Pessary Other_data1 Other_data2
# 1   1          Yes    Yes      No          22           a
# 2   2         <NA>   <NA>    <NA>          32           b
# 3   3           No    Yes      No          44           c
# 4   4          Yes    Yes     Yes          24           d
# 5   5          Yes     No      No          NA           e

My real df contains thousands of rows x hundreds of columns, so I have many cases of df$Other_data so I cannot really manually type out excluding these rows when reshaping the df. But there is only 1 column where the data differs by the row, as in the above example with df$Intervention.

(I will upvote and tick the answer which works).

>Solution :

Here is another pivot_wider solution, but here I use mutate and case_when to identify their corresponding values under the newly expanded columns.

If all of the three newly expanded columns are NA, they should remain NA. Otherwise, treat NA as "No" and non-NA as "Yes".

Note that within across(), you should input the column position (or column names) of the newly expanded columns (e.g. Progesterone, Stitch and Pessary are newly created, and they are in position 4 to 6, therefore 4:6).

library(tidyverse)

df %>% 
  pivot_wider(names_from = Intervention, values_from = Intervention) %>% 
  select(-"NA") %>% 
  mutate(across(4:6, ~case_when(rowSums(is.na(across(4:6))) == 3 ~ NA_character_,
                                is.na(.x) ~ "No",
                                !is.na(.x) ~ "Yes")))

# A tibble: 5 × 6
    IDs Other_data1 Other_data2 Progesterone Stitch Pessary
  <dbl>       <dbl> <chr>       <chr>        <chr>  <chr>  
1     1          22 a           Yes          Yes    No     
2     2          32 b           NA           NA     NA     
3     3          44 c           No           Yes    No     
4     4          24 d           Yes          Yes    Yes    
5     5          NA e           Yes          No     No