I have some duplicate IDs in my df, but I want only 1 row per ID. I cannot use unique() or distinct() because then some data would be erased, as the rows are not identical.
Please see this example:
# The style of df I have
df <- data.frame(IDs=c(1,1,2,3,4,4,4,5),
Intervention=c("Progesterone", "Stitch", NA, "Stitch", "Progesterone", "Stitch", "Pessary", "Progesterone"),
Other_data1= c(22,22,32,44,24,24,24,NA),
Other_data2=c("a","a","b","c","d","d","d","e"))
df
# IDs Intervention Other_data1 Other_data2
# 1 1 Progesterone 22 a
# 2 1 Stitch 22 a
# 3 2 <NA> 32 b
# 4 3 Stitch 44 c
# 5 4 Progesterone 24 d
# 6 4 Stitch 24 d
# 7 4 Pessary 24 d
# 8 5 Progesterone NA e
So if I used unique() I would lose the full information in the df$Intervention column.
Please could someone let me know how I can get the df into this format:
# The style of df I want
df_I_want <- data.frame(IDs=c(1,2,3,4,5),
Progesterone=c("Yes", NA, "No", "Yes", "Yes"),
Stitch=c("Yes", NA, "Yes", "Yes", "No"),
Pessary=c("No", NA, "No", "Yes", "No"),
Other_data1= c(22,32,44,24,NA),
Other_data2=c("a","b","c","d","e"))
df_I_want
# IDs Progesterone Stitch Pessary Other_data1 Other_data2
# 1 1 Yes Yes No 22 a
# 2 2 <NA> <NA> <NA> 32 b
# 3 3 No Yes No 44 c
# 4 4 Yes Yes Yes 24 d
# 5 5 Yes No No NA e
My real df contains thousands of rows x hundreds of columns, so I have many cases of df$Other_data so I cannot really manually type out excluding these rows when reshaping the df. But there is only 1 column where the data differs by the row, as in the above example with df$Intervention.
(I will upvote and tick the answer which works).
>Solution :
Here is another pivot_wider solution, but here I use mutate and case_when to identify their corresponding values under the newly expanded columns.
If all of the three newly expanded columns are NA, they should remain NA. Otherwise, treat NA as "No" and non-NA as "Yes".
Note that within across(), you should input the column position (or column names) of the newly expanded columns (e.g. Progesterone, Stitch and Pessary are newly created, and they are in position 4 to 6, therefore 4:6).
library(tidyverse)
df %>%
pivot_wider(names_from = Intervention, values_from = Intervention) %>%
select(-"NA") %>%
mutate(across(4:6, ~case_when(rowSums(is.na(across(4:6))) == 3 ~ NA_character_,
is.na(.x) ~ "No",
!is.na(.x) ~ "Yes")))
# A tibble: 5 × 6
IDs Other_data1 Other_data2 Progesterone Stitch Pessary
<dbl> <dbl> <chr> <chr> <chr> <chr>
1 1 22 a Yes Yes No
2 2 32 b NA NA NA
3 3 44 c No Yes No
4 4 24 d Yes Yes Yes
5 5 NA e Yes No No