Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

pivot_wider when names_from are duplicated

I need help pivoting data and then removing unnecessary rows. I ran a query in SQL that returned a rather large output (60 columns and 2 million rows). A lot of the rows are near duplicates, with the only unique value coming from a "diagnosis" column. I would like to pivot those to new columns and then remove the duplicates. The wrinkle is that my names_from column has duplicates. Here is an example

id <- c("1","1","1","2","2","2")
priority <- c("primary","secondary","tertiary","primary","primary","secondary")
diagnosis <- c("depression","anxiety","anorexia","depression","anxiety","ptsd")

data <- data.frame(id,priority,diagnosis)
id priority diagnosis
1 primary depression
1 secondary anxiety
1 tertiary anorexia
2 primary depression
2 primary anxiety
2 secondary PTSD

When I try to pivot_wider, person 2 has both depression and anxiety listed in "primary", but I want them in separate columns.

data2 <- pivot_wider(data,names_from=priority,values_from = diagnosis)

Here is what it is showing me:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

id primary secondary tertiary
1 depression anxiety anorexia
2 c("depression,"anxiety") ptsd NULL

So how can I get it to separate out depression and anxiety for person 2, ideally into something like primary_1, primary_2?

>Solution :

Create a sequence by ‘id’, ‘primary’ and then do the reshaping

library(dplyr)
library(stringr)
library(tidyr)
library(data.table)
data %>% 
  mutate(priority = str_c(priority, '_', rowid(id, priority))) %>% 
  pivot_wider(names_from = priority, values_from = diagnosis)

-output

# A tibble: 2 × 5
  id    primary_1  secondary_1 tertiary_1 primary_2
  <chr> <chr>      <chr>       <chr>      <chr>    
1 1     depression anxiety     anorexia   <NA>     
2 2     depression ptsd        <NA>       anxiety 

Or another option is to return the list and then use unnest_wider on the list columns

data %>%
   pivot_wider(names_from = priority, values_from = diagnosis,
    values_fn = list) %>%
   unnest_wider(where(is.list), names_sep = "_")

-output

# A tibble: 2 × 5
  id    primary_1  primary_2 secondary_1 tertiary_1
  <chr> <chr>      <chr>     <chr>       <chr>     
1 1     depression <NA>      anxiety     anorexia  
2 2     depression anxiety   ptsd        <NA>      

If we need to rename with 1, 2 only when the lengths are greater than 1, do the unnest separately after using unnest_wider

data %>%
   pivot_wider(names_from = priority, values_from = diagnosis,
    values_fn = list) %>% 
   unnest_wider(where(~ is.list(.x) && max(lengths(.x)) > 1),  
         names_sep = "_") %>% 
   unnest(where(is.list))

-output

# A tibble: 2 × 5
  id    primary_1  primary_2 secondary tertiary
  <chr> <chr>      <chr>     <chr>     <chr>   
1 1     depression <NA>      anxiety   anorexia
2 2     depression anxiety   ptsd      <NA>    

If the data is really big, then dcast may be more efficient

dcast(setDT(data), id ~ paste0(priority,  "_", rowid(id, priority)), 
   value.var = 'diagnosis')
Key: <id>
       id  primary_1 primary_2 secondary_1 tertiary_1
   <char>     <char>    <char>      <char>     <char>
1:      1 depression      <NA>     anxiety   anorexia
2:      2 depression   anxiety        ptsd       <NA>
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading