Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Extract first digit (in the middle of a string) in pivot_longer: names_pattern

Need to extract the first digit of the following phrases
drink_type_1_Day_7 and drink_vol_1_Day_7 the digits will vary between 1-10

This is to facilitate a pivot longer move. In the absence of Day_7 I have used the expression (.*)(.*) to achieve this. When I construct a regex expression and place it into names_pattern such as \d+

I get the following error: regex should define 2 groups; 1 found.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Is one approach to create a regex that forms 2 groups
Group 1 contains the first digit
Group 2 contains everything else ?

If so any advice would be much appreciated

library(tidyr)


study_id <- c(1,2,3,4,5,6)
drink_type_1_Day_7 <-c(1,2,3,1,2,3)
drink_vol_1_Day_7 <- c(100,500,1000,100,500,750)
drink_type_2_Day_7 <-c(NA,3,2,1,NA,3)
drink_vol_2_Day_7 <- c(NA,125,300,200,NA,500)

drink_df  <- data.frame(study_id,drink_type_1_Day_7,drink_vol_1_Day_7,drink_type_2_Day_7,drink_vol_2_Day_7) 

drink_df <- drink_df %>% mutate_at(c('drink_type_1_Day_7','drink_type_2_Day_7'), as.factor) 

drink_df_pivot <- drink_df %>%
  pivot_longer(cols = starts_with("drink"), 
               names_to = c(".value", "drink"),
               names_pattern = "         ") %>% 
  dplyr::group_by(study_id, drink_type) %>% 
  dplyr:: summarise(drink_vol = sum(na_if(drink_vol,0), na.rm = TRUE), .groups = "drop") %>% pivot_wider(names_from = drink_type, values_from = drink_vol, 
              names_prefix = "drink_vol_", values_fill = 0) %>% 
  ungroup() %>%
  select(-contains("NA"))```

>Solution :

We could either create three columns from the substring or two columns by not capturing the inner digits (based on the summarisation step) in pivot_longer , by capturing the non-digit characters as first group (([^0-9]+)) followed by the _, capture the inner digits ((\\d+)), followed by _ and capture the rest of the characters for the ‘day’ column

library(dplyr)
library(tidyr)
drink_df %>% 
   pivot_longer(cols = starts_with("drink"),       
          names_to = c(".value", "digit", "day"),   
             names_pattern = "^([^0-9]+)_(\\d+)_(.*)",  
values_drop_na = TRUE) %>% 
  group_by(study_id, drink_type) %>%
 summarise(drink_vol = sum(drink_vol, na.rm = TRUE), .groups = 'drop')

-output

# A tibble: 8 × 3
  study_id drink_type drink_vol
     <dbl> <fct>          <dbl>
1        1 1                100
2        2 2                500
3        2 3                125
4        3 2                300
5        3 3               1000
6        4 1                300
7        5 2                500
8        6 3               1250
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading