Home Extract first digit (in the middle of a string) in pivot_longer: names_pattern

Questions

Extract first digit (in the middle of a string) in pivot_longer: names_pattern

April 27, 2023

Need to extract the first digit of the following phrases
drink_type_1_Day_7 and drink_vol_1_Day_7 the digits will vary between 1-10

This is to facilitate a pivot longer move. In the absence of Day_7 I have used the expression (.*)(.*) to achieve this. When I construct a regex expression and place it into names_pattern such as \d+

I get the following error: regex should define 2 groups; 1 found.

Is one approach to create a regex that forms 2 groups
Group 1 contains the first digit
Group 2 contains everything else ?

If so any advice would be much appreciated

library(tidyr)


study_id <- c(1,2,3,4,5,6)
drink_type_1_Day_7 <-c(1,2,3,1,2,3)
drink_vol_1_Day_7 <- c(100,500,1000,100,500,750)
drink_type_2_Day_7 <-c(NA,3,2,1,NA,3)
drink_vol_2_Day_7 <- c(NA,125,300,200,NA,500)

drink_df  <- data.frame(study_id,drink_type_1_Day_7,drink_vol_1_Day_7,drink_type_2_Day_7,drink_vol_2_Day_7) 

drink_df <- drink_df %>% mutate_at(c('drink_type_1_Day_7','drink_type_2_Day_7'), as.factor) 

drink_df_pivot <- drink_df %>%
  pivot_longer(cols = starts_with("drink"), 
               names_to = c(".value", "drink"),
               names_pattern = "         ") %>% 
  dplyr::group_by(study_id, drink_type) %>% 
  dplyr:: summarise(drink_vol = sum(na_if(drink_vol,0), na.rm = TRUE), .groups = "drop") %>% pivot_wider(names_from = drink_type, values_from = drink_vol, 
              names_prefix = "drink_vol_", values_fill = 0) %>% 
  ungroup() %>%
  select(-contains("NA"))```

>Solution :

We could either create three columns from the substring or two columns by not capturing the inner digits (based on the summarisation step) in pivot_longer , by capturing the non-digit characters as first group (([^0-9]+)) followed by the _, capture the inner digits ((\\d+)), followed by _ and capture the rest of the characters for the ‘day’ column

library(dplyr)
library(tidyr)
drink_df %>% 
   pivot_longer(cols = starts_with("drink"),       
          names_to = c(".value", "digit", "day"),   
             names_pattern = "^([^0-9]+)_(\\d+)_(.*)",  
values_drop_na = TRUE) %>% 
  group_by(study_id, drink_type) %>%
 summarise(drink_vol = sum(drink_vol, na.rm = TRUE), .groups = 'drop')

-output

# A tibble: 8 × 3
  study_id drink_type drink_vol
     <dbl> <fct>          <dbl>
1        1 1                100
2        2 2                500
3        2 3                125
4        3 2                300
5        3 3               1000
6        4 1                300
7        5 2                500
8        6 3               1250