Need to extract the first digit of the following phrases
drink_type_1_Day_7 and drink_vol_1_Day_7 the digits will vary between 1-10
This is to facilitate a pivot longer move. In the absence of Day_7 I have used the expression (.*)(.*) to achieve this. When I construct a regex expression and place it into names_pattern such as \d+
I get the following error: regex should define 2 groups; 1 found.
Is one approach to create a regex that forms 2 groups
Group 1 contains the first digit
Group 2 contains everything else ?
If so any advice would be much appreciated
library(tidyr)
study_id <- c(1,2,3,4,5,6)
drink_type_1_Day_7 <-c(1,2,3,1,2,3)
drink_vol_1_Day_7 <- c(100,500,1000,100,500,750)
drink_type_2_Day_7 <-c(NA,3,2,1,NA,3)
drink_vol_2_Day_7 <- c(NA,125,300,200,NA,500)
drink_df <- data.frame(study_id,drink_type_1_Day_7,drink_vol_1_Day_7,drink_type_2_Day_7,drink_vol_2_Day_7)
drink_df <- drink_df %>% mutate_at(c('drink_type_1_Day_7','drink_type_2_Day_7'), as.factor)
drink_df_pivot <- drink_df %>%
pivot_longer(cols = starts_with("drink"),
names_to = c(".value", "drink"),
names_pattern = " ") %>%
dplyr::group_by(study_id, drink_type) %>%
dplyr:: summarise(drink_vol = sum(na_if(drink_vol,0), na.rm = TRUE), .groups = "drop") %>% pivot_wider(names_from = drink_type, values_from = drink_vol,
names_prefix = "drink_vol_", values_fill = 0) %>%
ungroup() %>%
select(-contains("NA"))```
>Solution :
We could either create three columns from the substring or two columns by not capturing the inner digits (based on the summarisation step) in pivot_longer , by capturing the non-digit characters as first group (([^0-9]+)) followed by the _, capture the inner digits ((\\d+)), followed by _ and capture the rest of the characters for the ‘day’ column
library(dplyr)
library(tidyr)
drink_df %>%
pivot_longer(cols = starts_with("drink"),
names_to = c(".value", "digit", "day"),
names_pattern = "^([^0-9]+)_(\\d+)_(.*)",
values_drop_na = TRUE) %>%
group_by(study_id, drink_type) %>%
summarise(drink_vol = sum(drink_vol, na.rm = TRUE), .groups = 'drop')
-output
# A tibble: 8 × 3
study_id drink_type drink_vol
<dbl> <fct> <dbl>
1 1 1 100
2 2 2 500
3 2 3 125
4 3 2 300
5 3 3 1000
6 4 1 300
7 5 2 500
8 6 3 1250