pivot_longer: pattern based on column names as input

March 13, 2022

I have a wide data frame with columns of the following pattern: a1_var1, a2_var1, a3_var1, a1_var2, a2_var2, a3_var2, etc. I want to pivot this to longer, creating a new variable named a that contains the values of the number behind the "a" (1, 2, 3) and for all values of ax_var1 to be contained in the column var1 (likewise for var2).

Consider this example data:

df <- data.frame(`id` = seq(1:3),
                 a1_var1 = c(111, 211, 311),
                 a2_var1 = c(121, 221, 321),
                 a3_var1 = c(131, 231, 331),
                 a1_var2 = c(112, 212, 312),
                 a2_var2 = c(122, 222, 322),
                 a3_var2 = c(132, 232, 332)) # first number is ID, second number is "a[x]" value, third number is var[x]

I would now like to transform the data to long format with the following columns: id (not unique per row anymore), a (containing the number from which the variable was taken, e.g. a1_var2: a = 1), var1 (containing all values from ax_var1 columns, i.e. x11, x21, x31), and var2 (containing all values from ax_var2 columns, i.e. x12, x22, x32).

I want to achieve the following structure:

id   a   var1 var2
1    1    111  112
1    2    121  122
1    3    131  132
2    1    211  212
2    2    221  222
2    3    231  232
3    1    311  312
3    2    321  322
3    3    331  332

So far, I’m hard-coding this for every varx like this:

df %>% select(-c(ends_with("var2"))) %>% 
  pivot_longer(cols = c("a1_var1", "a2_var1", "a3_var1"), names_to = "a", values_to = "var1") %>% 
  mutate(a=str_extract(a, "a\\d"),
         a=str_extract(a, "\\d"))

but since I have many varx columns, that’s kind of bothersome – can someone point out a better way of achieving the above outcome?

>Solution :

We can do this in pivot_longer itself – i.e. specify the names_toas c("a", ".value") where ‘a’ will be column name of the prefix substring from the columns to be reshaped and .value represents the column values. In the names_pattern, capture the substring of column name i.e. digits (\\d+) after the ‘a’ and the second capture group after the _

library(dplyr)
library(tidyr)
df %>% 
  pivot_longer(cols = -id, names_to = c("a", ".value"), 
      names_pattern = "a(\\d+)_(.*)")

-output

# A tibble: 9 × 4
     id a      var1  var2
  <int> <chr> <dbl> <dbl>
1     1 1       111   112
2     1 2       121   122
3     1 3       131   132
4     2 1       211   212
5     2 2       221   222
6     2 3       231   232
7     3 1       311   312
8     3 2       321   322
9     3 3       331   332