Is the `names_sep` argument in `tidyr::pivot_longer` flexible on string splitting?

May 18, 2022

I have some random effect coefficients extracted from an R model object. For a random intercept, they look like this:

xx <- data.frame(
    `Estimate.Intercept` = c(-0.1, -0.2), 
    `Est.Error.Intercept` = c(0.7, 0.8), 
    `Q5.Intercept` = c(-1.5, -1.4), 
    `Q95.Intercept` = c(0.7, 0.8)
)

I’m formatting the data for a .csv report and trying to generate a ‘long’ data.frame/tibble with term_type taken from the first part of the column name and term taken from the second part. It mostly works with pivot_longer from the tidyr package:

tidyr::pivot_longer(
    data = xx, 
    cols = everything(), 
    names_sep = '\\.', 
    names_to = c('term_type', 'term'), 
    values_to = 'term_val'
)

The result looks like this:

# A tibble: 8 x 3
  term_type term      term_val
  <chr>     <chr>        <dbl>
1 Estimate  Intercept   -0.140
2 Est       Error        0.775
3 Q5        Intercept   -1.57 
4 Q95       Intercept    0.773
5 Estimate  Intercept   -0.140
6 Est       Error        0.777
7 Q5        Intercept   -1.55 
8 Q95       Intercept    0.792

But it throws this warning:

Warning message:
Expected 2 pieces. Additional pieces discarded in 1 rows [2].

Can I use the names_sep argument to specify that I want the second piece of the split string, but only for the second column? That is, I want Error instead of Est. I’ve fixed it for now using an ifelse, but I’m wondering if it can be done within the call itself. My instinct is there’s some clever regex, or perhaps something using stringr, but I’m stumped for now…

Solution:

Some of the column names contain more than one . (for example Est.Error.Intercept), so splitting on . yields three pieces instead of two. It may be better to use names_pattern with capture groups ((...)) that each match one or more characters other than . ([^.]+), and to anchor the pattern at the end of the string with $:
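To see where the extra piece comes from, here is a minimal base R sketch that splits the question's column names on a literal dot. It uses strsplit rather than tidyr itself, so it only approximates what names_sep does internally; the variable names col_names, pieces and n_pieces are made up for illustration.

```r
# Column names from the question's example data
col_names <- c("Estimate.Intercept", "Est.Error.Intercept",
               "Q5.Intercept", "Q95.Intercept")

# Split each name on a literal dot, using the same regular
# expression that was passed to names_sep in the question
pieces <- strsplit(col_names, "\\.")

# Count the pieces per name; the second name has one more
# piece than the two entries supplied in names_to
n_pieces <- lengths(pieces)
names(n_pieces) <- col_names
print(n_pieces)
```

Only Est.Error.Intercept produces a third piece, which matches the single row reported in the warning. The names_pattern call below sidesteps this by matching from the end of the name: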

tidyr::pivot_longer(
    data = xx, 
    cols = everything(), 
    names_pattern = "([^.]+)\\.([^.]+)$", 
    names_to = c('term_type', 'term'), 
    values_to = 'term_val'
)

Output:

# A tibble: 8 × 3
  term_type term      term_val
  <chr>     <chr>        <dbl>
1 Estimate  Intercept     -0.1
2 Error     Intercept      0.7
3 Q5        Intercept     -1.5
4 Q95       Intercept      0.7
5 Estimate  Intercept     -0.2
6 Error     Intercept      0.8
7 Q5        Intercept     -1.4
8 Q95       Intercept      0.8

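"([^.]+)\\.([^.]+)$" captures two groups: 1) ([^.]+), one or more characters that are not a ., followed by a literal . (\\.), and 2) a second run of characters that are not a ., running to the end ($) of the string. Because the pattern is anchored only at the end, the leading Est. in Est.Error.Intercept falls outside the match and is dropped, which is why that column ends up with Error as its term_type.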
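The pattern can also be checked on its own, without pivoting anything. The sketch below uses base R's regexec and regmatches; tidyr may rely on a different regex engine internally, so treat this as an approximation, although for a pattern this simple the captured groups should agree. The names col_names, pattern, matches and groups are illustrative only.

```r
# Column names from the question's example data
col_names <- c("Estimate.Intercept", "Est.Error.Intercept",
               "Q5.Intercept", "Q95.Intercept")

# The pattern from the answer: two dot-free runs separated by a
# literal dot, anchored at the end of the string
pattern <- "([^.]+)\\.([^.]+)$"

# regexec() locates the match and its capture groups;
# regmatches() returns the full match followed by each group
matches <- regmatches(col_names, regexec(pattern, col_names))

# Drop the full match (element 1) and keep the two captured groups,
# giving one row per column name
groups <- t(vapply(matches, function(m) m[2:3], character(2)))
dimnames(groups) <- list(col_names, c("term_type", "term"))
print(groups)
```

For Est.Error.Intercept the first group is Error rather than Est, in line with the pivot_longer output shown above.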