Is the `names_sep` argument in `tidyr::pivot_longer` flexible on string splitting?

I have some random effect coefficients extracted from an R model object. For a random intercept, they look like this:

xx <- data.frame(
    `Estimate.Intercept` = c(-0.1, -0.2), 
    `Est.Error.Intercept` = c(0.7, 0.8), 
    `Q5.Intercept` = c(-1.5, -1.4), 
    `Q95.Intercept` = c(0.7, 0.8)
)

I’m formatting the data for a .csv report and trying to generate a ‘long’ data.frame/tibble with term_type taken from the first part of the column name and term taken from the second part. It mostly works with pivot_longer from the tidyr package:

tidyr::pivot_longer(
    data = xx, 
    cols = everything(), 
    names_sep = '\\.', 
    names_to = c('term_type', 'term'), 
    values_to = 'term_val'
)

The result looks like this:

# A tibble: 8 x 3
  term_type term      term_val
  <chr>     <chr>        <dbl>
1 Estimate  Intercept   -0.140
2 Est       Error        0.775
3 Q5        Intercept   -1.57 
4 Q95       Intercept    0.773
5 Estimate  Intercept   -0.140
6 Est       Error        0.777
7 Q5        Intercept   -1.55 
8 Q95       Intercept    0.792

But it throws this warning:

Warning message:
Expected 2 pieces. Additional pieces discarded in 1 rows [2].
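
As a quick check of where the warning comes from (a base R sketch that is not part of the original post; col_names simply repeats the column names of xx), splitting each name on a literal . shows how many pieces each one yields:

```r
# Column names copied from the example data frame xx
col_names <- c("Estimate.Intercept", "Est.Error.Intercept",
               "Q5.Intercept", "Q95.Intercept")

# Split each name on a literal "." and count the resulting pieces
pieces <- strsplit(col_names, ".", fixed = TRUE)
n_pieces <- lengths(pieces)
names(n_pieces) <- col_names

print(n_pieces)  # the second name has 3 pieces, the others have 2
```

The second name is the only one with three pieces, which matches the [2] in the warning: only two output columns were requested, so the third piece (Intercept) is discarded and the row ends up as Est / Error.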

Can I use the names_sep argument to specify that I want the second piece of the split string, but only for the second column? That is, I want Error instead of Est as the term_type. I’ve fixed it for now using an ifelse, but I’m wondering whether it can be done within the call itself. My instinct is that there’s some clever regex, or perhaps something using stringr, but I’m stumped for now…

Solution:

Some of the column names contain more than one . (Est.Error.Intercept). It may be better to use names_pattern with capture groups ((...)) that match one or more characters other than . ([^.]+). In addition, anchor the pattern to the end of the string with $:

tidyr::pivot_longer(
    data = xx, 
    cols = everything(), 
    names_pattern = "([^.]+)\\.([^.]+)$",
    names_to = c('term_type', 'term'), 
    values_to = 'term_val'
)

-output

# A tibble: 8 × 3
  term_type term      term_val
  <chr>     <chr>        <dbl>
1 Estimate  Intercept     -0.1
2 Error     Intercept      0.7
3 Q5        Intercept     -1.5
4 Q95       Intercept      0.7
5 Estimate  Intercept     -0.2
6 Error     Intercept      0.8
7 Q5        Intercept     -1.4
8 Q95       Intercept      0.8
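
"([^.]+)\\.([^.]+)$" captures two groups: 1) ([^.]+), one or more characters that are not a ., followed by a literal . (\\.), and 2) a second run of characters that are not a ., up to the end ($) of the string. Because the pattern is anchored only at the end, the leading Est. in Est.Error.Intercept is left unmatched, so that column yields Error and Intercept.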

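To see what the pattern captures without running the pivot, the same regex can be applied to the column names directly. This is a minimal base R sketch, not part of the original answer; col_names and parts are illustrative names, and base R's regexec stands in for whatever matching tidyr does internally:

```r
# Column names copied from the example data frame xx
col_names <- c("Estimate.Intercept", "Est.Error.Intercept",
               "Q5.Intercept", "Q95.Intercept")

# Same pattern as passed to names_pattern in the answer
pattern <- "([^.]+)\\.([^.]+)$"

# regexec() returns the full match plus one entry per capture group;
# regmatches() extracts the corresponding substrings
m <- regmatches(col_names, regexec(pattern, col_names))

# Keep only the two capture groups (elements 2 and 3 of each match)
parts <- do.call(rbind, lapply(m, function(x) x[2:3]))
colnames(parts) <- c("term_type", "term")

print(parts)
```

For Est.Error.Intercept the first group is Error and the second is Intercept, so the leading Est is dropped rather than triggering the "additional pieces" warning.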