pivot_longer across multiple columns, how to use names_pattern parameter

June 30, 2023

df_wide <- structure(list(name = c("Crushers GC", "4Aces GC"), first_name = c("Charles", 
"Peter"), last_name = c("Howell III", "Uihlein"), first_name_1 = c("Paul", 
"Pat"), last_name_1 = c("Casey", "Perez"), first_name_2 = c("Bryson", 
"Dustin"), last_name_2 = c("DeChambeau", "Johnson"), first_name_3 = c("Anirban", 
"Patrick"), last_name_3 = c("Lahiri", "Reed")), row.names = c(NA, 
-2L), class = c("tbl_df", "tbl", "data.frame"))

name        first_name last_name  first_name_1 last_name_1 first_name_2 last_name_2 first_name_3 last_name_3
  <chr>       <chr>      <chr>      <chr>        <chr>       <chr>        <chr>       <chr>        <chr>      
1 Crushers GC Charles    Howell III Paul         Casey       Bryson       DeChambeau  Anirban      Lahiri     
2 4Aces GC    Peter      Uihlein    Pat          Perez       Dustin       Johnson     Patrick      Reed

Our goal is to pivot_longer into a dataframe with 3 columns name, first, last, the code seems like it should be simple however we are not sure what value to pass to names_pattern, we have:

zed %>%
  tidyr::pivot_longer(cols = -name,
                      names_to = c("fname", "lname"),
                      names_pattern = "???")

>Solution :

Here’s the way to go:

df_wide %>%
  tidyr::pivot_longer(cols = -name,
                      names_pattern = "(.*_name).*",
                      names_to = ".value")

#   name        first_name last_name 
# 1 Crushers GC Charles    Howell III
# 2 Crushers GC Paul       Casey     
# 3 Crushers GC Bryson     DeChambeau
# 4 Crushers GC Anirban    Lahiri    
# 5 4Aces GC    Peter      Uihlein   
# 6 4Aces GC    Pat        Perez     
# 7 4Aces GC    Dustin     Johnson   
# 8 4Aces GC    Patrick    Reed

Explanation:

You should define the pattern of existing columns in names_pattern. In this case, you only want one group that is either last_name or first_name. You should use regex to define those:

(.*_name).*

This will match every column that contains _name. The parenthesis are meant to capture groups. You only need the first part of the column names (so, not the _1, _2) in your expected output, so we use parenthesis only for one group. This is why there is only one names_to element.

In names_to, where you define new column names, .value refers to:

".value" indicates that the corresponding component of the column name
defines the name of the output column containing the cell values,
overriding values_to entirely.

Here, .value refers to whatever matches (.*_name), so both first_name and last_name. And that is why we end up with two columns. If there were middle_name, middle_name_1, … columns in the original data.frame, we would end up with a third middle_name column.