Split a column into two with R

December 15, 2022

I have a data frame that looks like this


df=tibble(col1 = c("2397_A_merge", 
                   "3779_A_merge", 
                   "4958_BV"))

> df
# A tibble: 3 × 1
  col1        
  <chr>       
1 2397_A_merge
2 3779_A_merge
3 4958_BV

and I want to split it in a way that looks like this

col1     col2
2397_A   merge
3779_A   merge
4958_BV   NA

Any help or recommendation is appreciated

>Solution :

We may use separate – the pattern is not clear. If it is a fixed word ‘merge’, then use a regex lookaround to split at _ that precedes the ‘merge’

library(tidyr)
separate(df, col1, into = c('col1', 'col2'), sep = "_(?=merge)", fill = "right")

-output

# A tibble: 3 × 2
  col1    col2 
  <chr>   <chr>
1 2397_A  merge
2 3779_A  merge
3 4958_BV <NA>

Or use extract to capture substrings as a group – (...) – the first group capture one or more digits (\\d+) from the start (^) of the string followed by a _ and one or more upper case letters ([A-Z]+), then followed by a _ (if present) and capture the rest of characters (.*) in second group if present

extract(df, col1, into = c("col1", "col2"), "^(\\d+_[A-Z]+)_?(.*)?")
# A tibble: 3 × 2
  col1    col2   
  <chr>   <chr>  
1 2397_A  "merge"
2 3779_A  "merge"
3 4958_BV ""

Or with base R, create a unique delimiter by replacing the _ before the lower case letters with , and then use read.csv to read the column into two columns splitting at the ,

read.csv(text = sub("_([a-z]+)", ",\\1", df$col1),
  header = FALSE, col.names = c("col1", "col2"), na.strings = "")
     col1  col2
1  2397_A merge
2  3779_A merge
3 4958_BV  <NA>