I have a data frame that looks like this
df=tibble(col1 = c("2397_A_merge",
"3779_A_merge",
"4958_BV"))
> df
# A tibble: 3 × 1
col1
<chr>
1 2397_A_merge
2 3779_A_merge
3 4958_BV
and I want to split it in a way that looks like this
col1 col2
2397_A merge
3779_A merge
4958_BV NA
Any help or recommendation is appreciated
>Solution :
We may use separate – the pattern is not clear. If it is a fixed word ‘merge’, then use a regex lookaround to split at _ that precedes the ‘merge’
library(tidyr)
separate(df, col1, into = c('col1', 'col2'), sep = "_(?=merge)", fill = "right")
-output
# A tibble: 3 × 2
col1 col2
<chr> <chr>
1 2397_A merge
2 3779_A merge
3 4958_BV <NA>
Or use extract to capture substrings as a group – (...) – the first group capture one or more digits (\\d+) from the start (^) of the string followed by a _ and one or more upper case letters ([A-Z]+), then followed by a _ (if present) and capture the rest of characters (.*) in second group if present
extract(df, col1, into = c("col1", "col2"), "^(\\d+_[A-Z]+)_?(.*)?")
# A tibble: 3 × 2
col1 col2
<chr> <chr>
1 2397_A "merge"
2 3779_A "merge"
3 4958_BV ""
Or with base R, create a unique delimiter by replacing the _ before the lower case letters with , and then use read.csv to read the column into two columns splitting at the ,
read.csv(text = sub("_([a-z]+)", ",\\1", df$col1),
header = FALSE, col.names = c("col1", "col2"), na.strings = "")
col1 col2
1 2397_A merge
2 3779_A merge
3 4958_BV <NA>