How to find the separator between words in a string vector

September 23, 2023

How can I find the common separator between two-part words in a string vector, given that the separator always matches "[^[:alnum:]]+"?

For example, in the first vector, the common separator is ".", and in the second vector, the common separator is "_".

Is it possible to have a function that accepts a vector like first or second, and outputs "." or "_"?

first = c("L2DF.L2DA", "L2G.L2DA", "L2L.L2DA", "L2M.L2DA", "L2P.L2DA", 
"L2V.L2DA", "L2G.L2DF", "L2L.L2DF", "L2M.L2DF", "L2P.L2DF", "L2V.L2DF", 
"L2L.L2G", "L2M.L2G", "L2P.L2G", "L2M.L2L", "L2P.L2L", "L2P.L2M", 
"L2R.L2DA", "L2R.L2DF", "L2R.L2G", "L2R.L2L", "L2R.L2M", "L2R.L2P", 
"L2V.L2R", "L2V.L2G", "L2V.L2L", "L2V.L2M", "L2V.L2P")

second = c("L2DF_L2DA", "L2G_L2DA", "L2L_L2DA", "L2M_L2DA", "L2P_L2DA", 
"L2V_L2DA", "L2G_L2DF", "L2L_L2DF", "L2M_L2DF", "L2P_L2DF", "L2V_L2DF", 
"L2L_L2G", "L2M_L2G", "L2P_L2G", "L2M_L2L", "L2P_L2L", "L2P_L2M", 
"L2R_L2DA", "L2R_L2DF", "L2R_L2G", "L2R_L2L", "L2R_L2M", "L2R_L2P", 
"L2V_L2R", "L2V_L2G", "L2V_L2L", "L2V_L2M", "L2V_L2P")

Solution:

You could have something like this:

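# Using stringr: extract every non-alphanumeric character, then deduplicate
sep_extract <- \(s) stringr::str_extract_all(s, "[^[:alnum:]]") |> unlist() |> unique()

# or using base R: delete the alphanumeric characters and keep what is left
sep_extract <- \(s) gsub("[[:alnum:]]", "", s) |> unique()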

sep_extract(first) # [1] "."
sep_extract(second) # [1] "_"

Notes:

This only works if the separators are the only non-alphanumeric characters in your strings. If that’s not the case, you would have to specify which characters count as separators, or use a more complicated regex.
The + from your regex is not needed with str_extract_all(): it returns every match, so each character of a repeated separator is picked up anyway, and unique() collapses the duplicates.
If you’d prefer to keep the separators found in each string as a separate result, remove unlist(); str_extract_all() returns a list with one element per string.
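
For the case raised in the first note, where the strings may contain other non-alphanumeric characters besides the separator, one possible approach is to keep only the matches that occur in every string. This is a minimal base R sketch, not part of the original answer: common_sep is a hypothetical name, and it assumes that the real separator appears in every element while any stray characters do not.

```r
# Hypothetical helper (an assumption, not from the answer above):
# return the separator(s) shared by every element of s.
common_sep <- function(s, pattern = "[^[:alnum:]]+") {
  # One character vector per string, holding each run of non-alphanumerics
  seps <- regmatches(s, gregexpr(pattern, s))
  # Keep only the runs that are present in every string
  Reduce(intersect, seps)
}

# Shortened versions of the vectors from the question
first  <- c("L2DF.L2DA", "L2G.L2DA", "L2L.L2DA")
second <- c("L2DF_L2DA", "L2G_L2DA", "L2L_L2DA")
# Made-up input where "-" appears in some strings but "." in all of them
mixed  <- c("a-b.c", "d.e", "f.g-h")

common_sep(first)               # "."
common_sep(second)              # "_"
common_sep(mixed)               # "."
common_sep(c("a--b", "c--d"))   # "--" (the + keeps a repeated separator whole)
```

If no separator is shared by all elements, the function returns character(0), so the caller can check length() of the result before using it.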