Extracting the substring before a regular expression pattern with multiple alternatives

October 27, 2023

I need to extract a substring from each element of a column in an R data frame. Specifically, I need the part of each string located before the first number or the first opening bracket ('['). The trailing space should be removed as well.

   [1] "Arturo Beniamo 29 10 2015.docx"               
   [2] "Arturo Beniamo [30 12 2015].docx"               
   [3] "Dominici Leonardo 02 06 2019.docx"                
   [4] "Didonna Marco 07 09 2023.docx"

This should be the result:

   [1] "Arturo Beniamo"               
   [2] "Arturo Beniamo"               
   [3] "Dominici Leonardo"                
   [4] "Didonna Marco"

Solution:

You may use:

x <- c("Arturo Beniamo 29 10 2015.docx", "Arturo Beniamo [30 12 2015].docx" , 
       "Dominici Leonardo 02 06 2019.docx", "Didonna Marco 07 09 2023.docx")

sub('\\s(\\d|\\[).*', '', x)
#[1] "Arturo Beniamo"    "Arturo Beniamo"   "Dominici Leonardo" "Didonna Marco"

This removes everything from the first whitespace character (\\s) that is followed by either a digit (\\d) or an opening square bracket (\\[) through the end of the string (.*).
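
Since the question mentions a data frame column, here is a minimal sketch of the same idea applied to one. The data frame `df`, its column names `file` and `name`, and the extra inputs in `extra` are assumptions made for illustration and do not come from the question. The only change to the pattern is `\\s*` in place of `\\s`, which might be useful if some names are separated from the date by more than one space.

```r
# Hypothetical data frame; the column name `file` is an assumption.
df <- data.frame(
  file = c("Arturo Beniamo 29 10 2015.docx",
           "Arturo Beniamo [30 12 2015].docx",
           "Dominici Leonardo 02 06 2019.docx",
           "Didonna Marco 07 09 2023.docx"),
  stringsAsFactors = FALSE
)

# \\s* matches zero or more whitespace characters before the first
# digit or '[', so any run of separating spaces is dropped as well.
pattern <- "\\s*(\\d|\\[).*"

# sub() is vectorised, so it can be applied to the whole column at once.
df$name <- sub(pattern, "", df$file)
df$name

# Hypothetical extra inputs (not from the question): a double space
# before the bracket, and a string with no digit or bracket at all.
extra <- c("Rossi Mario  [01 01 2020].docx", "Readme.docx")
sub(pattern, "", extra)
```

Strings that contain neither a digit nor an opening bracket are returned unchanged, because sub() only modifies elements in which the pattern matches.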