How to remove the third word from a string, but only if the string contains exclusively letters?

February 20, 2023

I have a column of strings (species names), such as the one below:

 Species
---------
Aaaaba fossicollis
Aaadonta constricta babelthuapi
Aaadonta constricta constricta
Aaadonta constricta komakanensis
Aaadonta constricta ssp. 2 DAB-2020
Aaadonta irregularis
Aaadonta sp. 1 DAB-2020
Aaadonta sp. DAB-2021
Aacanthocnema dobsoni
Aagaardia protensa
Aaptos aaptos

My goal is to remove the third word (words are separated by spaces) from every string that consists exclusively of letters.
Strings that include numbers should stay the same, as should strings that contain only two words.

My desired output is:

 Species
---------
Aaaaba fossicollis
Aaadonta constricta
Aaadonta constricta
Aaadonta constricta
Aaadonta constricta ssp. 2 DAB-2020
Aaadonta irregularis
Aaadonta sp. 1 DAB-2020
Aaadonta sp. DAB-2021
Aacanthocnema dobsoni
Aagaardia protensa
Aaptos aaptos

I have tried this code, but it also removes part of the strings that contain numerical characters, as well as the second word in two-word strings:

df$Species<-gsub("\\s*\\w*$", "", df$Species)

>Solution :

We can use sub() with a capture group:

df$Species <- sub("(\\S+ \\S+) [A-Za-z]+(?!\\S)", "\\1", df$Species, perl=TRUE)

Here is an explanation of the regex pattern:

(\S+ \S+) match and capture the first two non-whitespace terms
' ' match a single literal space
[A-Za-z]+ then match a third word consisting only of letters
(?!\S) assert that this third word is followed by whitespace or the end of the string

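Then we replace the match with \1 (written "\\1" in an R string) to keep just the first two terms. perl=TRUE is needed because R's default regex engine does not support lookaheads such as (?!\S).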
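
To check the pattern against the sample data from the question, here is a minimal reproducible sketch. The data frame is rebuilt by hand from the question's listing. The `strict` variant and the `edge` string are assumptions added for illustration and are not part of the original answer: the answer's pattern is not anchored, so on other inputs it can match a later group of words, whereas the anchored variant only changes strings made of exactly three all-letter words.

```r
# Sample data copied from the question.
df <- data.frame(
  Species = c(
    "Aaaaba fossicollis",
    "Aaadonta constricta babelthuapi",
    "Aaadonta constricta constricta",
    "Aaadonta constricta komakanensis",
    "Aaadonta constricta ssp. 2 DAB-2020",
    "Aaadonta irregularis",
    "Aaadonta sp. 1 DAB-2020",
    "Aaadonta sp. DAB-2021",
    "Aacanthocnema dobsoni",
    "Aagaardia protensa",
    "Aaptos aaptos"
  ),
  stringsAsFactors = FALSE
)

# Pattern from the answer; perl = TRUE enables the (?!\S) lookahead.
result <- sub("(\\S+ \\S+) [A-Za-z]+(?!\\S)", "\\1", df$Species, perl = TRUE)

# Hypothetical stricter variant (an assumption, not from the answer):
# anchored with ^ and $, so only strings consisting of exactly three
# all-letter words are shortened.
strict <- sub("^([A-Za-z]+ [A-Za-z]+) [A-Za-z]+$", "\\1", df$Species)

# Hypothetical edge case (not in the question's data): the fourth word is
# all letters, so the unanchored pattern matches "sp. 1 abc" and drops "abc",
# while the anchored variant leaves the string unchanged.
edge <- "Aaadonta sp. 1 abc"
edge_unanchored <- sub("(\\S+ \\S+) [A-Za-z]+(?!\\S)", "\\1", edge, perl = TRUE)
edge_strict <- sub("^([A-Za-z]+ [A-Za-z]+) [A-Za-z]+$", "\\1", edge)

# Compare the original strings with both results side by side.
print(data.frame(original = df$Species, result = result, strict = strict))
```

On the question's eleven rows both patterns give the same result, so the choice only matters if the real column contains cases like the hypothetical `edge` string.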