I have a column containing strings (i.e. names of species), such as the one below:
Species
---------
Aaaaba fossicollis
Aaadonta constricta babelthuapi
Aaadonta constricta constricta
Aaadonta constricta komakanensis
Aaadonta constricta ssp. 2 DAB-2020
Aaadonta irregularis
Aaadonta sp. 1 DAB-2020
Aaadonta sp. DAB-2021
Aacanthocnema dobsoni
Aagaardia protensa
Aaptos aaptos
My goal is to remove the third word (words being separated by spaces) from every string, but only when that third word is made up exclusively of letters.
So strings whose third word includes numbers or punctuation would remain the same, as would strings that contain just two words.
My desired output would be this:
Species
---------
Aaaaba fossicollis
Aaadonta constricta
Aaadonta constricta
Aaadonta constricta
Aaadonta constricta ssp. 2 DAB-2020
Aaadonta irregularis
Aaadonta sp. 1 DAB-2020
Aaadonta sp. DAB-2021
Aacanthocnema dobsoni
Aagaardia protensa
Aaptos aaptos
I have tried this code, but it also removes part of the strings that contain numerical characters, as well as the second word in two-word strings:
df$Species<-gsub("\\s*\\w*$", "", df$Species)
>Solution :
We can use sub() with a capture group:
df$Species <- sub("(\\S+ \\S+) [A-Za-z]+(?!\\S)", "\\1", df$Species, perl=TRUE)
Here is an explanation of the regex pattern:
(\S+ \S+)   match and capture the first two non-whitespace terms
" "         match a single space
[A-Za-z]+   then match a third word consisting of only letters
(?!\S)      assert that this third word is followed by whitespace or the end of the string
Then, we replace with \1, to keep just the first two terms.
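As a quick check, here is a minimal sketch that applies the same pattern to the sample data from the question and compares the result with the desired output. The data frame is rebuilt by hand here, and the last few lines use a made-up input ("Aaadonta sp. 1 abc") that is not in the question; it is only there to show that the unanchored pattern can also match later in the string, and that adding ^ is one possible way to restrict it to the third word if that matters for your data.

```r
# Sample data copied from the question
df <- data.frame(
  Species = c(
    "Aaaaba fossicollis",
    "Aaadonta constricta babelthuapi",
    "Aaadonta constricta constricta",
    "Aaadonta constricta komakanensis",
    "Aaadonta constricta ssp. 2 DAB-2020",
    "Aaadonta irregularis",
    "Aaadonta sp. 1 DAB-2020",
    "Aaadonta sp. DAB-2021",
    "Aacanthocnema dobsoni",
    "Aagaardia protensa",
    "Aaptos aaptos"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer; perl = TRUE is needed for the lookahead (?!\\S)
pattern <- "(\\S+ \\S+) [A-Za-z]+(?!\\S)"
result <- sub(pattern, "\\1", df$Species, perl = TRUE)

# Desired output from the question
expected <- c(
  "Aaaaba fossicollis",
  "Aaadonta constricta",
  "Aaadonta constricta",
  "Aaadonta constricta",
  "Aaadonta constricta ssp. 2 DAB-2020",
  "Aaadonta irregularis",
  "Aaadonta sp. 1 DAB-2020",
  "Aaadonta sp. DAB-2021",
  "Aacanthocnema dobsoni",
  "Aagaardia protensa",
  "Aaptos aaptos"
)
stopifnot(identical(result, expected))

# Hypothetical input (not in the question): a letters-only FOURTH word.
# The unanchored pattern removes it; the anchored variant leaves it alone.
extra <- "Aaadonta sp. 1 abc"
unanchored <- sub(pattern, "\\1", extra, perl = TRUE)
anchored <- sub(paste0("^", pattern), "\\1", extra, perl = TRUE)
stopifnot(unanchored == "Aaadonta sp. 1", anchored == extra)
```

For the sample data both variants give the same result, so the anchor is only worth adding if your real column can contain letters-only words after the third position.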