How to remove the third word from a string, but only if the string contains exclusively letters?

I have a column of strings (species names), such as the one below:

 Species
---------
Aaaaba fossicollis
Aaadonta constricta babelthuapi
Aaadonta constricta constricta
Aaadonta constricta komakanensis
Aaadonta constricta ssp. 2 DAB-2020
Aaadonta irregularis
Aaadonta sp. 1 DAB-2020
Aaadonta sp. DAB-2021
Aacanthocnema dobsoni
Aagaardia protensa
Aaptos aaptos

My goal is to remove the third word (words are separated by spaces) from every string that consists exclusively of letters.
Strings that include numbers should remain unchanged, as should strings that contain only two words.

My desired output would be this:

 Species
---------
Aaaaba fossicollis
Aaadonta constricta
Aaadonta constricta
Aaadonta constricta
Aaadonta constricta ssp. 2 DAB-2020
Aaadonta irregularis
Aaadonta sp. 1 DAB-2020
Aaadonta sp. DAB-2021
Aacanthocnema dobsoni
Aagaardia protensa
Aaptos aaptos

I have tried this code, but it also removes part of the strings that contain numeric characters, as well as the second word of two-word strings:

df$Species<-gsub("\\s*\\w*$", "", df$Species)

Solution:

We can use sub() with a capture group:

df$Species <- sub("(\\S+ \\S+) [A-Za-z]+(?!\\S)", "\\1", df$Species, perl=TRUE)
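
To check the call against the data from the question, here is a minimal reproducible sketch. The data frame is rebuilt by hand from the sample column, so the construction of `df` is an assumption about how the real data is stored; the `sub()` call itself is unchanged.

```r
# Sample data copied from the question (stored here as a plain character column)
df <- data.frame(
  Species = c(
    "Aaaaba fossicollis",
    "Aaadonta constricta babelthuapi",
    "Aaadonta constricta constricta",
    "Aaadonta constricta komakanensis",
    "Aaadonta constricta ssp. 2 DAB-2020",
    "Aaadonta irregularis",
    "Aaadonta sp. 1 DAB-2020",
    "Aaadonta sp. DAB-2021",
    "Aacanthocnema dobsoni",
    "Aagaardia protensa",
    "Aaptos aaptos"
  ),
  stringsAsFactors = FALSE
)

# Drop a letters-only third word; perl = TRUE enables the (?!...) lookahead
df$Species <- sub("(\\S+ \\S+) [A-Za-z]+(?!\\S)", "\\1", df$Species, perl = TRUE)

print(df)
```

On this sample, only the three-word rows made up of letters lose their last word; the rows containing "ssp.", "sp." or digits are left as they were.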

Here is an explanation of the regex pattern:

  • (\S+ \S+) match and capture the first two non-whitespace terms
  • a literal space, to match the single space before the third word
  • [A-Za-z]+ then match a third word consisting of only letters
  • (?!\S) assert that this third word is followed by whitespace or the end of the string (this lookahead is why perl=TRUE is needed)

Then, we replace with \1, to keep just the first two terms.
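
One caveat: the pattern is not anchored to the start of the string, so on data beyond the sample it can also match later in a string that contains digits or punctuation. If the requirement is read literally (only touch strings made up entirely of letters), a stricter variant could look like the sketch below. The helper name `drop_third_word` and the last test string are made up for illustration, and the check assumes ASCII letters separated by single spaces.

```r
# Hypothetical helper: remove the third word only from letters-only strings
drop_third_word <- function(x) {
  # TRUE only for strings consisting of ASCII letter words separated by single spaces
  letters_only <- grepl("^[A-Za-z]+( [A-Za-z]+)*$", x)
  # Anchored at the start: keep words one and two, drop word three.
  # Strings with fewer than three words do not match and are returned unchanged.
  ifelse(letters_only, sub("^(\\S+ \\S+) \\S+", "\\1", x), x)
}

species <- c(
  "Aaadonta constricta babelthuapi",
  "Aaadonta sp. 1 DAB-2020",
  "Aaptos aaptos",
  "Genus sp. 1 alpha beta"  # made-up string, not from the question
)

drop_third_word(species)
```

With the unanchored pattern, the made-up last string would lose the word "alpha", because "sp. 1" can act as the first two terms. The anchored version leaves it untouched because the string is not letters-only.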
