
Remove all words after a specific word, including the word itself, from a column in R

I'm looking to remove specific words from a column in my data, together with all words and characters that appear after them.

Specifically, I'd like to remove the words "aka" and "dba" and everything that follows them.

This is what my data looks like:

     Number.of.Workers                                    company_name
5195                82                  valley ho hotels aka kings inn
5196                82                     aluminum precision products
5197                79                  levity of brea dba brea improv
5198                79                                          crunch
5199                71 comedy club of los angeles dba hollywood improv
5200                65            andre-boudin bakeries inc dba boudin
structure(list(Number.of.Workers = c("82", "82", "79", "79", 
"71", "65", "62", "58", "56", "53", "49"), company_name = c("valley ho hotels aka kings inn", 
"aluminum precision products", "levity of brea  dba brea improv", 
"crunch", "comedy club of los angeles  dba hollywood improv", 
"andre-boudin bakeries inc   dba boudin", "comedy club of san jose  dba san jose improv", 
"comedy club of brea  dba ontario improv", "sprout bost ", "culver west lp - playa provisions", 
"faa concord h dba concord honda")), row.names = 5195:5205, class = "data.frame")


Solution:

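You can use sub() with a pattern that matches the non-word characters before "aka" or "dba", the keyword itself, and everything after it up to the end of the string: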

df$company_name = sub("\\W+?(aka|dba)\\W+.*$", "", df$company_name)

output:

     Number.of.Workers                      company_name
5195                82                  valley ho hotels
5196                82       aluminum precision products
5197                79                    levity of brea
5198                79                            crunch
5199                71        comedy club of los angeles
5200                65         andre-boudin bakeries inc
5201                62           comedy club of san jose
5202                58               comedy club of brea
5203                56                      sprout bost 
5204                53 culver west lp - playa provisions
5205                49                     faa concord h
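
To try the approach without the full data frame, here is a minimal self-contained sketch on a few of the sample names. The vector `x` and the helper `strip_after()` are hypothetical names introduced for illustration and are not part of the answer above. The sketch also wraps the result in `trimws()`, an optional extra step (an assumption about what you may want) that removes leftover trailing spaces such as the one in "sprout bost ".

```r
# Hypothetical sample vector built from the question's data
x <- c("valley ho hotels aka kings inn",
       "levity of brea  dba brea improv",
       "crunch",
       "sprout bost ")

# Hypothetical helper: drop each keyword and everything after it,
# then trim leftover whitespace (the trimming is an addition, not in the answer)
strip_after <- function(s, keywords = c("aka", "dba")) {
  # Builds the same pattern as in the answer: \W+?(aka|dba)\W+.*$
  pattern <- paste0("\\W+?(", paste(keywords, collapse = "|"), ")\\W+.*$")
  trimws(sub(pattern, "", s))
}

strip_after(x)
# [1] "valley ho hotels" "levity of brea"   "crunch"           "sprout bost"
```

Because the pattern requires non-word characters on both sides of the keyword, names that merely contain "aka" or "dba" inside a longer word are left alone. To apply the helper to the data frame, something like `df$company_name <- strip_after(df$company_name)` should work, assuming `df` holds the data from the question.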

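Note: thanks to @Chris Ruehlemann for pointing out sub vs gsub. The difference is that sub() replaces only the first match in each string, while gsub() replaces all matches. Because the pattern here extends to the end of the string, there is at most one match per company name, so sub() is sufficient.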
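
The difference between the two functions can be seen on a small made-up string. The string `s` below is a hypothetical example, not taken from the data.

```r
# Hypothetical string containing the keyword twice
s <- "a dba b dba c"

# With a plain pattern, sub() and gsub() behave differently
first_only <- sub("dba", "X", s)    # "a X b dba c"
all_hits   <- gsub("dba", "X", s)   # "a X b X c"

# With the answer's pattern, the first match already runs to the end
# of the string, so both functions return the same result
pattern  <- "\\W+?(aka|dba)\\W+.*$"
with_sub  <- sub(pattern, "", s)    # "a"
with_gsub <- gsub(pattern, "", s)   # "a"
```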
