I’m looking to remove all words and characters that appear after specific words that appear in a column in my data.
This is what my data looks like.
A specific example what I like to accomplish is that I like to remove any word after "aka" and "dba" and the words "aka" and "dba" from my data.
| Number.of.Workers | company_name | |
|---|---|---|
| 5195 | 82 | valley ho hotels aka kings inn |
| 5196 | 82 | aluminum precision products |
| 5197 | 79 | levity of brea dba brea improv |
| 5198 | 79 | crunch |
| 5199 | 71 | comedy club of los angeles dba hollywood improv |
| 5200 | 65 | andre-boudin bakeries inc dba boudin |
structure(list(Number.of.Workers = c("82", "82", "79", "79",
"71", "65", "62", "58", "56", "53", "49"), company_name = c("valley ho hotels aka kings inn",
"aluminum precision products", "levity of brea dba brea improv",
"crunch", "comedy club of los angeles dba hollywood improv",
"andre-boudin bakeries inc dba boudin", "comedy club of san jose dba san jose improv",
"comedy club of brea dba ontario improv", "sprout bost ", "culver west lp - playa provisions",
"faa concord h dba concord honda")), row.names = 5195:5205, class = "data.frame")
>Solution :
You can use sub() as follows:
df$company_name = sub("\\W+?(aka|dba)\\W+.*$", "", df$company_name)
output:
Number.of.Workers company_name
5195 82 valley ho hotels
5196 82 aluminum precision products
5197 79 levity of brea
5198 79 crunch
5199 71 comedy club of los angeles
5200 65 andre-boudin bakeries inc
5201 62 comedy club of san jose
5202 58 comedy club of brea
5203 56 sprout bost
5204 53 culver west lp - playa provisions
5205 49 faa concord h
Note: thanks to @Chris Ruehlemann for pointing out sub vs gsub. The difference is that the former replaces the first occurrence, while the latter replaces all.