Replace multiple phrases with NA (or blank) in R


I am working in R.

I have some phrases that I want to remove from some text strings in a dataframe.
words_remove holds the phrases I want to remove. A phrase should only be removed when the whole exact phrase appears in the string; partial matches should be left alone.

words_remove <- c("red cats", "blue dogs", "pink horse")

This is my data frame:

data <- data.frame(row_id=1:4, text = c("red cats don't exist", "I have a blue dog", "I don't like blue dogs", "I like horses"))
  row_id                   text
1      1   red cats don't exist
2      2      I have a blue dog
3      3 I don't like blue dogs
4      4          I like horses

I want to replace every occurrence of a phrase from words_remove in the text column with NA (or, even better, remove it entirely).

My required output:

  row_id              text
1      1       don't exist
2      2 I have a blue dog
3      3      I don't like
4      4     I like horses

In my real data frame, words_remove contains many phrases, so I think handling each one with case_when() or similar would be too time-consuming.

Any ideas?

Solution:

You may form a regex alternation of the phrases and do a replacement on that:

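words_remove <- c("red cats", "blue dogs", "pink horse")
# One pattern: optional whitespace, any whole phrase (word-bounded), optional whitespace
regex <- paste0("\\s*\\b(?:", paste(words_remove, collapse="|"), ")\\b\\s*")
# Replace each match with a single space, then trim leading/trailing whitespace
data$text <- gsub("^\\s+|\\s+$", "", gsub(regex, " ", data$text))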

  row_id              text
1      1       don't exist
2      2 I have a blue dog
3      3      I don't like
4      4     I like horses

The strategy here is to replace any matching phrase plus any surrounding whitespace with just a single space. The outer call to gsub() strips off any remaining leading/trailing whitespace.
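
If the real phrase list might contain regex metacharacters (such as "+" or "."), or if you would rather get NA when nothing is left of a string, the same idea can be wrapped in a small helper. This is only a sketch: remove_phrases and its empty_as_na argument are names made up for illustration, and it assumes PCRE (perl = TRUE) so that each phrase can be quoted literally with \Q...\E. It also assumes no phrase contains a literal "\E", and that each phrase starts and ends with a word character so the \b boundaries behave as expected.

```r
# Hypothetical helper (not part of the original answer).
# Removes whole phrases from a character vector and tidies the whitespace.
remove_phrases <- function(x, phrases, empty_as_na = FALSE) {
  # Quote each phrase with \Q...\E so it is matched literally (PCRE only).
  quoted <- paste0("\\Q", phrases, "\\E", collapse = "|")
  # Optional whitespace, a whole word-bounded phrase, optional whitespace.
  pattern <- paste0("\\s*\\b(?:", quoted, ")\\b\\s*")
  # Replace each match with one space, then trim both ends.
  out <- trimws(gsub(pattern, " ", x, perl = TRUE))
  if (empty_as_na) {
    # Strings that consisted only of removed phrases become NA.
    out[!is.na(out) & out == ""] <- NA_character_
  }
  out
}

words_remove <- c("red cats", "blue dogs", "pink horse")
data <- data.frame(
  row_id = 1:4,
  text = c("red cats don't exist", "I have a blue dog",
           "I don't like blue dogs", "I like horses")
)

data$text <- remove_phrases(data$text, words_remove)
print(data$text)
# [1] "don't exist"       "I have a blue dog" "I don't like"      "I like horses"
```

With empty_as_na = TRUE, a string made up of nothing but a listed phrase (for example "red cats") comes back as NA instead of an empty string, which covers the "replace with NA" variant of the question.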
