Remove extra words from text

August 24, 2022

I've been trying to remove extra words like {'by', 'the', 'and', 'of', 'a'}
from text, and the best way I have found so far is this.

Code:

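import re

def clean_text(text):
    """
    Takes the text and removes punctuation and some stop words.
    """
    stopwords = {'by', 'the', 'and', 'of', 'a'}
    # Split on runs of non-word characters (raw string avoids an invalid-escape warning),
    # skip empty pieces, and keep only words that are not stop words.
    result = [word for word in re.split(r"\W+", text) if word and word.lower() not in stopwords]
    result = ' '.join(result)
    print(result)
    return result

# dummy text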
long_string = "one Groups are marked by the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there"
clean_text(long_string)

My question is: is there a better way to do this without a for loop? Does regex have a way to remove certain words from text so that the loop can be avoided?

>Solution :

You could use a regex replacement approach by forming an alternation of stop words and then removing them.

import re

long_string = "one Groups are marked by the ()meta-characters. two They group together the expressions contained one inside them, and you can one repeat the contents of a group with a repeating qualifier, such as there"
words = ["by", "the", "and", "of", "a"]
# Repeat the group so that runs of adjacent stop words ("by the", "of a")
# are replaced by a single space rather than one space per word.
regex = r'(?:\s*\b(?:' + r'|'.join(words) + r')\b)+\s*'
output = re.sub(regex, ' ', long_string).strip()
print(output)

This prints:

one Groups are marked ()meta-characters. two They group together expressions contained one inside them, you can one repeat contents group with repeating qualifier, such as there
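
The same idea can be wrapped in a reusable function. This is only a minimal sketch: the name remove_stopwords is hypothetical, and it assumes every stop word consists of plain word characters, since \b only works at word boundaries. It escapes each word with re.escape, ignores case with re.IGNORECASE, and repeats the group so that a run of adjacent stop words leaves a single space.

```python
import re

def remove_stopwords(text, stopwords):
    """Hypothetical helper: drop whole-word stop words from text, ignoring case."""
    # Escape each stop word so any regex metacharacters are matched literally.
    alternation = "|".join(map(re.escape, stopwords))
    # One or more stop words in a row, plus surrounding whitespace.
    pattern = re.compile(r"(?:\s*\b(?:" + alternation + r")\b)+\s*", re.IGNORECASE)
    # Replace each run with one space, then trim the ends.
    return pattern.sub(" ", text).strip()

print(remove_stopwords("The cat and the hat of a king", ["by", "the", "and", "of", "a"]))
```

Unlike the split-and-join version in the question, this keeps punctuation in place and only removes the listed words. If the function is called many times with the same word list, compiling the pattern once outside the function would avoid rebuilding it on every call.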