Text preprocessing function can't seem to remove the full Twitter hashtag

I'm trying to write a function that uses regular expressions to remove elements from a string.

In this example the given text is
‘@twitterusername Crazy wind today no birding #Python’

I want it to look like
‘crazy wind today no birding’

Instead, it still includes the hashtag text, like this:
‘crazy wind today no birding python’

I've tried a few different patterns and can't seem to get it right. Here is the code:

`import re
from nltk.stem import WordNetLemmatizer


def process(text):
    processed_text = []

    wordLemm = WordNetLemmatizer()

    # -- Regex patterns --

    # Remove urls pattern
    url_pattern = r"https?://\S+"

    # Remove usernames pattern
    user_pattern = r'@[A-Za-z0-9_]+'

    # Remove all characters except digits and alphabet pattern
    alpha_pattern = "[^a-zA-Z0-9]"

    # Remove twitter hashtags
    hashtag_pattern = r'#\w+\b'

    for tweet_string in text:

        # Change text to lower case
        tweet_string = tweet_string.lower()

        # Remove urls
        tweet_string = re.sub(url_pattern, '', tweet_string)

        # Remove usernames
        tweet_string = re.sub(user_pattern, '', tweet_string)

        # Remove non alphabet
        tweet_string = re.sub(alpha_pattern, " ", tweet_string)

        # Remove hashtags
        tweet_string = re.sub(hashtag_pattern, " ", tweet_string)

        tweetwords = ''
        for word in tweet_string.split():
            # Checking if the word is a stopword.
            #if word not in stopwordlist:
            if len(word)>1:
                # Lemmatizing the word.
                word = wordLemm.lemmatize(word)
                tweetwords += (word+' ')

        processed_text.append(tweetwords)

    return processed_text`
    
    
    

>Solution :

The problem is that you replace the non-alphanumeric characters before you remove the hashtags. By the time hashtag_pattern runs, alpha_pattern has already turned the '#' into a space, so the pattern finds nothing to match and the hashtag's text stays in the string. Swap the two steps:

    # Remove hashtags
    tweet_string = re.sub(hashtag_pattern, " ", tweet_string)
    # Remove non-alphanumeric characters
    tweet_string = re.sub(alpha_pattern, " ", tweet_string)
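
To check the reordering in isolation, here is a minimal sketch that applies the question's four patterns to a single tweet with the hashtag step moved ahead of the non-alphanumeric step. The helper name `clean_tweet` and the upper-case constant names are hypothetical, chosen only for this illustration. Lemmatization is left out so the snippet runs without NLTK, and the words are joined with single spaces rather than accumulated with a trailing space as in the original loop.

```python
import re

# Patterns copied from the question.
URL_PATTERN = r"https?://\S+"
USER_PATTERN = r"@[A-Za-z0-9_]+"
HASHTAG_PATTERN = r"#\w+\b"
ALPHA_PATTERN = r"[^a-zA-Z0-9]"


def clean_tweet(tweet_string):
    """Hypothetical helper: clean one tweet, removing hashtags first."""
    tweet_string = tweet_string.lower()
    tweet_string = re.sub(URL_PATTERN, "", tweet_string)
    tweet_string = re.sub(USER_PATTERN, "", tweet_string)
    # Hashtags must be removed while '#' is still present in the string.
    tweet_string = re.sub(HASHTAG_PATTERN, " ", tweet_string)
    # Only now replace the remaining non-alphanumeric characters.
    tweet_string = re.sub(ALPHA_PATTERN, " ", tweet_string)
    # Keep words longer than one character, as the question's loop does.
    return " ".join(word for word in tweet_string.split() if len(word) > 1)


print(clean_tweet("@twitterusername Crazy wind today no birding #Python"))
# prints: crazy wind today no birding
```

The same ordering concern applies to any step that depends on a marker character: URLs and usernames are already handled before the non-alphanumeric replacement in the question's code, which is why only the hashtag step misbehaved.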
    