Python regex to extract hashtag from within larger string

I have a pandas DataFrame that contains a column of social media captions. Where hashtags have been used, they appear in the following format: {hashtag|\#|WorldWaterDay}. I want to loop through this column and reformat these hashtag strings into the format #WorldWaterDay.

I am quite rusty on my regex. I can easily find the strings (assuming they all start and end with {}) using ^{.*}$, but I am looking for an efficient use of regex to find and reformat these hashtags. I could split on the hashtag marker, remove the | characters, and then prepend the # to the hashtag text in several steps, but I was hoping someone could provide some advice on a pure regex solution.

Solution:

You just need a regex that will match the existing format:

\{hashtag\|\\#\|([^}]+)}

which matches:

  • \{hashtag\|\\#\| : literally {hashtag|\#|
  • ([^}]+) : some number of non-} characters, captured in group 1
  • } : a } character
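As a quick check of the breakdown above, here is a minimal sketch using Python's standard re module. The sample caption is made up for illustration; only the pattern comes from the answer.

```python
import re

# Same pattern as above, compiled once for reuse.
HASHTAG_PATTERN = re.compile(r'\{hashtag\|\\#\|([^}]+)}')

# Hypothetical caption; the raw string keeps the literal backslash before '#'.
caption = r'Save water! {hashtag|\#|WorldWaterDay} {hashtag|\#|Water}'

# With a single capture group, findall returns the group 1 text of each match.
tags = HASHTAG_PATTERN.findall(caption)
print(tags)  # ['WorldWaterDay', 'Water']
```

Because the pattern is not anchored with ^ and $, it finds hashtags anywhere inside a longer caption.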

You can then replace that with #\1, i.e. a # followed by whatever group 1 captured. In Python, with pandas:

df['Caption'] = df['Caption'].str.replace(r'\{hashtag\|\\#\|([^}]+)}', r'#\1', regex=True)
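To try the substitution without a DataFrame, the same pattern and replacement can be passed to re.sub. This is only a sketch: the helper name reformat_hashtags and the sample captions are assumptions for illustration, not part of the original question.

```python
import re

HASHTAG_PATTERN = re.compile(r'\{hashtag\|\\#\|([^}]+)}')

def reformat_hashtags(caption: str) -> str:
    # Hypothetical helper: rewrite every {hashtag|\#|Tag} as #Tag,
    # leaving the rest of the caption unchanged.
    return HASHTAG_PATTERN.sub(r'#\1', caption)

# Made-up captions; raw strings preserve the literal backslash.
captions = [
    r'Every drop counts {hashtag|\#|WorldWaterDay}',
    'No hashtags here',
    r'{hashtag|\#|One} and {hashtag|\#|Two}',
]

cleaned = [reformat_hashtags(c) for c in captions]
print(cleaned)
# ['Every drop counts #WorldWaterDay', 'No hashtags here', '#One and #Two']
```

The helper could also be applied per row with df['Caption'].map(reformat_hashtags), assuming the column holds only strings. A missing value (NaN) would make re.sub raise a TypeError, whereas the .str.replace version above passes missing values through.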
