Python regex to extract hashtag from within larger string

I have a pandas DataFrame that contains a column of social media captions. Where hashtags have been used, they appear in the format {hashtag|\#|WorldWaterDay}. I want to loop through this column and rewrite these hashtag strings in the format #WorldWaterDay.

I am quite rusty on regex. I can easily find the strings (assuming they all start and end with {}) using ^{.*}$, but I am looking for an efficient way to use regex to find and reformat these hashtags. I could find and split on the hashtag, remove the |, then prepend # to the hashtag text in several steps, but I was hoping someone could suggest a pure regex solution.

Solution:

You just need a regex that will match the existing format:

\{hashtag\|\\#\|([^}]+)}

which matches:

  • \{hashtag\|\\#\| : literally {hashtag|\#|
  • ([^}]+) : one or more non-} characters, captured in group 1
  • } : a } character
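As a quick check of the breakdown above, here is a minimal sketch using Python's standard re module on a single string. The sample caption is made up for illustration; only the pattern comes from the answer.

```python
import re

# Same pattern as in the answer; the caption below is a made-up sample.
pattern = re.compile(r'\{hashtag\|\\#\|([^}]+)}')
caption = r'Save water! {hashtag|\#|WorldWaterDay}'

match = pattern.search(caption)
whole = match.group(0)  # the full {hashtag|\#|...} placeholder
tag = match.group(1)    # only the text captured by ([^}]+)
print(whole, tag)
```

Group 0 is the entire placeholder and group 1 is only the hashtag text, which is the part the replacement reuses.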

You can then replace that with #\1, where \1 stands for the text captured in group 1. In Python:

df['Caption'] = df['Caption'].str.replace(r'\{hashtag\|\\#\|([^}]+)}', r'#\1', regex=True)
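To try the substitution without a DataFrame, the same pattern and replacement can be passed to re.sub. This is a sketch only: the helper name reformat_hashtags and the sample captions are assumptions made for illustration, not part of the original question.

```python
import re

# Hypothetical helper: rewrites every {hashtag|\#|Tag} placeholder as #Tag.
def reformat_hashtags(caption: str) -> str:
    return re.sub(r'\{hashtag\|\\#\|([^}]+)}', r'#\1', caption)

# Made-up sample captions; the first contains two placeholders.
captions = [
    r'Happy {hashtag|\#|WorldWaterDay} from {hashtag|\#|Geneva}!',
    'No hashtags here.',
]
cleaned = [reformat_hashtags(c) for c in captions]
print(cleaned)
```

Every placeholder in a caption is rewritten, and captions without one pass through unchanged. On a column that contains only strings, df['Caption'].map(reformat_hashtags) should give the same result as the str.replace call, but missing values would need separate handling because re.sub only accepts strings.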