I have a pandas dataframe that contains a column of social media captions. Where hashtags have been used they appear in the following format {hashtag|\#|WorldWaterDay}
. I want to loop though this column and reformat these hashtags strings in the format #WorldWaterDay
.
I am quite rusty on my regex. I can easily find the strings (assuming they all start and end with {}
) using ^{.*}$
, but I am looking for an efficient use of regex to find and reformat these hashtags. I can find and split on the hashtag, remove the |
then append the hashtag to the hashtag text in several steps, but I was hoping someone could provide some advice on a pure regex solution.
>Solution :
You just need a regex that will match the existing format:
\{hashtag\|\\#\|([^}]+)}
which matches:
\{hashtag\|\\#\|
: literally{hashtag|\#|
([^}]+)
: some number of non-}
characters, captured in group 1}
: a}
character
You can then replace that with #\1
. In python:
df['Caption'] = df['Caption'].str.replace(r'\{hashtag\|\\#\|([^}]+)}', r'#\1', regex=True)