I want to format a text column in a dataframe in the following way:
In entries where the last character of the string is a colon ":", I want to delete the last sentence of the text, i.e. the substring starting right after the last ".", "?" or "!" and ending at that colon.
Example df:
index text
1 Trump met with Putin. Learn more here:
2 New movie by Christopher Nolan! Watch here:
3 Campers: Get ready to stop COVID-19 in its tracks!
4 London was building a bigger rival to the Eiffel Tower. Then it all went wrong.
After formatting, it should look like this:
index text
1 Trump met with Putin.
2 New movie by Christopher Nolan!
3 Campers: Get ready to stop COVID-19 in its tracks!
4 London was building a bigger rival to the Eiffel Tower. Then it all went wrong.
>Solution :
Let's do it with regex (and have more problems):
df.text = df.text.str.replace(r"(?<=[.!?])[^.!?]*:\s*$", "", regex=True)
Now df.text.tolist() is:
['Trump met with Putin.',
'New movie by Christopher Nolan!',
'Campers: Get ready to stop COVID-19 in its tracks!',
'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.']
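The lookbehind (?<=[.!?]) is what keeps the closing punctuation: the match has to start right after a ".", "?" or "!", but that character is not consumed, so it survives the replacement. Note that it is a fixed-width lookbehind (one character), which is all Python's re module supports. The rest of the pattern matches up to a colon at the very end of the string, optionally followed by whitespace, so rows that do not end in a colon are left alone.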
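To sanity-check a pattern of this shape without a dataframe, here is a minimal sketch using only the standard re module. The helper name drop_trailing_colon_sentence and the extra multi-sentence sample are made up for illustration. The pattern uses [^.!?]* between the lookbehind and the colon, on the assumption that only the last sentence should go even when several sentences precede it.

```python
import re

# Lookbehind: start right after a sentence terminator (without consuming it).
# [^.!?]* : the removed part may not contain another terminator, so only the
#           LAST sentence is matched, not everything after the first one.
# :\s*$   : the text must end in a colon (trailing whitespace tolerated).
TRAILING_COLON_SENTENCE = re.compile(r"(?<=[.!?])[^.!?]*:\s*$")


def drop_trailing_colon_sentence(text: str) -> str:
    """Hypothetical helper: strip the final sentence if the text ends with ':'."""
    return TRAILING_COLON_SENTENCE.sub("", text)


samples = [
    "Trump met with Putin. Learn more here:",
    "New movie by Christopher Nolan! Watch here:",
    "Campers: Get ready to stop COVID-19 in its tracks!",
    "London was building a bigger rival to the Eiffel Tower. Then it all went wrong.",
    # Extra case (not from the question): two sentences before the colon one.
    "First point. Second point. Read on:",
]

cleaned = [drop_trailing_colon_sentence(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

The same pattern string (TRAILING_COLON_SENTENCE.pattern) can be passed to df.text.str.replace(..., "", regex=True). A text with no ".", "?" or "!" before the trailing colon is left unchanged, since the lookbehind never matches.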