Removing a sentence from text in a DataFrame column

I want to format a text column in a DataFrame in the following way:

In entries where the last character of the string is a colon ":", I want to delete the last sentence of the text, i.e. the substring that starts right after the last ".", "?" or "!" and ends at that colon.

Example df:

index    text
1        Trump met with Putin. Learn more here:
2        New movie by Christopher Nolan! Watch here:
3        Campers: Get ready to stop COVID-19 in its tracks!
4        London was building a bigger rival to the Eiffel Tower. Then it all went wrong.

After formatting, it should look like this:

index    text
1        Trump met with Putin.
2        New movie by Christopher Nolan!
3        Campers: Get ready to stop COVID-19 in its tracks!
4        London was building a bigger rival to the Eiffel Tower. Then it all went wrong.

Solution:

Let's do it with a regex:

# [^.!?]* cannot cross a sentence terminator, so only the final sentence is removed
df.text = df.text.str.replace(r"(?<=[.!?])[^.!?]*:\s*$", "", regex=True)

Now df.text.tolist() is:

['Trump met with Putin.',
 'New movie by Christopher Nolan!',
 'Campers: Get ready to stop COVID-19 in its tracks!',
 'London was building a bigger rival to the Eiffel Tower. Then it all went wrong.']
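
To reproduce this end to end, here is a minimal sketch that rebuilds the example frame from the question and applies the replacement. The variable name `pattern` is chosen here for illustration. The pattern uses a negated character class `[^.!?]*` for the body of the match, an assumption made so that the match cannot cross an earlier sentence terminator.

```python
import pandas as pd

# Rebuild the example frame from the question (index 1..4, one "text" column).
df = pd.DataFrame(
    {
        "text": [
            "Trump met with Putin. Learn more here:",
            "New movie by Christopher Nolan! Watch here:",
            "Campers: Get ready to stop COVID-19 in its tracks!",
            "London was building a bigger rival to the Eiffel Tower. Then it all went wrong.",
        ]
    },
    index=[1, 2, 3, 4],
)

# (?<=[.!?])  start right after a sentence terminator (not consumed)
# [^.!?]*     the trailing sentence, containing no further terminator
# :\s*$       a colon, optional trailing whitespace, then end of string
pattern = r"(?<=[.!?])[^.!?]*:\s*$"

df["text"] = df["text"].str.replace(pattern, "", regex=True)
print(df["text"].tolist())
```

Rows that do not end in a colon are left untouched because the pattern is anchored to the end of the string.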

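The lookbehind (?<=[.!?]) is fixed-width (a single character), which is what Python's re module requires. It anchors the match right after a sentence terminator without removing the terminator itself.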
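
One detail worth checking is what happens when the text holds more than one sentence before the trailing "...:" part. The sketch below compares two ways of writing the body of the match on plain strings, using only the standard `re` module. The names `LAZY`, `STRICT`, and the sample string are made up for this illustration and do not come from the question.

```python
import re

# Lazy body: the search starts at the FIRST terminator it finds,
# so everything after the first sentence is removed.
LAZY = re.compile(r"(?<=[.!?]).*?:\s*$")

# Negated-class body: the match may not contain another terminator,
# so it can only start after the LAST one.
STRICT = re.compile(r"(?<=[.!?])[^.!?]*:\s*$")

multi = "First point. Second point! Read more:"
lazy_result = LAZY.sub("", multi)
strict_result = STRICT.sub("", multi)

print(lazy_result)    # First point.
print(strict_result)  # First point. Second point!

# On the question's single-terminator rows both variants agree.
single = "Trump met with Putin. Learn more here:"
assert LAZY.sub("", single) == STRICT.sub("", single)
```

For the four rows in the question the two variants give the same result; they only differ once a row has several sentences before the final colon.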
