How to remove possible suffix repetitions from a str column?

Consider the following dataframe, where the suffix in a str column might be repeating itself:

    Book
0   Book1.pdf
1   Book2.pdf.pdf
2   Book3.epub
3   Book4.mobi.mobi
4   Book5.epub.epub

Desired output (repeated suffixes removed where needed):

    Book
0   Book1.pdf
1   Book2.pdf
2   Book3.epub
3   Book4.mobi
4   Book5.epub

I have tried splitting on the . character and then counting occurrences of the last item to check whether there is duplication.
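For reference, a minimal sketch of that split-based idea in plain Python. The helper name drop_repeated_suffix is hypothetical, and the sketch compares only the last two pieces instead of counting occurrences, so that a non-adjacent repeat such as a.pdf.b.pdf is left alone:

```python
def drop_repeated_suffix(name: str) -> str:
    """Remove one trailing repetition of the last dot-separated piece.

    Hypothetical helper, written only to illustrate the split approach.
    """
    parts = name.split(".")
    # Need a stem plus two identical, non-empty trailing pieces.
    if len(parts) >= 3 and parts[-1] and parts[-1] == parts[-2]:
        return ".".join(parts[:-1])
    return name


books = ["Book1.pdf", "Book2.pdf.pdf", "Book3.epub", "Book4.mobi.mobi", "Book5.epub.epub"]
cleaned = [drop_repeated_suffix(b) for b in books]
print(cleaned)
```

On a dataframe, something like df['Book'].map(drop_repeated_suffix) should apply the same function to each value, at the cost of a Python-level call per row.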

I have used file paths only to illustrate my point! The contents of the column could be something different than paths!

>Solution :

Use a regex with a capturing group and a backreference, together with str.replace:

df['Book'] = df['Book'].str.replace(r'(\.[^.]+)\1$', r'\1', regex=True)

# or
df['Book'] = df['Book'].str.replace(r'(\.[^.]+)(?=\1$)', '', regex=True)

Output:

         Book
0   Book1.pdf
1   Book2.pdf
2  Book3.epub
3  Book4.mobi
4  Book5.epub
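To see how the two ideas behave outside pandas, here is a small sketch using Python's re module directly, on the assumption that str.replace with regex=True applies the pattern to each value in the same way. The backreference variant matches the suffix twice and keeps one copy; the lookahead variant, written here with the $ anchor inside the lookahead, matches only the first copy and deletes it. The variable names are illustrative only.

```python
import re

books = ["Book1.pdf", "Book2.pdf.pdf", "Book3.epub", "Book4.mobi.mobi", "Book5.epub.epub"]

# Variant 1: match ".ext" immediately followed by the same text at the
# end of the string, then put back a single copy via the backreference.
backref = re.compile(r'(\.[^.]+)\1$')

# Variant 2: match only the first ".ext", on condition that the same
# text follows and then the string ends; replace that copy with nothing.
lookahead = re.compile(r'(\.[^.]+)(?=\1$)')

cleaned_backref = [backref.sub(r'\1', b) for b in books]
cleaned_lookahead = [lookahead.sub('', b) for b in books]

print(cleaned_backref)
print(cleaned_lookahead)
```

Both variants remove a single repetition. A value with the suffix repeated three times would keep two copies; changing \1$ to \1+$ in the first variant is one way to collapse longer runs.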

Generalization

If you want something generic that doesn't depend on the . character:

df['Book'] = df['Book'].str.replace(r'(.+)\1$', r'\1', regex=True)
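One thing to keep in mind with the generic pattern is that it collapses any text repeated at the end of the string, not only extension-like suffixes. The sketch below uses Python's re module and a few made-up sample values (not from the question) to show this:

```python
import re

# Generic pattern: any non-empty text immediately repeated at the end.
generic = re.compile(r'(.+)\1$')

# Made-up sample values for illustration only.
samples = ["Book2.pdf.pdf", "Book3.epub", "Book11", "report_v2_v2"]
results = {s: generic.sub(r'\1', s) for s in samples}

for original, cleaned in results.items():
    print(original, "->", cleaned)
```

Here Book11 becomes Book1 because the trailing 1 is itself a repetition, so the dot-based pattern is probably the safer choice whenever the values may legitimately end in doubled characters.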
