Consider the following dataframe, where the suffix of a value in a string column may be repeated:
Book
0 Book1.pdf
1 Book2.pdf.pdf
2 Book3.epub
3 Book4.mobi.mobi
4 Book5.epub.epub
Desired output (duplicated suffixes removed where needed):
Book
0 Book1.pdf
1 Book2.pdf
2 Book3.epub
3 Book4.mobi
4 Book5.epub
I have tried splitting on the . character and then counting occurrences of the last item to check for duplication.
I have used file paths only to illustrate my point! The column could contain something other than paths!
>Solution :
Use a regex with a capturing group + reference and str.replace:
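# capture a trailing ".suffix" that is immediately repeated, and keep one copy
df['Book'] = df['Book'].str.replace(r'(\.[^.]+)\1$', r'\1', regex=True)
# or: delete a ".suffix" only when an identical copy follows it at the end of the string
df['Book'] = df['Book'].str.replace(r'(\.[^.]+)(?=\1$)', '', regex=True)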
Output:
Book
0 Book1.pdf
1 Book2.pdf
2 Book3.epub
3 Book4.mobi
4 Book5.epub
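To see what the backreference does on individual values, here is a minimal sketch that applies the same pattern with the standard-library re module instead of pandas (with regex=True, str.replace relies on the same regex syntax). The helper name drop_duplicated_suffix is hypothetical and used only for illustration.

```python
import re

# Same pattern as above: group 1 is a ".suffix" with no inner dot,
# \1 requires the identical text again, and $ anchors the pair at the end.
DUPLICATED_SUFFIX = re.compile(r'(\.[^.]+)\1$')

def drop_duplicated_suffix(name: str) -> str:
    """Hypothetical helper: collapse a doubled trailing suffix into one copy."""
    return DUPLICATED_SUFFIX.sub(r'\1', name)

books = ['Book1.pdf', 'Book2.pdf.pdf', 'Book3.epub',
         'Book4.mobi.mobi', 'Book5.epub.epub']
cleaned = [drop_duplicated_suffix(b) for b in books]
print(cleaned)

# Two different suffixes in a row are left alone.
print(drop_duplicated_suffix('archive.tar.gz'))

# A suffix repeated three times loses only one copy per pass.
print(drop_duplicated_suffix('Book6.pdf.pdf.pdf'))
```

Note that a single pass removes only one duplicate, so a value with a suffix repeated three or more times would need the replacement applied again.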
Generalization
If you want something generic that doesn’t depend on the . character:
df['Book'] = df['Book'].str.replace(r'(.+)\1$', r'\1', regex=True)
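One caveat with the generic pattern: it collapses any text that is repeated at the end of the string, including a single doubled character. The sketch below checks this with the standard-library re module; the sample strings are made up for illustration and are not from the question.

```python
import re

# Generic pattern: any non-empty text, immediately repeated, at the end of the string.
GENERIC_REPEAT = re.compile(r'(.+)\1$')

# Made-up sample values (assumptions, not data from the question).
samples = ['Book2.pdf.pdf', 'report_v2_v2', 'Book11', 'Book1.pdf']
result = {s: GENERIC_REPEAT.sub(r'\1', s) for s in samples}

for original, fixed in result.items():
    print(f'{original!r} -> {fixed!r}')
```

Here 'Book11' becomes 'Book1' because the trailing "11" counts as a repeated "1", so the dot-anchored version is the safer choice when values may legitimately end in doubled characters.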