Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Drop duplicates from a pandas dataframe based on all columns starting from the third one

I have a dataframe with 50 + more columns, and the first 2 are unique IDs. For some reason for different IDs the data from the third column can be the exact same.

What I want to achieve is to delete the duplicates from the dataframe based on all columns starting from the third one. If there are more than 1 rows with different IDs and the same data from the third column, it is all the same which row we will keep, it can be the last one or the first one, whichever is easier to do.

I am fairly new to pandas, what I tried is something like this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df.drop_duplicates(subset=df.iloc[2:], keep="last")

>Solution :

df.drop_duplicates expects a list of column names as the subset argument, so try this:

df.drop_duplicates(subset=df.columns[2:], keep="last")
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading