I have a dataframe with 50 or more columns, and the first 2 are unique IDs. For some reason, rows with different IDs can have exactly the same data in every column from the third one onward.
What I want to achieve is to delete the duplicates from the dataframe based on all columns starting from the third one. If more than one row has different IDs but the same data from the third column onward, it does not matter which row is kept; it can be the last one or the first one, whichever is easier to do.
I am fairly new to pandas. What I tried is something like this:
df.drop_duplicates(subset=df.iloc[2:], keep="last")
>Solution :
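df.drop_duplicates expects column labels as the subset argument, but df.iloc[2:] selects rows (from the third row onward), not columns. Use df.columns[2:] to get the column names from the third one onward. Note that drop_duplicates returns a new dataframe rather than modifying df, so assign the result:
df = df.drop_duplicates(subset=df.columns[2:], keep="last")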
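To see the effect on a small case, here is a minimal sketch. The column names (id_a, id_b, val_1, val_2) and the values are made up for illustration and are not from the question; only the drop_duplicates call matches the one above.

```python
import pandas as pd

# Hypothetical toy data: two ID columns followed by two data columns.
df = pd.DataFrame({
    "id_a": [1, 2, 3, 4],
    "id_b": ["w", "x", "y", "z"],
    "val_1": [10, 10, 20, 10],
    "val_2": ["p", "p", "q", "p"],
})

# Compare rows only on the columns from the third one onward,
# keeping the last occurrence of each duplicated group.
deduped = df.drop_duplicates(subset=df.columns[2:], keep="last")

print(deduped)
```

Rows 0, 1 and 3 share the same val_1 and val_2, so only the last of them (index 3) is kept, along with the unique row at index 2. Passing keep="first" would keep index 0 instead.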