How to drop duplicated rows based on pattern change in dataframe?

June 26, 2023

Imagine I have a dataframe like this one below, with percentages by classes in specific dates for unique IDs:

import pandas as pd
df = pd.DataFrame({"ID":["A","A","A","A","A","A","A"],
                   "DATE":["01-1990","03-1990","04-1990","05-1990","06-1990","07-1990",
                           "08-1990"],
                   "CLASS A":[30,30,0,0,0,30,30],
                   "CLASS B":[50,50,50,50,50,50,50],
                   "CLASS C":[20,20,50,50,50,20,20]})
df
Out[4]: 
  ID     DATE  CLASS A  CLASS B  CLASS C
0  A  01-1990       30       50       20
1  A  03-1990       30       50       20
2  A  04-1990        0       50       50
3  A  05-1990        0       50       50
4  A  06-1990        0       50       50
5  A  07-1990       30       50       20
6  A  08-1990       30       50       20

I would like to drop duplicated rows based on ID, CLASS A, CLASS B and CLASS C (and keep the first one), but only before it changes to another pattern of percentage. In this example, there are 2 changes of pattern (30/50/20 to 0/50/50 and then to 30/50/20 again). The result should be like this:

  ID     DATE  CLASS A  CLASS B  CLASS C
0  A  01-1990       30       50       20
2  A  04-1990        0       50       50
5  A  07-1990       30       50       20

I know how to remove duplicated rows based on the whole dataframe (df.drop_duplicates), but can’t do this directly in this case as it would remove the rows from index 5 and 6 as well. Anyone could help me?

>Solution :

In your case, I wouldn’t use drop_duplicates but get the indices to keep using the shift method.

Something like:

compare_df = df[["ID", "CLASS A", "CLASS B", "CLASS C"]]
row_is_like_previous_one = (compare_df == compare_df.shift(1)).all(axis=1)
result = df[~row_is_like_previous_one]