Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to drop duplicated rows based on pattern change in dataframe?

Imagine I have a dataframe like this one below, with percentages by classes in specific dates for unique IDs:

import pandas as pd
df = pd.DataFrame({"ID":["A","A","A","A","A","A","A"],
                   "DATE":["01-1990","03-1990","04-1990","05-1990","06-1990","07-1990",
                           "08-1990"],
                   "CLASS A":[30,30,0,0,0,30,30],
                   "CLASS B":[50,50,50,50,50,50,50],
                   "CLASS C":[20,20,50,50,50,20,20]})
df
Out[4]: 
  ID     DATE  CLASS A  CLASS B  CLASS C
0  A  01-1990       30       50       20
1  A  03-1990       30       50       20
2  A  04-1990        0       50       50
3  A  05-1990        0       50       50
4  A  06-1990        0       50       50
5  A  07-1990       30       50       20
6  A  08-1990       30       50       20   

I would like to drop duplicated rows based on ID, CLASS A, CLASS B and CLASS C (and keep the first one), but only before it changes to another pattern of percentage. In this example, there are 2 changes of pattern (30/50/20 to 0/50/50 and then to 30/50/20 again). The result should be like this:

  ID     DATE  CLASS A  CLASS B  CLASS C
0  A  01-1990       30       50       20
2  A  04-1990        0       50       50
5  A  07-1990       30       50       20

I know how to remove duplicated rows based on the whole dataframe (df.drop_duplicates), but can’t do this directly in this case as it would remove the rows from index 5 and 6 as well. Anyone could help me?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

>Solution :

In your case, I wouldn’t use drop_duplicates but get the indices to keep using the shift method.

Something like:

compare_df = df[["ID", "CLASS A", "CLASS B", "CLASS C"]]
row_is_like_previous_one = (compare_df == compare_df.shift(1)).all(axis=1)
result = df[~row_is_like_previous_one]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading