Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Deleting immediately subsequent rows which are the exact same as the previous for specific columns

I have a dataframe similar to the following.

import pandas as pd

data = pd.DataFrame({'ind': [111,222,333,444,555,666,777,888,999,000], 
                     'col1': [1,2,2,2,3,4,5,5,6,7], 
                     'col2': [9,2,2,2,9,9,5,5,9,9], 
                     'col3': [11,2,2,2,11,11,5,5,11,11], 
                     'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']})

There is an index ind, a number of columns col 1, 2 and 3, and some other column with a value val. Within the three columns 1, 2 and 3 there are a number of rows which are the exact same as the previous row, for instance row with index 333 and 444 are the same as 222. My actual data set is larger but what I need to do is delete all rows which have the exact same value as the immediate previous row for a number of columns (col1, col2, col3 here).

This would give me a dataframe like this with indeces 333/444 and 888 removed:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

data_clean = pd.DataFrame({'ind': [111,222,555,666,777,999,000], 
                     'col1': [1,2,3,4,5,6,7], 
                     'col2': [9,2,9,9,5,9,9], 
                     'col3': [11,2,11,11,5,11,11], 
                     'val': ['a', 'b', 'e', 'f', 'g', 'i', 'j']})

What is the best way to go about this for a larger dataframe?

>Solution :

You can use shift and any for boolean indexing:

cols = ['col1', 'col2', 'col3']
out = data[data[cols].ne(data[cols].shift()).any(axis=1)]
# DeMorgan's equivalent:
# out = data[~data[cols].eq(data[cols].shift()).all(axis=1)]

Output:

   ind  col1  col2  col3 val
0  111     1     9    11   a
1  222     2     2     2   b
4  555     3     9    11   e
5  666     4     9    11   f
6  777     5     5     5   g
8  999     6     9    11   i
9    0     7     9    11   j

Intermediates

# shifted dataset
data[cols].shift()

   col1  col2  col3
0   NaN   NaN   NaN
1   1.0   9.0  11.0
2   2.0   2.0   2.0
3   2.0   2.0   2.0
4   2.0   2.0   2.0
5   3.0   9.0  11.0
6   4.0   9.0  11.0
7   5.0   5.0   5.0
8   5.0   5.0   5.0
9   6.0   9.0  11.0

# comparison
data[cols].ne(data[cols].shift())

    col1   col2   col3
0   True   True   True
1   True   True   True
2  False  False  False
3  False  False  False
4   True   True   True
5   True  False  False
6   True   True   True
7  False  False  False
8   True   True   True
9   True  False  False

# aggregation
data[cols].ne(data[cols].shift()).any(axis=1)

0     True
1     True
2    False
3    False
4     True
5     True
6     True
7    False
8     True
9     True
dtype: bool
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading