Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Efficiently filtering out the last row of a duplicate column

I need to filter out the last row where col2 = 3 but preserve the rest of the dataframe.

I can do that like so, while maintaining the order relative to the index:

import pandas


d = {
     'col1': [0, 1, 2, 3, 3, 3, 3, 4, 5, 6],
     'col2': [0, 11, 21, 31, 32, 33, 34, 41, 51, 61]
    }

df = pandas.DataFrame(d)
df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]

pandas.concat([df2,df3]).sort_index()
    col1 col2
0   0    0
1   1    11
2   2    21
3   3    31
4   3    32
5   3    33
7   4    41
8   5    51
9   6    61

But for a larger dataframe, this operation gets progressively more expensive to perform.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Is there a more efficient way?

UPDATE

Based on the answers provided this far, here are the results:

import pandas
import random

dupes = 1000
rows = 10000000
d = {'col1': [random.choice(range(dupes)) for i in range(rows)], 'col2': [range for range in range(rows)]}
df = pandas.DataFrame(d)

df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]
%timeit pandas.concat([df2,df3]).sort_index()

df = pandas.DataFrame(d)
%timeit df.drop(df['col1'].where(df['col1'].eq(3)).last_valid_index())

df = pandas.DataFrame(d)
idx = df.loc[::-1, 'col1'].eq(3).idxmax()
%timeit df.drop(idx)

df = pandas.DataFrame(d)
%timeit df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]

df = pandas.DataFrame(d)
%timeit df.drop(df.index[df['col1'].eq(3)][-1])

df = pandas.DataFrame(d)
%timeit df.drop((df['col1'].iloc[::-1] == 3).idxmax())

df = pandas.DataFrame(d)
%timeit df.loc[df['col1'].iloc[::-1].ne(3).rank(method = 'first').ne(1)]

df = pandas.DataFrame(d)
%timeit df.drop(index=df[df['col1'].eq(3)].index[-1:], axis=0)
703 ms ± 60.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
497 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
413 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
253 ms ± 6.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
408 ms ± 8.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
404 ms ± 8.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
792 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
491 ms ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>Solution :

Also possible:

out = df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]

out:

   col1  col2
0     0     0
1     1    11
2     2    21
3     3    31
4     3    32
5     3    33
7     4    41
8     5    51
9     6    61
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading