Follow

Follow

Contact

Home Efficiently filtering out the last row of a duplicate column

Questions

Efficiently filtering out the last row of a duplicate column

byMR

March 10, 2023

I need to filter out the last row where col2 = 3 but preserve the rest of the dataframe.

I can do that like so, while maintaining the order relative to the index:

import pandas


d = {
     'col1': [0, 1, 2, 3, 3, 3, 3, 4, 5, 6],
     'col2': [0, 11, 21, 31, 32, 33, 34, 41, 51, 61]
    }

df = pandas.DataFrame(d)
df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]

pandas.concat([df2,df3]).sort_index()

    col1 col2
0   0    0
1   1    11
2   2    21
3   3    31
4   3    32
5   3    33
7   4    41
8   5    51
9   6    61

But for a larger dataframe, this operation gets progressively more expensive to perform.

Is there a more efficient way?

UPDATE

Based on the answers provided this far, here are the results:

import pandas
import random

dupes = 1000
rows = 10000000
d = {'col1': [random.choice(range(dupes)) for i in range(rows)], 'col2': [range for range in range(rows)]}
df = pandas.DataFrame(d)

df2 = df[df['col1'] != 3]
df3 = df[df['col1'] == 3].iloc[:-1]
%timeit pandas.concat([df2,df3]).sort_index()

df = pandas.DataFrame(d)
%timeit df.drop(df['col1'].where(df['col1'].eq(3)).last_valid_index())

df = pandas.DataFrame(d)
idx = df.loc[::-1, 'col1'].eq(3).idxmax()
%timeit df.drop(idx)

df = pandas.DataFrame(d)
%timeit df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]

df = pandas.DataFrame(d)
%timeit df.drop(df.index[df['col1'].eq(3)][-1])

df = pandas.DataFrame(d)
%timeit df.drop((df['col1'].iloc[::-1] == 3).idxmax())

df = pandas.DataFrame(d)
%timeit df.loc[df['col1'].iloc[::-1].ne(3).rank(method = 'first').ne(1)]

df = pandas.DataFrame(d)
%timeit df.drop(index=df[df['col1'].eq(3)].index[-1:], axis=0)

703 ms ± 60.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
497 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
413 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
253 ms ± 6.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
408 ms ± 8.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
404 ms ± 8.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
792 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
491 ms ± 142 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>Solution :

Also possible:

out = df.loc[ df["col1"].ne(3) | df["col1"].duplicated(keep="last") ]

out:

   col1  col2
0     0     0
1     1    11
2     2    21
3     3    31
4     3    32
5     3    33
7     4    41
8     5    51
9     6    61

pandas

byMR

Published March 10, 2023

Add a comment

Leave a ReplyCancel reply

Read more

Questions

javascript: How do you use a function from the importing module in a function from the imported module/file

byMR

March 10, 2023

Questions

strange problem during debugging "no executable code of the debugger's target code type is associated with this line"

byMR

March 10, 2023

Questions

Why are aliases not removed (but adding them works fine)?

byMR

March 10, 2023

Questions

Create column representing instance rather than total count

byMR

March 10, 2023

Questions

"ValueError: x and y must be the same size" when plotting with matplotlib

byMR

March 10, 2023

Questions

find the total distance of point on x axis in js?

byMR

March 10, 2023