Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

What is the best way to filter the groups that have at least N rows that meets the conditions of a mask?

This is my DataFrame:

import pandas as pd
df = pd.DataFrame({
    'a': [10, 20, 30, 50, 50, 50, 4, 100],
    'b': [30, 3, 200, 25, 24, 31, 29, 2],
    'd': list('aaabbbcc')
})

Expected output:

    a    b  d
0  10   30  a
1  20    3  a
2  30  200  a

The grouping is by column d. I want to return the groups that have at least two instances of this mask

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

m = (df.b.gt(df.a))

This is what I have tried. It works but I wonder if there is a better/more efficient way to do it.

out = df.groupby('d').filter(lambda x: len(x.loc[x.b.gt(x.a)]) >= 2)

>Solution :

With

You could use a groupby.transform on the mask with sum to produce a boolean Series:

m = df['b'].gt(df['a'])
out = df[m.groupby(df['d']).transform('sum').ge(2)]

Output:

    a    b  d
0  10   30  a
1  20    3  a
2  30  200  a

Intermediates:

     a    b  d      m  transform('sum')  ge(2)
0   10   30  a   True                 2   True
1   20    3  a  False                 2   True
2   30  200  a   True                 2   True
3   50   25  b  False                 0  False
4   50   24  b  False                 0  False
5   50   31  b  False                 0  False
6    4   29  c   True                 1  False
7  100    2  c  False                 1  False

Alternative:

counts = m.groupby(df['d']).sum()

out = df[df['d'].isin(counts.index[counts>=2])]

With

Alternatively, one could avoid the costly groupby with pure numpy.

This first approach with add.reduceat requires the groups to be consecutive:

groups = df['d'].ne(df['d'].shift()).values
m = df['b'].gt(df['a']).values
idx = np.nonzero(groups)[0]
out =  df[df['d'].isin(df['d'].iloc[idx[np.add.reduceat(m, idx)>=2]])]

This second one with pandas.factorize and numpy.bincount would work even with shuffled groups:

a, idx = pd.factorize(df['d'])
out = df[df['d'].isin(idx[np.bincount(a, weights=m) >= 2])]

timings

enter image description here

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading