Home What is the best way to filter the groups that have at least N rows that meets the conditions of a mask?

Questions

What is the best way to filter the groups that have at least N rows that meets the conditions of a mask?

November 6, 2024

This is my DataFrame:

import pandas as pd
df = pd.DataFrame({
    'a': [10, 20, 30, 50, 50, 50, 4, 100],
    'b': [30, 3, 200, 25, 24, 31, 29, 2],
    'd': list('aaabbbcc')
})

Expected output:

    a    b  d
0  10   30  a
1  20    3  a
2  30  200  a

The grouping is by column d. I want to return the groups that have at least two instances of this mask

m = (df.b.gt(df.a))

This is what I have tried. It works but I wonder if there is a better/more efficient way to do it.

out = df.groupby('d').filter(lambda x: len(x.loc[x.b.gt(x.a)]) >= 2)

>Solution :

With pandas

You could use a groupby.transform on the mask with sum to produce a boolean Series:

m = df['b'].gt(df['a'])
out = df[m.groupby(df['d']).transform('sum').ge(2)]

Output:

    a    b  d
0  10   30  a
1  20    3  a
2  30  200  a

Intermediates:

     a    b  d      m  transform('sum')  ge(2)
0   10   30  a   True                 2   True
1   20    3  a  False                 2   True
2   30  200  a   True                 2   True
3   50   25  b  False                 0  False
4   50   24  b  False                 0  False
5   50   31  b  False                 0  False
6    4   29  c   True                 1  False
7  100    2  c  False                 1  False

Alternative:

counts = m.groupby(df['d']).sum()

out = df[df['d'].isin(counts.index[counts>=2])]

With numpy

Alternatively, one could avoid the costly groupby with pure numpy.

This first approach with add.reduceat requires the groups to be consecutive:

groups = df['d'].ne(df['d'].shift()).values
m = df['b'].gt(df['a']).values
idx = np.nonzero(groups)[0]
out =  df[df['d'].isin(df['d'].iloc[idx[np.add.reduceat(m, idx)>=2]])]

This second one with pandas.factorize and numpy.bincount would work even with shuffled groups:

a, idx = pd.factorize(df['d'])
out = df[df['d'].isin(idx[np.bincount(a, weights=m) >= 2])]

timings

dataframe

byMR

Published November 06, 2024

Add a comment

candidate template ignored: couldn't infer template argument 'T' with std::variant

byMR

November 6, 2024

Questions

create a file using cat command inside bash script

byMR

November 6, 2024

Questions

remove validators in angular reactive form not working

byMR

November 6, 2024

Questions

Why does using nested loops cause the program to freeze?

byMR

November 6, 2024

Questions

How to properly send an array of objects to the backend using fetch?

byMR

November 6, 2024

Questions

How to select column range based on partial column names in Pandas Python?

byMR

November 6, 2024

What is the best way to filter the groups that have at least N rows that meets the conditions of a mask?

MEDevel.com: Open-source for Healthcare and Education

>Solution :

With pandas

With numpy

timings

Like this:

Leave a ReplyCancel reply

Read more

candidate template ignored: couldn't infer template argument 'T' with std::variant

create a file using cat command inside bash script

remove validators in angular reactive form not working

Why does using nested loops cause the program to freeze?

How to properly send an array of objects to the backend using fetch?

How to select column range based on partial column names in Pandas Python?

Keep Up to Date with the Most Important News

What is the best way to filter the groups that have at least N rows that meets the conditions of a mask?

MEDevel.com: Open-source for Healthcare and Education

>Solution :

With pandas

With numpy

timings

Share this:

Like this:

Leave a ReplyCancel reply

Keep Up to Date with the Most Important News

Read more

candidate template ignored: couldn't infer template argument 'T' with std::variant

create a file using cat command inside bash script

remove validators in angular reactive form not working

Why does using nested loops cause the program to freeze?

How to properly send an array of objects to the backend using fetch?

How to select column range based on partial column names in Pandas Python?

Discover more from Dev solutions