Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

check to see if first row of a group consist of said groups greatest value in pandas and subset data as such

So let’s say I have the following df

pd.DataFrame(
    {
        'store':['a','a','a','a','b','b','b','b','c','c','c','c'],
        'avg_sales':[100,20,30,25,50,60,70,80,90,60,40,50],
        'month':[0,1,2,3,0,1,2,3,0,1,2,3]
     
     })

    store   avg_sales   month
0   a   100 0
1   a   20  1
2   a   30  2
3   a   25  3
4   b   50  0
5   b   60  1
6   b   70  2
7   b   80  3
8   c   90  0
9   c   60  1
10  c   40  2
11  c   50  3

What I am trying to do is find all stores where their first month (month==0) is the greatest avg_sales value. Thus, my desired output is to just keep store a and c since their stores had the greatest avg_sale on month==0:

    store   avg_sales   month
0   a   100 0
1   a   20  1
2   a   30  2
3   a   25  3
8   c   90  0
9   c   60  1
10  c   40  2
11  c   50  3

I actually started out with a datetime for each month and used the following code to get the cumcounts:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df['month'] = df.sort_values(by=['store','date']).groupby('store').cumcount()

Now I just need to figure out the logic that only keeps stores where their month==0 is their greatest avg_sales number. Obviously you can do something simple like .loc[], but it seems like you need to do another subset within the loc to find out if month==0 is actually the greatest sales number for each group? I’m a bit confused. I could just use np.where() but that wouldn’t account for the multiple groups out-of-the-box…

I can use the following code to make a column that consists of the max avg_sales number for each store:

df['max_sales'] = df.groupby('store').avg_sales.transform('max')


store   avg_sales   month   max_sales
0   a   100 0   100
1   a   20  1   100
2   a   30  2   100
3   a   25  3   100
4   b   50  0   80
5   b   60  1   80
6   b   70  2   80
7   b   80  3   80
8   c   90  0   90
9   c   60  1   90
10  c   40  2   90
11  c   50  3   90

Now I am stuck on the last step of the logic.

EDIT: just fixing grammar

>Solution :

You can doing filter

df.groupby('store').filter(lambda x : x['avg_sales'][x['month']==0] == x['avg_sales'].max())
Out[623]: 
   store  avg_sales  month
0      a        100      0
1      a         20      1
2      a         30      2
3      a         25      3
8      c         90      0
9      c         60      1
10     c         40      2
11     c         50      3

For your transform

s = df.groupby('store').avg_sales.transform('max')

id = df.loc[(df['avg_sales'] ==s) & (df['month'] ==0),'store'] 

out = df.loc[df['store'].isin(id)]

Or we try sort_values with drop_duplicates

s = df.sort_values(['avg_sales']).drop_duplicates('store',keep='last')
df.loc[df.store.isin(s.loc[s['month']==0,'store'])]
Out[630]: 
   store  avg_sales  month
0      a        100      0
1      a         20      1
2      a         30      2
3      a         25      3
8      c         90      0
9      c         60      1
10     c         40      2
11     c         50      3
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading