How can I filter groups by comparing the first value of each group and the last cummax that changes conditionally?

My DataFrame:

import pandas as pd
df = pd.DataFrame(
    {
        'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e'],
        'num': [1, 2, 3, 1, 12, 12, 13, 2, 4, 2, 5, 6, 10, 20, 30]
    }
)

The expected output is the following three groups from the DataFrame above:

  group  num
0      a    1
1      a    2
2      a    3

   group  num
6      c   13
7      c    2
8      c    4

   group  num
12     e   10
13     e   20
14     e   30

Logic:

I want to compare the first value of each group with the last cummax of the num column. This code shows what I mean:

df['last_num'] = df.groupby('group')['num'].tail(1)
df['last_num'] = df.last_num.ffill().cummax()

But I think what I really need is this desired_cummax:

   group  num  last_num  desired_cummax
0      a    1       NaN               3
1      a    2       NaN               3
2      a    3       3.0               3
3      b    1       3.0               3
4      b   12       3.0               3
5      b   12      12.0               3
6      c   13      12.0               3
7      c    2      12.0               3
8      c    4      12.0               4
9      d    2      12.0               4
10     d    5      12.0               4
11     d    6      12.0               4
12     e   10      12.0               4
13     e   20      12.0               4
14     e   30      30.0              30

I don’t want a new cummax if the first value of num in a group is less than last_num.

For example, for group b the first value of num is 1. Since that is less than its last_num, the cummax should not become 12 at the end of group b. It should still be 3.

For group c, the first value is greater than last_num, so a new cummax is set at the end of group c.

After that I want to filter the groups, keeping a group if its df.num.iloc[0] > df.desired_cummax.iloc[0].

Note that the first group should be in the expected output no matter what.

Maybe there is a better approach, but this is what I thought might work.

My attempt was creating last_num but I don’t know how to continue.

Solution:

IIUC, you can aggregate as first/last per group, mask the unwanted values and map back to the group. Finally shift one row up:

tmp = df.groupby('group')['num'].agg(['first', 'last'])

s = tmp['last'].where(tmp['last'].shift(fill_value=0).le(tmp['first'])).ffill().cummax()

df['desired_cummax'] = df['group'].map(s.shift().bfill()).shift(-1).fillna(df['num'])

Output:

   group  num  desired_cummax
0      a    1             3.0
1      a    2             3.0
2      a    3             3.0
3      b    1             3.0
4      b   12             3.0
5      b   12             3.0
6      c   13             3.0
7      c    2             3.0
8      c    4             4.0
9      d    2             4.0
10     d    5             4.0
11     d    6             4.0
12     e   10             4.0
13     e   20             4.0
14     e   30            30.0
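
The question also asks for the groups to be filtered once desired_cummax exists, which the snippet above does not cover. A minimal sketch of that step follows, assuming the rule stated in the question (keep a group when its first num is greater than its first desired_cummax, and always keep the first group). The names first_num, first_cmax, keep and filtered are illustrative and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e'],
        'num': [1, 2, 3, 1, 12, 12, 13, 2, 4, 2, 5, 6, 10, 20, 30]
    }
)

# desired_cummax, computed exactly as in the answer
tmp = df.groupby('group')['num'].agg(['first', 'last'])
s = tmp['last'].where(tmp['last'].shift(fill_value=0).le(tmp['first'])).ffill().cummax()
df['desired_cummax'] = df['group'].map(s.shift().bfill()).shift(-1).fillna(df['num'])

# Broadcast each group's first num and first desired_cummax to all of its rows
g = df.groupby('group', sort=False)
first_num = g['num'].transform('first')
first_cmax = g['desired_cummax'].transform('first')

# Keep a group if its first num exceeds its first desired_cummax;
# the first group is kept unconditionally, as the question requires
keep = first_num.gt(first_cmax) | df['group'].eq(df['group'].iloc[0])
filtered = df[keep]

# One DataFrame per remaining group, in order of appearance
for _, grp in filtered.groupby('group', sort=False):
    print(grp[['group', 'num']], end='\n\n')
```

On the sample data this leaves groups a, c and e, matching the expected output in the question.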

Intermediates:

# computation of the mapping Series "s"
       first  last  last.shift(fill_value=0)  .le(tmp['first'])  where  .ffill()
group                                                                           
a          1     3                         0               True    3.0       3.0
b          1    12                         3              False    NaN       3.0
c         13     4                        12               True    4.0       4.0
d          2     6                         4              False    NaN       4.0
e         10    30                         6               True   30.0      30.0

# shifting before mapping
          s  s.shift()  .bfill()
group                           
a       3.0        NaN       3.0
b       3.0        3.0       3.0
c       4.0        3.0       3.0
d       4.0        4.0       4.0
e      30.0        4.0       4.0

# mapping
   group  map  .shift(-1)  .fillna(df['num'])
0      a  3.0         3.0                 3.0
1      a  3.0         3.0                 3.0
2      a  3.0         3.0                 3.0
3      b  3.0         3.0                 3.0
4      b  3.0         3.0                 3.0
5      b  3.0         3.0                 3.0
6      c  3.0         3.0                 3.0
7      c  3.0         3.0                 3.0
8      c  3.0         4.0                 4.0
9      d  4.0         4.0                 4.0
10     d  4.0         4.0                 4.0
11     d  4.0         4.0                 4.0
12     e  4.0         4.0                 4.0
13     e  4.0         4.0                 4.0
14     e  4.0         NaN                30.0
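
As a cross-check, the same column can be built with an explicit loop over the groups. This is only a sketch of one reading of the question: it compares each group's first value with the running value carried so far, whereas the vectorized expression above compares it with the previous group's last value. The two agree on this sample but may differ on other data. The names running and values are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e'],
        'num': [1, 2, 3, 1, 12, 12, 13, 2, 4, 2, 5, 6, 10, 20, 30]
    }
)

running = None   # value carried from one group to the next
values = []      # one entry per row of df

# sort=False keeps the groups in their order of appearance
for _, grp in df.groupby('group', sort=False):
    first, last = grp['num'].iloc[0], grp['num'].iloc[-1]
    if running is None:
        # First group: every row takes the group's last value
        running = last
        values.extend([running] * len(grp))
    elif first > running:
        # Rows before the end keep the old value,
        # the final row of the group receives the new cummax
        values.extend([running] * (len(grp) - 1))
        running = max(running, last)
        values.append(running)
    else:
        # First value does not exceed the running value: nothing changes
        values.extend([running] * len(grp))

df['desired_cummax'] = values
print(df)
```

The loop is slower than the vectorized version on large frames, but it makes the per-group decision easier to inspect.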