How can I filter groups by comparing the first value of each group and the last cummax that changes conditionally?

June 26, 2024

My DataFrame:

import pandas as pd
df = pd.DataFrame(
    {
        'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e'],
        'num': [1, 2, 3, 1, 12, 12, 13, 2, 4, 2, 5, 6, 10, 20, 30]
    }
)

The expected output consists of three groups from the df above:

   group  num
0      a    1
1      a    2
2      a    3

   group  num
6      c   13
7      c    2
8      c    4

   group  num
12     e   10
13     e   20
14     e   30

Logic:

I want to compare the first value of each group with the last cummax of the num column. This code shows what I mean:

df['last_num'] = df.groupby('group')['num'].tail(1)
df['last_num'] = df.last_num.ffill().cummax()

But I think what I really need is this desired_cummax:

   group  num  last_num  desired_cummax
0      a    1       NaN               3
1      a    2       NaN               3
2      a    3       3.0               3
3      b    1       3.0               3
4      b   12       3.0               3
5      b   12      12.0               3
6      c   13      12.0               3
7      c    2      12.0               3
8      c    4      12.0               4
9      d    2      12.0               4
10     d    5      12.0               4
11     d    6      12.0               4
12     e   10      12.0               4
13     e   20      12.0               4
14     e   30      30.0              30

I don’t want a new cummax if the first value of num in a group is less than last_num.

For example, in group b the first value of num is 1. Since that is less than the last_num on that row (3), the value should not become 12 at the end of group b. It should stay 3.

In group c, the first value (13) is greater than last_num, so a new cummax is set at the end of group c.

After that I want to filter the groups, keeping a group if df.num.iloc[0] > df.desired_cummax.iloc[0] within that group.

Note that the first group should be in the expected output no matter what.

Maybe there is a better approach, but this is what I thought might work.

My attempt was creating last_num but I don’t know how to continue.

Solution:

IIUC, you can aggregate the first and last value per group, mask the last values that should not update the running maximum, and map the result back to the rows by group. Finally, shift the mapped values up by one row:

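# first and last value of num per group (one row per group)
tmp = df.groupby('group')['num'].agg(['first', 'last'])

# keep a group's last only if the previous group's last is <= its first, then running max
s = tmp['last'].where(tmp['last'].shift(fill_value=0).le(tmp['first'])).ffill().cummax()

# map the previous group's value to each row, shift up one row, fill the final row with its num
df['desired_cummax'] = df['group'].map(s.shift().bfill()).shift(-1).fillna(df['num'])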

Output:

   group  num  desired_cummax
0      a    1             3.0
1      a    2             3.0
2      a    3             3.0
3      b    1             3.0
4      b   12             3.0
5      b   12             3.0
6      c   13             3.0
7      c    2             3.0
8      c    4             4.0
9      d    2             4.0
10     d    5             4.0
11     d    6             4.0
12     e   10             4.0
13     e   20             4.0
14     e   30            30.0
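
The code above stops at desired_cummax. The filtering step described in the question could then be sketched roughly as follows. This is a minimal sketch, not part of the original answer: it assumes the rule is applied to the first row of each group, and that the groups appear in df in sorted order (groupby sorts its keys by default; passing sort=False would keep the order of appearance instead). The names g, keep and out are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e'],
        'num': [1, 2, 3, 1, 12, 12, 13, 2, 4, 2, 5, 6, 10, 20, 30]
    }
)

# desired_cummax, computed as in the answer above
tmp = df.groupby('group')['num'].agg(['first', 'last'])
s = tmp['last'].where(tmp['last'].shift(fill_value=0).le(tmp['first'])).ffill().cummax()
df['desired_cummax'] = df['group'].map(s.shift().bfill()).shift(-1).fillna(df['num'])

# compare the first num of each group with the first desired_cummax of that group
g = df.groupby('group')
keep = g['num'].transform('first') > g['desired_cummax'].transform('first')

# the first group is kept no matter what, as the question requires
keep |= df['group'].eq(df['group'].iloc[0])

out = df[keep]
print(out[['group', 'num']])
```

On the sample data this keeps the rows of groups a, c and e, which matches the expected output in the question.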

Intermediates:

# computation of the mapping Series "s"
       first  last  last.shift(fill_value=0)  .le(tmp['first'])  where  .ffill()
group                                                                           
a          1     3                         0               True    3.0       3.0
b          1    12                         3              False    NaN       3.0
c         13     4                        12               True    4.0       4.0
d          2     6                         4              False    NaN       4.0
e         10    30                         6               True   30.0      30.0

# shifting before mapping
          s  s.shift()  .bfill()
group                           
a       3.0        NaN       3.0
b       3.0        3.0       3.0
c       4.0        3.0       3.0
d       4.0        4.0       4.0
e      30.0        4.0       4.0

# mapping
   group  map  .shift(-1)  .fillna(df['num'])
0      a  3.0         3.0                 3.0
1      a  3.0         3.0                 3.0
2      a  3.0         3.0                 3.0
3      b  3.0         3.0                 3.0
4      b  3.0         3.0                 3.0
5      b  3.0         3.0                 3.0
6      c  3.0         3.0                 3.0
7      c  3.0         3.0                 3.0
8      c  3.0         4.0                 4.0
9      d  4.0         4.0                 4.0
10     d  4.0         4.0                 4.0
11     d  4.0         4.0                 4.0
12     e  4.0         4.0                 4.0
13     e  4.0         4.0                 4.0
14     e  4.0         NaN                30.0