How to groupby a dataframe by using a column and the last row of the group?

February 16, 2024

This is my DataFrame:

import pandas as pd 

df = pd.DataFrame(
    {
        'x': ['a', 'b', 'c', 'c', 'e', 'f', 'd', 'a', 'b', 'c', 'c', 'e', 'f', 'd'],
        'y': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'f', 'f', 'f', 'f', 'g', 'g', 'g'],
    }
)

And this is the output that I want:

   x  y
0   a  a
1   b  a
2   c  a
3   c  a
7   a  f
8   b  f
9   c  f
10  c  f

    x  y
4   e  b
5   f  b
6   d  b
11  e  g
12  f  g
13  d  g

These are the steps that are needed:

a) Groupby y

b) Groupby last row of x

Basically groups are:

df1 = df.groupby('y').filter(lambda g: g.x.iloc[-1] == 'c')
df2 = df.groupby('y').filter(lambda g: g.x.iloc[-1] == 'd')

In this example I know I have two different values in the last rows, which are c and d, that is why I could filter them But in my data I do not know that.

>Solution :

IIUC, you could use a groupby.transform('last') to generate a novel grouper:

g = df.groupby('y')
last_x = g['x'].transform('last')

for k, group in df.groupby(last_x):
    print(f'group for last x: "{k}"')
    print(group)

NB. I am assuming the y form unique groups. If you can have a,a,b,b,a,a,b,b and this should be considered as 4 independent groups, use g = df.groupby(df['y'].ne(df['y'].shift()).cumsum()).

Faster variant without groupby for the first step, if the y values form unique groups:

mapper = df.drop_duplicates('y', keep='last').set_index('y')['x']
last_x = df['y'].map(mapper)

for k, group in df.groupby(last_x):
    print(f'group for last x: "{k}"')
    print(group)

Or:

last_x = df['x'].mask(df['y'].duplicated(keep='last')).bfill()

for k, group in df.groupby(last_x):
    print(f'group for last x: "{k}"')
    print(group)

Output:

group for last x: "c"
    x  y
0   a  a
1   b  a
2   c  a
3   c  a
7   a  f
8   b  f
9   c  f
10  c  f
group for last x: "d"
    x  y
4   e  b
5   f  b
6   d  b
11  e  g
12  f  g
13  d  g

Intermediate last_x:

0     c
1     c
2     c
3     c
4     d
5     d
6     d
7     c
8     c
9     c
10    c
11    d
12    d
13    d
Name: x, dtype: object

generalization

If you don’t necessarily want the last but an arbitrary function, you can pass a lambda to transform as you did in your example:

group_x = g['x'].transform(lambda g: g.iloc[-1])

output as a dictionary:

out = dict(list(df.groupby(last_x)))

Output:

{'c':     x  y
      0   a  a
      1   b  a
      2   c  a
      3   c  a
      7   a  f
      8   b  f
      9   c  f
      10  c  f,
 'd':     x  y
      4   e  b
      5   f  b
      6   d  b
      11  e  g
      12  f  g
      13  d  g}