How to group a DataFrame by a column and then by the last row of each group?

This is my DataFrame:

import pandas as pd 

df = pd.DataFrame(
    {
        'x': ['a', 'b', 'c', 'c', 'e', 'f', 'd', 'a', 'b', 'c', 'c', 'e', 'f', 'd'],
        'y': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'f', 'f', 'f', 'f', 'g', 'g', 'g'],
    }
)

And this is the output that I want:

    x  y
0   a  a
1   b  a
2   c  a
3   c  a
7   a  f
8   b  f
9   c  f
10  c  f

    x  y
4   e  b
5   f  b
6   d  b
11  e  g
12  f  g
13  d  g

These are the steps that are needed:

a) Group by y

b) Group those groups by the value of x in their last row

Basically the groups are:

df1 = df.groupby('y').filter(lambda g: g.x.iloc[-1] == 'c')
df2 = df.groupby('y').filter(lambda g: g.x.iloc[-1] == 'd')

In this example I know that the last rows contain two different values, c and d, which is why I could filter on them. But in my real data I do not know these values in advance.

Solution:

IIUC, you could use groupby.transform('last') to build a new grouper:

g = df.groupby('y')
last_x = g['x'].transform('last')

for k, group in df.groupby(last_x):
    print(f'group for last x: "{k}"')
    print(group)

NB. I am assuming the y values form unique groups. If you can have a,a,b,b,a,a,b,b and this should be treated as 4 independent groups, use g = df.groupby(df['y'].ne(df['y'].shift()).cumsum()).
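As a rough sketch of that consecutive-run case, the snippet below uses made-up data (the x values here are hypothetical, not from the question) in which y follows the a,a,b,b,a,a,b,b pattern. The name run_id is only illustrative.

```python
import pandas as pd

# Hypothetical data: y repeats non-contiguously (a,a,b,b,a,a,b,b).
df = pd.DataFrame(
    {
        'x': ['p', 'q', 'r', 's', 't', 'u', 'v', 'w'],
        'y': ['a', 'a', 'b', 'b', 'a', 'a', 'b', 'b'],
    }
)

# A new run starts wherever y differs from the previous row;
# the cumulative sum of those change flags labels the runs 1, 2, 3, ...
run_id = df['y'].ne(df['y'].shift()).cumsum()

# Last x within each consecutive run, broadcast back to every row.
last_x = df.groupby(run_id)['x'].transform('last')

print(run_id.tolist())  # [1, 1, 2, 2, 3, 3, 4, 4]
print(last_x.tolist())  # ['q', 'q', 's', 's', 'u', 'u', 'w', 'w']
```

With a plain df.groupby('y') the two runs of a would be merged into one group, so every a row would get 'u' as its last x.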

Faster variant without groupby for the first step, if the y values form unique groups:

mapper = df.drop_duplicates('y', keep='last').set_index('y')['x']
last_x = df['y'].map(mapper)

for k, group in df.groupby(last_x):
    print(f'group for last x: "{k}"')
    print(group)

Or:

last_x = df['x'].mask(df['y'].duplicated(keep='last')).bfill()

for k, group in df.groupby(last_x):
    print(f'group for last x: "{k}"')
    print(group)

Output:

group for last x: "c"
    x  y
0   a  a
1   b  a
2   c  a
3   c  a
7   a  f
8   b  f
9   c  f
10  c  f
group for last x: "d"
    x  y
4   e  b
5   f  b
6   d  b
11  e  g
12  f  g
13  d  g
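The three ways of building last_x shown above should give the same grouper on this data. A small check, assuming the DataFrame from the question (the names via_transform, via_map and via_mask are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {
        'x': ['a', 'b', 'c', 'c', 'e', 'f', 'd', 'a', 'b', 'c', 'c', 'e', 'f', 'd'],
        'y': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'f', 'f', 'f', 'f', 'g', 'g', 'g'],
    }
)

# Variant 1: groupby + transform broadcasts the last x of each y group.
via_transform = df.groupby('y')['x'].transform('last')

# Variant 2: keep the last row per y, then map y -> x.
mapper = df.drop_duplicates('y', keep='last').set_index('y')['x']
via_map = df['y'].map(mapper)

# Variant 3: blank out every x except the last one per y, then back-fill.
via_mask = df['x'].mask(df['y'].duplicated(keep='last')).bfill()

same = via_transform.tolist() == via_map.tolist() == via_mask.tolist()
print(same)  # True
```

Note that the third variant relies on the rows of each y group being contiguous, since bfill only looks at the next non-missing row.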

Intermediate last_x:

0     c
1     c
2     c
3     c
4     d
5     d
6     d
7     c
8     c
9     c
10    c
11    d
12    d
13    d
Name: x, dtype: object

Generalization:

If you don’t necessarily want the last value but the result of an arbitrary function, you can pass a lambda to transform as you did in your example:

group_x = g['x'].transform(lambda s: s.iloc[-1])

Output as a dictionary:

out = dict(list(df.groupby(last_x)))

Output:

{'c':     x  y
      0   a  a
      1   b  a
      2   c  a
      3   c  a
      7   a  f
      8   b  f
      9   c  f
      10  c  f,
 'd':     x  y
      4   e  b
      5   f  b
      6   d  b
      11  e  g
      12  f  g
      13  d  g}
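To illustrate the generalization with something other than the last value, here is a minimal sketch that labels each y group by its first x instead and collects the result in a dictionary. The choice of "first" and the name first_x are assumptions made for the example only.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'x': ['a', 'b', 'c', 'c', 'e', 'f', 'd', 'a', 'b', 'c', 'c', 'e', 'f', 'd'],
        'y': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'f', 'f', 'f', 'f', 'g', 'g', 'g'],
    }
)

# Hypothetical rule: label each y group by its first x value.
# The lambda returns one scalar per group, which transform broadcasts to all rows.
first_x = df.groupby('y')['x'].transform(lambda s: s.iloc[0])

# Keys are the distinct labels, values are the matching sub-DataFrames.
out = {key: group for key, group in df.groupby(first_x)}

print(sorted(out))               # ['a', 'e']
print(out['e'].index.tolist())   # [4, 5, 6, 11, 12, 13]
```

Each sub-DataFrame keeps its original index, so rows can still be traced back to df.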