This is my DataFrame:
import pandas as pd
df = pd.DataFrame(
{
'x': ['a', 'b', 'c', 'c', 'e', 'f', 'd', 'a', 'b', 'c', 'c', 'e', 'f', 'd'],
'y': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'f', 'f', 'f', 'f', 'g', 'g', 'g'],
}
)
And this is the output that I want:
x y
0 a a
1 b a
2 c a
3 c a
7 a f
8 b f
9 c f
10 c f
x y
4 e b
5 f b
6 d b
11 e g
12 f g
13 d g
These are the steps that are needed:
a) Groupby y
b) Groupby last row of x
Basically groups are:
df1 = df.groupby('y').filter(lambda g: g.x.iloc[-1] == 'c')
df2 = df.groupby('y').filter(lambda g: g.x.iloc[-1] == 'd')
In this example I know I have two different values in the last rows, which are c
and d
, that is why I could filter
them But in my data I do not know that.
>Solution :
IIUC, you could use a groupby.transform('last')
to generate a novel grouper:
g = df.groupby('y')
last_x = g['x'].transform('last')
for k, group in df.groupby(last_x):
print(f'group for last x: "{k}"')
print(group)
NB. I am assuming the y
form unique groups. If you can have a,a,b,b,a,a,b,b and this should be considered as 4 independent groups, use g = df.groupby(df['y'].ne(df['y'].shift()).cumsum())
.
Faster variant without groupby
for the first step, if the y
values form unique groups:
mapper = df.drop_duplicates('y', keep='last').set_index('y')['x']
last_x = df['y'].map(mapper)
for k, group in df.groupby(last_x):
print(f'group for last x: "{k}"')
print(group)
Or:
last_x = df['x'].mask(df['y'].duplicated(keep='last')).bfill()
for k, group in df.groupby(last_x):
print(f'group for last x: "{k}"')
print(group)
Output:
group for last x: "c"
x y
0 a a
1 b a
2 c a
3 c a
7 a f
8 b f
9 c f
10 c f
group for last x: "d"
x y
4 e b
5 f b
6 d b
11 e g
12 f g
13 d g
Intermediate last_x
:
0 c
1 c
2 c
3 c
4 d
5 d
6 d
7 c
8 c
9 c
10 c
11 d
12 d
13 d
Name: x, dtype: object
generalization
If you don’t necessarily want the last but an arbitrary function, you can pass a lambda to transform
as you did in your example:
group_x = g['x'].transform(lambda g: g.iloc[-1])
output as a dictionary:
out = dict(list(df.groupby(last_x)))
Output:
{'c': x y
0 a a
1 b a
2 c a
3 c a
7 a f
8 b f
9 c f
10 c f,
'd': x y
4 e b
5 f b
6 d b
11 e g
12 f g
13 d g}