I am working on a task wherein, I need to filter in rows that contain the group’s max and min values and filter out other rows. This is to understand how the values change at each decile.
np.random.seed(0)
df = pd.DataFrame({'id' : range(1,31),
'score' : np.random.uniform(size = 30)})
df
id score
0 1 0.548814
1 2 0.715189
2 3 0.602763
3 4 0.544883
4 5 0.423655
5 6 0.645894
6 7 0.437587
7 8 0.891773
8 9 0.963663
9 10 0.383442
10 11 0.791725
11 12 0.528895
12 13 0.568045
13 14 0.925597
14 15 0.071036
15 16 0.087129
16 17 0.020218
17 18 0.832620
18 19 0.778157
19 20 0.870012
20 21 0.978618
21 22 0.799159
22 23 0.461479
23 24 0.780529
24 25 0.118274
25 26 0.639921
26 27 0.143353
27 28 0.944669
28 29 0.521848
29 30 0.414662
I then add the decile column using:
df['decile'] = pd.qcut(df['score'], 10, labels=False)
Now I tried both:
df.transform((df['score'] == df.groupby('decile')['score'].min()) or (df['score'] == df.groupby('decile')['score'].max()))
and
df.transform(df['score'].eq(df.groupby('decile')['score'].min().values).any() or df['score'].eq(df.groupby('decile')['score'].max().values).any())
But both are not working, can someone please help with this.
>Solution :
Use DataFrame.transform with 2 masks by Series.eq chained by bitwise OR – |:
g = df.groupby('decile')['score']
out = df[df['score'].eq(g.transform('min')) | df['score'].eq(g.transform('max'))]
print (out)
id score decile
1 2 0.715189 6
2 3 0.602763 5
3 4 0.544883 4
5 6 0.645894 5
6 7 0.437587 2
9 10 0.383442 1
10 11 0.791725 7
11 12 0.528895 3
12 13 0.568045 4
13 14 0.925597 8
15 16 0.087129 0
16 17 0.020218 0
17 18 0.832620 7
19 20 0.870012 8
20 21 0.978618 9
22 23 0.461479 3
23 24 0.780529 6
24 25 0.118274 1
27 28 0.944669 9
29 30 0.414662 2
Details:
print (df.assign(min=df['score'].eq(g.transform('min')),
max=df['score'].eq(g.transform('max')),
both = lambda x: x['min'] | x['max']))
id score decile min max both
0 1 0.548814 4 False False False
1 2 0.715189 6 True False True
2 3 0.602763 5 True False True
3 4 0.544883 4 True False True
4 5 0.423655 2 False False False
5 6 0.645894 5 False True True
6 7 0.437587 2 False True True
7 8 0.891773 8 False False False
8 9 0.963663 9 False False False
9 10 0.383442 1 False True True
10 11 0.791725 7 True False True
11 12 0.528895 3 False True True
12 13 0.568045 4 False True True
13 14 0.925597 8 False True True
14 15 0.071036 0 False False False
15 16 0.087129 0 False True True
16 17 0.020218 0 True False True
17 18 0.832620 7 False True True
18 19 0.778157 6 False False False
19 20 0.870012 8 True False True
20 21 0.978618 9 False True True
21 22 0.799159 7 False False False
22 23 0.461479 3 True False True
23 24 0.780529 6 False True True
24 25 0.118274 1 True False True
25 26 0.639921 5 False False False
26 27 0.143353 1 False False False
27 28 0.944669 9 True False True
28 29 0.521848 3 False False False
29 30 0.414662 2 True False True
Another idea with lambda function, but solution is slow:
out = df[df.groupby('decile')['score'].transform(lambda x: x.eq(x.min()) | x.eq(x.max()))]
np.random.seed(0)
df = pd.DataFrame({'id' : range(1,30001),
'score' : np.random.uniform(size = 30000)})
df['decile'] = pd.qcut(df['score'], 5000, labels=False)
In [44]: %%timeit
...: g = df.groupby('decile')['score']
...: out = df[df['score'].eq(g.transform('min')) | df['score'].eq(g.transform('max'))]
...:
3.27 ms ± 239 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [45]: %%timeit
...: df[df.groupby('decile')['score'].transform(lambda x: x.eq(x.min()) | x.eq(x.max()))]
...:
1.45 s ± 12.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)