I have this dataframe :
import pandas as pd
df = pd.DataFrame({'subject': ['a', 'a', 'b', 'b', 'c', 'd'],
'level': ['hard', None, None, 'easy', None, 'medium']})
print(df)
subject level
0 a hard
1 a None
2 b None
3 b easy
4 c None
5 d medium
When I run this code:
df.groupby('subject').transform(lambda group: print(group))
I get four printed groups. That's fine, because there are four subjects: a, b, c and d.
But I don't understand group 2. It looks as if transform has accumulated the values of the first two groups. There is also a weird indentation that seems to separate the first group from the second one:
# ------------------------ group1
0 hard
1 None
Name: level, dtype: object
# ------------------------ group2
level
0 hard
1 None
2 None
3 easy
Name: level, dtype: object
# ------------------------ group3
4 None
Name: level, dtype: object
# ------------------------ group4
5 medium
Name: level, dtype: object
Can someone please explain the logic to me?
>Solution :
It's not accumulating anything, but transform runs some checks on the first group to see the type of the output, which is why the first group gets printed a second time, this time as a DataFrame. In general you don't use transform for its side effects (you should use apply as shown later), but rather to return something of the same shape as the input.
What exactly happens in your example is easier to see with a custom function:
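To illustrate what "the same shape as the input" means, here is a minimal sketch of the intended use of transform on the question's data. The variable name `counts` is only illustrative; the string `'count'` is the built-in aggregation that counts non-null values.

```python
import pandas as pd

df = pd.DataFrame({'subject': ['a', 'a', 'b', 'b', 'c', 'd'],
                   'level': ['hard', None, None, 'easy', None, 'medium']})

# Count the non-null levels per subject, then broadcast that count
# back to every row of the group: the result has one value per input row.
counts = df.groupby('subject')['level'].transform('count')
print(counts.tolist())  # [1, 1, 1, 1, 0, 1]
```

The result is aligned with the original index, so it can be assigned straight back as a new column.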
def f(group):
    print('---')
    print(group.name)  # with `transform` this shouldn't give the group name
    print(group)
    print('===')

df.groupby('subject').transform(f)
Output:
--- # first group
level
0 hard
1 None
Name: level, dtype: object
===
--- # internal pandas check (not a real group)
a
level
0 hard
1 None
===
--- # second group
level
2 None
3 easy
Name: level, dtype: object
===
--- # third group
level
4 None
Name: level, dtype: object
===
--- # fourth group
level
5 medium
Name: level, dtype: object
===
In comparison, apply does give you the group names, and it is the method to use for this kind of operation:
df.groupby('subject').apply(f)
---
a
subject level
0 a hard
1 a None
===
---
b
subject level
2 b None
3 b easy
===
---
c
subject level
4 c None
===
---
d
subject level
5 d medium
===
In short: don't use transform to manually work on groups.
Here is another example. In transform, group.name returns the name of the current column (a Series), not the group key. See what happens with multiple columns:
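If the goal is only to look at each group (a pure side effect), iterating over the groupby object is another option. This is a small sketch on the same data, not the only way to do it:

```python
import pandas as pd

df = pd.DataFrame({'subject': ['a', 'a', 'b', 'b', 'c', 'd'],
                   'level': ['hard', None, None, 'easy', None, 'medium']})

# Iterating a groupby yields (group key, sub-DataFrame) pairs,
# exactly once per group and with no extra internal calls.
for name, group in df.groupby('subject'):
    print('---')
    print(name)
    print(group)
    print('===')
```

Each group is visited once, so nothing is printed twice.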
df = pd.DataFrame({'subject': ['a', 'a', 'b', 'b', 'c', 'd'],
'level': ['hard', None, None, 'easy', None, 'medium'],
'level2': ['hard', None, None, 'easy', None, 'medium']
})
df.groupby('subject').transform(lambda g: print(g.name))
Printed output:
level # first group, column "level"
level2 # first group, column "level2"
a # some internal check run only once
level # second group, column "level"
level2 # second group, column "level2"
level # etc.
level2
level
level2
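To check what transform really passes to the function, one can record the type and name of each argument instead of printing it. This is a rough sketch: `record` and `calls` are hypothetical names, and the exact sequence of calls is an internal detail of pandas that may differ between versions (it can also depend on what the function returns), so no expected output is shown.

```python
import pandas as pd

df = pd.DataFrame({'subject': ['a', 'a', 'b', 'b', 'c', 'd'],
                   'level': ['hard', None, None, 'easy', None, 'medium'],
                   'level2': ['hard', None, None, 'easy', None, 'medium']})

calls = []  # one (type name, .name attribute) pair per call made by transform

def record(obj):
    # obj is a Series (one column of one group) or, for some internal
    # calls, a DataFrame holding all non-grouping columns of a group.
    calls.append((type(obj).__name__, getattr(obj, 'name', None)))
    return obj  # return the input unchanged so the shape is preserved

result = df.groupby('subject').transform(record)

for kind, name in calls:
    print(kind, name)
```

Entries of kind Series carry a column name such as level or level2, which is the behaviour described above.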