My input is a pandas DataFrame:
   item foo_x foo_y bar_x bar_y
0     1     A     B     C     D
1     2     D     E     F     G
2     3     H     I     J     K
3     4     L     M     N     O
import pandas as pd

df = pd.DataFrame({'item': [1, 2, 3, 4],
                   'foo_x': ['A', 'D', 'H', 'L'],
                   'foo_y': ['B', 'E', 'I', 'M'],
                   'bar_x': ['C', 'F', 'J', 'N'],
                   'bar_y': ['D', 'G', 'K', 'O']})
I’m not asking much of the groupby method; I only expect this standard aggregation:
   item       x       y
0     1  [A, C]  [B, D]
1     2  [D, F]  [E, G]
2     3  [H, J]  [I, K]
3     4  [L, N]  [M, O]
But my code below raises an error that makes no sense to me:
df_output = (
df.rename(lambda x: x.split("_")[-1], axis=1)
.groupby(level=0, axis=1).agg(list)
)
ValueError: Length of values (2) does not match length of index (4)
To be honest, this is completely counterintuitive given how groupby(..., axis=0) usually behaves.
Can you please explain the logic behind it?
>Solution :
The issue is that iterating over a DataFrame yields the column names:
list(pd.DataFrame({'A': [1, 2], 'B': [3, 4]}))
# ['A', 'B']
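This also holds when column labels are duplicated, which is the situation after the rename. A minimal sketch, assuming a hand-built frame that stands in for the two columns labelled x (the names group, labels, and rows are only illustrative):

```python
import pandas as pd

# Stand-in for the "x" group: two columns that share the label "x" after renaming.
group = pd.DataFrame([["A", "C"], ["D", "F"], ["H", "J"], ["L", "N"]],
                     columns=["x", "x"])

# list() iterates over the column labels, so it has one entry per column.
labels = list(group)

# Converting to NumPy first iterates over the rows instead.
rows = group.to_numpy().tolist()

print(labels)  # ['x', 'x']
print(rows)    # [['A', 'C'], ['D', 'F'], ['H', 'J'], ['L', 'N']]
```

The first result has 2 entries and the second has 4, one per row of the frame.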
Using a small print hack to see what’s going on in our groupby:
(df.rename(lambda x: x.split("_")[-1], axis=1)
.groupby(level=0, axis=1).agg(lambda x: print(list(x)))
)
Printed output:
['item']
['x', 'x']
['y', 'y']
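So for the x and y groups, list(x) returns the two column labels rather than the row values, and pandas cannot align a list of length 2 with the 4-row index, hence the ValueError. To avoid that, convert each group to a NumPy array first, so that tolist() returns one list per row: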
df_output = (
df.rename(lambda x: x.split("_")[-1], axis=1)
.groupby(level=0, axis=1).agg(lambda x: x.to_numpy().tolist())
)
Output:
  item       x       y
0  [1]  [A, C]  [B, D]
1  [2]  [D, F]  [E, G]
2  [3]  [H, J]  [I, K]
3  [4]  [L, N]  [M, O]
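Note that item is also wrapped in a list here, unlike in the expected output from the question. If that matters, or if groupby(..., axis=1) is not available (as far as I know it is deprecated in recent pandas releases), one possible alternative is to build each list column by hand. A minimal sketch, assuming the suffixes x and y are known in advance; the names out and cols are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'item': [1, 2, 3, 4],
                   'foo_x': ['A', 'D', 'H', 'L'],
                   'foo_y': ['B', 'E', 'I', 'M'],
                   'bar_x': ['C', 'F', 'J', 'N'],
                   'bar_y': ['D', 'G', 'K', 'O']})

# Keep "item" as a plain scalar column.
out = df[["item"]].copy()

for suffix in ["x", "y"]:
    # Columns ending in "_x" (or "_y"), in their original order.
    cols = [c for c in df.columns if c.endswith("_" + suffix)]
    # One list per row, wrapped in a Series so each list stays a single cell.
    out[suffix] = pd.Series(df[cols].to_numpy().tolist(), index=df.index)

print(out)
```

This avoids the column-wise groupby entirely, at the cost of hard-coding the suffixes.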