# generate dummy data
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'subject': ['A'] * 10 + ['B'] * 10,
                   'trial': list(range(5)) * 4,
                   'value1': np.random.randint(0, 100, 20),
                   'value2': np.random.randint(0, 100, 20)
                   })
df = df.set_index(['subject', 'trial']).sort_index()
print(df)
               value1  value2
subject trial
A       0          51       1
        0          20      75
        1          92      63
        1          82      57
        2          14      59
        2          86      21
        3          71      20
        3          74      88
        4          60      32
        4          74      48
B       0          87      90
        0          52      79
        1          99      58
        1           1      14
        2          23      41
        2          87      61
        3           2      91
        3          29      61
        4          21      59
        4          37      46
Note: each subject / trial combination has multiple rows.
I want to create an array in which the rows of each combination form a nested dimension.
My current (and, I think, ugly) transformation goes through a list:
tmp = list()
for idx in df.index.unique():
    tmp.append(df.loc[idx].to_numpy())
goal = np.array(tmp)
print(goal)
[[[51 1]
[20 75]]
...
[[21 59]
[37 46]]]
Can you show me a native pandas / numpy way to do it (without the list crutch)?
>Solution :
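For the result to be a non-ragged NumPy array, every index value must occur the same number of times. Because the index is sorted, the rows of each subject / trial combination are already adjacent, so you don't have to loop over them. Just compute the number of rows per unique index value and reshape: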
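# rows per unique (subject, trial) key
n = len(df) / (~df.index.duplicated()).sum()
assert n.is_integer()
# target shape: (number of keys, rows per key, number of columns)
out = df.to_numpy().reshape(-1, int(n), df.shape[1])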
Output:
array([[[51, 1],
[20, 75]],
[[92, 63],
[82, 57]],
[[14, 59],
[86, 21]],
[[71, 20],
[74, 88]],
[[60, 32],
[74, 48]],
[[87, 90],
[52, 79]],
[[99, 58],
[ 1, 14]],
[[23, 41],
[87, 61]],
[[ 2, 91],
[29, 61]],
[[21, 59],
[37, 46]]])
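
In the example above, the number of rows per key and the number of columns are both 2, so the order of the last two axes cannot be seen in the output. Here is a minimal sketch on made-up data with three rows per key, assuming the frame is already sorted by its index. The names `df3`, `sizes` and `ref` are hypothetical and only used for illustration. It checks the group sizes explicitly with `groupby(...).size()` and compares the reshape against the loop from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 2 subjects x 2 trials, 3 rows per (subject, trial) key
df3 = pd.DataFrame({
    "subject": ["A"] * 6 + ["B"] * 6,
    "trial": [0, 0, 0, 1, 1, 1] * 2,
    "value1": np.arange(12),
    "value2": np.arange(12, 24),
}).set_index(["subject", "trial"])  # rows are already ordered by key

# Every key must have the same number of rows, otherwise the array would be ragged
sizes = df3.groupby(level=["subject", "trial"]).size()
assert sizes.nunique() == 1, "keys have different numbers of rows"
n = int(sizes.iloc[0])

# Axes: (number of keys, rows per key, number of columns)
out = df3.to_numpy().reshape(-1, n, df3.shape[1])

# Reference result built with the loop from the question
ref = np.array([df3.loc[key].to_numpy() for key in df3.index.unique()])

print(out.shape)                # (4, 3, 2)
print(np.array_equal(out, ref)) # True
```

The reshape only regroups consecutive rows, so it relies on the rows of each key being adjacent. If the frame is not sorted, call `sort_index()` first.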