I need to create a dataframe with in each cell a 2D array. This 2D array is a combination of one array from dataframe A and one array from dataframe B.
I tried applying a function on dataframe A in which I tried to combine those array combination with numpy. However I’m not sure that this is the fastest/most elegant way. Since I need to combine 2 dataframes of about 200GB this way, I really need the fastest/most elegant solution.
Any suggestions of how to speed this up or to make this a more elegant solution?
Dataframe A:
col1 col2
0 [1, 2, 3, 4] [9, 10, 11, 12]
1 [5, 6, 7, 8] [13, 14, 15, 16]
Dataframe B:
col1 col2
0 [17, 18, 19, 20] [25, 26, 27, 28]
1 [21, 22, 23, 24] [29, 30, 31, 32]
Desired result:
col1 col2
0 [[1, 2, 3, 4], [17, 18, 19, 20]] [[9, 10, 11, 12], [25, 26, 27, 28]]
1 [[5, 6, 7, 8], [21, 22, 23, 24]] [[13, 14, 15, 16], [29, 30, 31, 32]]
Code I used:
import numpy as np
import pandas as pd
def function(x):
a[x.name] = pd.Series(np.stack([x, b[x.name]], axis=1).tolist(), name=x.name)
data_a = {'col1': [[1, 2, 3, 4], [5, 6, 7, 8]], 'col2': [[9, 10, 11, 12], [13, 14, 15, 16]]}
a = pd.DataFrame(data=data_a)
data_b = {'col1': [[17, 18, 19, 20], [21, 22, 23, 24]], 'col2': [[25, 26, 27, 28], [29, 30, 31, 32]]}
b = pd.DataFrame(data=data_b)
print(a)
print()
print(b)
a.apply(function)
print(a)
>Solution :
Honestly, using arrays/nested lists in a DataFrame is really not why pandas has been designed. Use numpy arrays.
Nevertheless, assuming that the DataFrames are already aligned, you can try:
out = pd.DataFrame(np.dstack([a, b]).tolist(), columns=a.columns)
Output:
col1 col2
0 [[1, 2, 3, 4], [17, 18, 19, 20]] [[9, 10, 11, 12], [25, 26, 27, 28]]
1 [[5, 6, 7, 8], [21, 22, 23, 24]] [[13, 14, 15, 16], [29, 30, 31, 32]]