Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Fastest operation to combine 2 dataframes with 1D array in each cell to 1 dataframe with 2D array in each cell

I need to create a dataframe with in each cell a 2D array. This 2D array is a combination of one array from dataframe A and one array from dataframe B.

I tried applying a function on dataframe A in which I tried to combine those array combination with numpy. However I’m not sure that this is the fastest/most elegant way. Since I need to combine 2 dataframes of about 200GB this way, I really need the fastest/most elegant solution.

Any suggestions of how to speed this up or to make this a more elegant solution?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Dataframe A:

           col1              col2
0  [1, 2, 3, 4]   [9, 10, 11, 12]
1  [5, 6, 7, 8]  [13, 14, 15, 16]

Dataframe B:

               col1              col2
0  [17, 18, 19, 20]  [25, 26, 27, 28]
1  [21, 22, 23, 24]  [29, 30, 31, 32]

Desired result:

                               col1                                  col2
0  [[1, 2, 3, 4], [17, 18, 19, 20]]   [[9, 10, 11, 12], [25, 26, 27, 28]]
1  [[5, 6, 7, 8], [21, 22, 23, 24]]  [[13, 14, 15, 16], [29, 30, 31, 32]]

Code I used:

import numpy as np
import pandas as pd

def function(x):
  a[x.name] = pd.Series(np.stack([x, b[x.name]], axis=1).tolist(), name=x.name)

data_a = {'col1': [[1, 2, 3, 4], [5, 6, 7, 8]], 'col2': [[9, 10, 11, 12], [13, 14, 15, 16]]}
a = pd.DataFrame(data=data_a)

data_b = {'col1': [[17, 18, 19, 20], [21, 22, 23, 24]], 'col2': [[25, 26, 27, 28], [29, 30, 31, 32]]}
b = pd.DataFrame(data=data_b)

print(a)
print()
print(b)

a.apply(function)

print(a)

>Solution :

Honestly, using arrays/nested lists in a DataFrame is really not why pandas has been designed. Use numpy arrays.

Nevertheless, assuming that the DataFrames are already aligned, you can try:

out = pd.DataFrame(np.dstack([a, b]).tolist(), columns=a.columns)

Output:

                               col1                                  col2
0  [[1, 2, 3, 4], [17, 18, 19, 20]]   [[9, 10, 11, 12], [25, 26, 27, 28]]
1  [[5, 6, 7, 8], [21, 22, 23, 24]]  [[13, 14, 15, 16], [29, 30, 31, 32]]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading