Here is a generic code representing what is happening in my script:
import pandas as pd
import numpy as np
dic = {}
for i in np.arange(0, 10):
    dic[str(i)] = df = pd.DataFrame(np.random.randint(0, 1000, size=(5000, 20)),
                                    columns=list('ABCDEFGHIJKLMNOPQRST'))
df_out = pd.DataFrame(index = df.index)
for i in np.arange(0, 10):
    df_out['A_'+str(i)] = dic[str(i)]['A'].astype('int')
    df_out['D_'+str(i)] = dic[str(i)]['D'].astype('int')
    df_out['H_'+str(i)] = dic[str(i)]['H'].astype('int')
    df_out['I_'+str(i)] = dic[str(i)]['I'].astype('int')
    df_out['M_'+str(i)] = dic[str(i)]['M'].astype('int')
    df_out['O_'+str(i)] = dic[str(i)]['O'].astype('int')
    df_out['Q_'+str(i)] = dic[str(i)]['Q'].astype('int')
    df_out['R_'+str(i)] = dic[str(i)]['R'].astype('int')
    df_out['S_'+str(i)] = dic[str(i)]['S'].astype('int')
    df_out['T_'+str(i)] = dic[str(i)]['T'].astype('int')
    df_out['C_'+str(i)] = dic[str(i)]['C'].astype('int')
You will notice that as soon as the number of columns inserted into df_out (the output) exceeds 100, I get the following warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert
many times, which has poor performance. Consider using pd.concat instead
The question is: how could I use
pd.concat()
and still have the custom column names that depend on the dictionary keys?
IMPORTANT: I would still like to keep a specific column selection, not all of the columns, like in the example: A, D, H, I, etc…
Many thanks for your help !
>Solution :
Use concat to build the output in one step, then flatten the resulting MultiIndex columns with map:
# Columns to keep from each DataFrame; extend this list as needed
cols = ['A', 'D']
# Concatenating a dict produces MultiIndex columns: (key, column)
df_out = pd.concat({k: v[cols] for k, v in dic.items()}, axis=1).astype('int')
# Flatten (key, column) into names like 'A_0', 'D_0', 'A_1', ...
df_out.columns = df_out.columns.map(lambda x: f'{x[1]}_{x[0]}')
print(df_out)
A_0 D_0 A_1 D_1 A_2 D_2 A_3 D_3
0 116 341 396 502 944 483 398 839
1 128 621 102 70 561 656 70 169
2 982 44 613 775 822 379 246 25
3 830 987 366 481 861 632 906 676
4 533 349 741 410 305 422 874 19
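To keep the full selection from the question (A, D, H, I, M, O, Q, R, S, T, C), only the cols list changes; everything else is identical. A self-contained sketch (smaller frames and a seeded generator are used here just to keep the demo reproducible):

```python
import pandas as pd
import numpy as np

# Build the same dictionary of DataFrames as in the question
# (5 rows instead of 5000, seeded RNG, purely for the demo)
rng = np.random.default_rng(0)
dic = {str(i): pd.DataFrame(rng.integers(0, 1000, size=(5, 20)),
                            columns=list('ABCDEFGHIJKLMNOPQRST'))
       for i in range(10)}

# The full column selection from the question
cols = ['A', 'D', 'H', 'I', 'M', 'O', 'Q', 'R', 'S', 'T', 'C']

# One concat call instead of 110 inserts -> no fragmentation warning
df_out = pd.concat({k: v[cols] for k, v in dic.items()}, axis=1).astype('int')

# Flatten the (key, column) MultiIndex into 'A_0', 'D_0', ..., 'C_9'
df_out.columns = df_out.columns.map(lambda x: f'{x[1]}_{x[0]}')

print(df_out.shape)  # 5 rows, 10 keys * 11 columns = 110 columns
```

Because the whole frame is assembled in a single concat, pandas allocates the blocks once, which is exactly what the PerformanceWarning asks for.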