Here is a generic code representing what is happening in my script:
import pandas as pd
import numpy as np
dic = {}
for i in np.arange(0, 10):
    dic[str(i)] = df = pd.DataFrame(np.random.randint(0, 1000, size=(5000, 20)),
                                    columns=list('ABCDEFGHIJKLMNOPQRST'))
df_out = pd.DataFrame(index = df.index)
for i in np.arange(0, 10):
    df_out['A_'+str(i)] = dic[str(i)]['A'].astype('int')
    df_out['D_'+str(i)] = dic[str(i)]['D'].astype('int')
    df_out['H_'+str(i)] = dic[str(i)]['H'].astype('int')
    df_out['I_'+str(i)] = dic[str(i)]['I'].astype('int')
    df_out['M_'+str(i)] = dic[str(i)]['M'].astype('int')
    df_out['O_'+str(i)] = dic[str(i)]['O'].astype('int')
    df_out['Q_'+str(i)] = dic[str(i)]['Q'].astype('int')
    df_out['R_'+str(i)] = dic[str(i)]['R'].astype('int')
    df_out['S_'+str(i)] = dic[str(i)]['S'].astype('int')
    df_out['T_'+str(i)] = dic[str(i)]['T'].astype('int')
    df_out['C_'+str(i)] = dic[str(i)]['C'].astype('int')
You will notice that as soon as the number of columns inserted into df_out (the output) exceeds 100, I get the following warning:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert
many times, which has poor performance. Consider using pd.concat instead
The question is: how could I use
pd.concat()
and still have the custom column names that depend on the dictionary keys?
IMPORTANT: I would still like to keep a specific column selection, not all of the columns, like in the example: A, D, H, I, etc…
Many thanks for your help !
>Solution :
Use concat to build the output in one step, then flatten the resulting MultiIndex columns with map:
# Columns to keep from each DataFrame; extend this list as needed
cols = ['A', 'D']
# Concatenating a dict produces MultiIndex columns: (key, column)
df_out = pd.concat({k: v[cols] for k, v in dic.items()}, axis=1).astype('int')
# Flatten (key, column) into names like 'A_0', 'D_0', 'A_1', ...
df_out.columns = df_out.columns.map(lambda x: f'{x[1]}_{x[0]}')
print(df_out)
A_0 D_0 A_1 D_1 A_2 D_2 A_3 D_3
0 116 341 396 502 944 483 398 839
1 128 621 102 70 561 656 70 169
2 982 44 613 775 822 379 246 25
3 830 987 366 481 861 632 906 676
4 533 349 741 410 305 422 874 19
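To keep the full selection from the question (A, D, H, I, M, O, Q, R, S, T, C), only the cols list changes; everything else is identical. A self-contained sketch (smaller frames and a seeded generator are used here just to keep the demo reproducible):

```python
import pandas as pd
import numpy as np

# Build the same dictionary of DataFrames as in the question
# (5 rows instead of 5000, seeded RNG, purely for the demo)
rng = np.random.default_rng(0)
dic = {str(i): pd.DataFrame(rng.integers(0, 1000, size=(5, 20)),
                            columns=list('ABCDEFGHIJKLMNOPQRST'))
       for i in range(10)}

# The full column selection from the question
cols = ['A', 'D', 'H', 'I', 'M', 'O', 'Q', 'R', 'S', 'T', 'C']

# One concat call instead of 110 inserts -> no fragmentation warning
df_out = pd.concat({k: v[cols] for k, v in dic.items()}, axis=1).astype('int')

# Flatten the (key, column) MultiIndex into 'A_0', 'D_0', ..., 'C_9'
df_out.columns = df_out.columns.map(lambda x: f'{x[1]}_{x[0]}')

print(df_out.shape)  # 5 rows, 10 keys * 11 columns = 110 columns
```

Because the whole frame is assembled in a single concat, pandas allocates the blocks once, which is exactly what the PerformanceWarning asks for.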