(This post does not explain what was requested below. There is the issue of merge before concatenation or sum, using these functions together to fill a column is the question.)
I have two dataframes that look like these (obviously these are examples):
data1 = {'Name': ['Alex','Alex','Cristiano','Cristiano','Fernando','Jonas','William'], 'Color': ['Blue','Red','Black','Blue','Yellow','Pink','Green'],
'Codes': ['','','','','','',''], 'Values': [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}
data2 = {'Name': ['Jonas','Alex','Cristiano','Cristiano','Alex'], 'Color': ['Pink','Red','Black','Blue','Red'],
'Codes': ['1456','1450','1453','1530','1459'], 'Values': [12000.00,5000.50,78000.00,2000.00,1500.00]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
Are so:
df1:
Name Color Codes Values
0 Alex Blue NaN
1 Alex Red NaN
2 Cristiano Black NaN
3 Cristiano Blue NaN
4 Fernando Yellow NaN
5 Jonas Pink NaN
6 William Green NaN
df2:
Name Color Codes Values
0 Jonas Pink 1456 12000.0
1 Alex Red 1450 5000.5
2 Cristiano Black 1453 78000.0
3 Cristiano Blue 1530 2000.0
4 Alex Red 1459 1500.0
I want to fill the ‘Codes’ and ‘Values’ columns of df1 with the corresponding values ​​in df2 (‘Code’ and ‘Values’), doing the matching of the columns ‘Name’ and ‘Color’.
The problem is that there is more than one correspondence, for the column ‘Codes’ I want the concatenation of strings in ‘Code’ (separating with a comma), for the column ‘Values’ I want the sum of the correspondences.
I tried using merge, but this second part I don’t know how to implement.
To be clearer, the desired output of df1 is this:
Name Color Codes Values
0 Alex Blue NaN
1 Alex Red 1450 , 1459 6500.5
2 Cristiano Black 1453 78000.0
3 Cristiano Blue 1530 2000.0
4 Fernando Yellow NaN
5 Jonas Pink 1456 12000.0
6 William Green NaN
>Solution :
You can use pd.merge() using df1 as it is and a grouped version of df2 (grouped by Name and Color). The aggregation should be list on Codes and sum on Values:
pd.merge(df1,
df2.groupby(['Name','Color']).agg({'Codes':list,'Values':'sum'}).reset_index(),
how='left',
on=['Name','Color'],
suffixes=('_x', ''))\
.drop(['Codes_x','Values_x'],axis=1)
Which prints back:
Name Color Codes Values
0 Alex Blue NaN NaN
1 Alex Red [1450, 1459] 6500.5
2 Cristiano Black [1453] 78000.0
3 Cristiano Blue [1530] 2000.0
4 Fernando Yellow NaN NaN
5 Jonas Pink [1456] 12000.0
6 William Green NaN NaN