Merging multiple dataframes together with pandas

October 18, 2023

I have multiple dataframes that are structured as follows:

Molecule Name	Molecular weight	Score	Population Score	Population Error
A	100	12	15	0.2
B	205	0.4	17	0.8
C	367	17	11	0.82
D	510	9	19.6	0.1

Molecule Name	Molecular weight	Score	Population Score	Population Error
A	100	20	15	0.2
B	205	16	17	0.8
E	367	11	11	0.82
F	780	11	12	0.5

Imagine I had multiple dataframes where the molecule names, weights, and score can vary. But the population score and population error are the same across all dataframes.

I want to merge all the dataframes. The unique identifer is the Molecular weight. So if there are multiple dataframes which contain a molecular weight of 100, the scores of each get saved – take note of molecule C from the first dataframe and molecule E from the second – they have the same molecular weight so should be combined. I essentially want something that comes out like this:

Molecule Name	Molecular weight	Score1	Score2	Population Score	Population Error
A	100	12	20	15	0.2
B	205	0.4	16	17	0.8
C	367	17	11	11	0.82
D	510	9	0	19.6	0.1
F	780	0	11	12	0.5

I have tried just simple merges but it always creates a new row for Molecule name E. I want to merge on Molecular weight. If the Molecular weight isn’t in the original data frame then add a new row and populate the 0s. I hope this makes sense. Any and all help will be appreciated.

>Solution :

You could use a custom concat with de-duplication of the Score columns and groupby.first to get the first molecule name:

dfs = [df1, df2] # could be more than 2 input DataFrames

out = (pd.concat([d.set_index(['Molecular weight', 'Population Score', 'Population Error'])
                   .rename(columns={'Score': f'Score{i}'})
                  for i, d in enumerate(dfs, start=1)], axis=1)
         .groupby(level=0, axis=1).first()
         .fillna(0, downcast='infer')
         .reset_index()
      )

Variant for future versions of pandas:

dfs = [df1, df2]

out = (pd.concat([d.set_index(['Molecular weight', 'Population Score', 'Population Error'])
                   .rename(columns={'Score': f'Score{i}'})
                  for i, d in enumerate(dfs, start=1)], axis=1)
         .T.groupby(level=0).first().T
         .fillna(0)
         .reset_index()
      )

Output:

   Molecular weight  Population Score  Population Error Molecule Name  Score1  Score2
0               100              15.0              0.20             A    12.0      20
1               205              17.0              0.80             B     0.4      16
2               367              11.0              0.82             C    17.0      11
3               510              19.6              0.10             D     9.0       0
4               780              12.0              0.50             F     0.0      11