Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Merging multiple dataframes together with pandas

I have multiple dataframes that are structured as follows:

Molecule Name Molecular weight Score Population Score Population Error
A 100 12 15 0.2
B 205 0.4 17 0.8
C 367 17 11 0.82
D 510 9 19.6 0.1
Molecule Name Molecular weight Score Population Score Population Error
A 100 20 15 0.2
B 205 16 17 0.8
E 367 11 11 0.82
F 780 11 12 0.5

Imagine I had multiple dataframes where the molecule names, weights, and score can vary. But the population score and population error are the same across all dataframes.

I want to merge all the dataframes. The unique identifer is the Molecular weight. So if there are multiple dataframes which contain a molecular weight of 100, the scores of each get saved – take note of molecule C from the first dataframe and molecule E from the second – they have the same molecular weight so should be combined. I essentially want something that comes out like this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Molecule Name Molecular weight Score1 Score2 Population Score Population Error
A 100 12 20 15 0.2
B 205 0.4 16 17 0.8
C 367 17 11 11 0.82
D 510 9 0 19.6 0.1
F 780 0 11 12 0.5

I have tried just simple merges but it always creates a new row for Molecule name E. I want to merge on Molecular weight. If the Molecular weight isn’t in the original data frame then add a new row and populate the 0s. I hope this makes sense. Any and all help will be appreciated.

>Solution :

You could use a custom concat with de-duplication of the Score columns and groupby.first to get the first molecule name:

dfs = [df1, df2] # could be more than 2 input DataFrames

out = (pd.concat([d.set_index(['Molecular weight', 'Population Score', 'Population Error'])
                   .rename(columns={'Score': f'Score{i}'})
                  for i, d in enumerate(dfs, start=1)], axis=1)
         .groupby(level=0, axis=1).first()
         .fillna(0, downcast='infer')
         .reset_index()
      )

Variant for future versions of pandas:

dfs = [df1, df2]

out = (pd.concat([d.set_index(['Molecular weight', 'Population Score', 'Population Error'])
                   .rename(columns={'Score': f'Score{i}'})
                  for i, d in enumerate(dfs, start=1)], axis=1)
         .T.groupby(level=0).first().T
         .fillna(0)
         .reset_index()
      )

Output:

   Molecular weight  Population Score  Population Error Molecule Name  Score1  Score2
0               100              15.0              0.20             A    12.0      20
1               205              17.0              0.80             B     0.4      16
2               367              11.0              0.82             C    17.0      11
3               510              19.6              0.10             D     9.0       0
4               780              12.0              0.50             F     0.0      11
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading