Pandas – Merge DataFrame, keep non-null values on common columns, keep average on another column

I am working with two DataFrames, defined as follows:

import pandas as pd
df1 = pd.DataFrame([[1, 'a', 0.95], [2, 'b', 0.92], [3, 'c',  0.91]], columns=['id','value','similarity'])

df2 = pd.DataFrame([[3, 'c', 0.93], [4, 'd', 0.92], [5, 'e',  0.99]], columns=['id','value','similarity'])

df1

id  value  similarity
1   a      0.95
2   b      0.92
3   c      0.91

df2


id  value  similarity
3   c      0.93
4   d      0.92
5   e      0.99

I now want to combine both DataFrames on the key column id, while avoiding the creation of NaNs in the value column. For this, I use combine_first, as suggested in this answer, to get this:

df3 = df1.set_index('id').combine_first(df2.set_index('id'))
df3

id  value  similarity
1   a      0.95
2   b      0.92
3   c      0.91
4   d      0.92
5   e      0.99

However, I have an additional requirement. When a similarity value exists on both sets, I want to take their average as the new value. So for instance my desired output would be something like:

id  value  similarity
1   a      0.95
2   b      0.92
3   c      0.92 <-- ((0.91 + 0.93) / 2)
4   d      0.92
5   e      0.99

Preferably using pandas or numpy, how can I average similarities when both DataFrames have a value?

Solution:

You can concat, then groupby.mean:

out = pd.concat([df1, df2]).groupby(['id', 'value'], as_index=False).mean()

Output:

   id value  similarity
0   1     a        0.95
1   2     b        0.92
2   3     c        0.92
3   4     d        0.92
4   5     e        0.99
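As a quick sanity check, the mean only has an effect where an id appears in both frames; unique ids keep their original similarity. A minimal sketch, reusing the question's frame definitions:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 'a', 0.95], [2, 'b', 0.92], [3, 'c', 0.91]],
                   columns=['id', 'value', 'similarity'])
df2 = pd.DataFrame([[3, 'c', 0.93], [4, 'd', 0.92], [5, 'e', 0.99]],
                   columns=['id', 'value', 'similarity'])

out = pd.concat([df1, df2]).groupby(['id', 'value'], as_index=False).mean()

# id 3 is the only duplicate, so only its similarity is averaged
print(round(out.loc[out['id'] == 3, 'similarity'].item(), 2))  # 0.92
```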

If you want to have the combine_first behavior for some columns and the average for others, you could use groupby.agg with first/mean depending on the columns:

df1['x'] = 'one'
df2['x'] = 'two'

out = (pd.concat([df1, df2]).groupby(['id', 'value'], as_index=False)
         .agg({'similarity': 'mean', 'x': 'first'})
      )

Output:

   id value  similarity    x
0   1     a        0.95  one
1   2     b        0.92  one
2   3     c        0.92  one
3   4     d        0.92  two
4   5     e        0.99  two
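Note that 'first' depends on the concat order: for the shared id 3, the row from whichever frame is listed first wins. A minimal sketch (reusing the question's frames) showing that reversing the order flips x for id 3:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 'a', 0.95], [2, 'b', 0.92], [3, 'c', 0.91]],
                   columns=['id', 'value', 'similarity'])
df2 = pd.DataFrame([[3, 'c', 0.93], [4, 'd', 0.92], [5, 'e', 0.99]],
                   columns=['id', 'value', 'similarity'])
df1['x'] = 'one'
df2['x'] = 'two'

# With df2 listed first, 'first' now picks df2's x for the shared id 3,
# because groupby preserves row order within each group
out = (pd.concat([df2, df1])
         .groupby(['id', 'value'], as_index=False)
         .agg({'similarity': 'mean', 'x': 'first'}))

print(out.loc[out['id'] == 3, 'x'].item())  # two
```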

If you instead want a weighted average (here weighting df1's rows at 0.7 and df2's rows at 0.3):

import numpy as np

out = (pd.concat([df1.assign(w=0.7), df2.assign(w=0.3)])
         .groupby(['id', 'value'])
         .apply(lambda x: np.average(x['similarity'], weights=x['w']),
                include_groups=False)
         .reset_index(name='weighted_avg')
      )

Output:

   id value  weighted_avg
0   1     a         0.950
1   2     b         0.920
2   3     c         0.916
3   4     d         0.920
4   5     e         0.990
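To see where the 0.916 for id 3 comes from, the weighted mean of the two similarity values can be checked by hand, with df1's 0.91 weighted at 0.7 and df2's 0.93 at 0.3:

```python
import numpy as np

# (0.7 * 0.91 + 0.3 * 0.93) / (0.7 + 0.3) = 0.637 + 0.279 = 0.916
w_avg = np.average([0.91, 0.93], weights=[0.7, 0.3])
print(round(w_avg, 3))  # 0.916
```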

or:

g = (pd.concat([df1.assign(w=0.7), df2.assign(w=0.3)])
       .eval('weighted_avg = similarity * w')
       .groupby(['id', 'value'])
    )
out = (g.first()
        .assign(weighted_avg=g['weighted_avg'].sum()/g['w'].sum())
        .reset_index()
      )

Output:

   id value  similarity    w  weighted_avg
0   1     a        0.95  0.7         0.950
1   2     b        0.92  0.7         0.920
2   3     c        0.91  0.7         0.916
3   4     d        0.92  0.3         0.920
4   5     e        0.99  0.3         0.990