Python Pandas DataFrame Merge on Columns with Overwrite

February 23, 2023

Is there a way to merge two Pandas DataFrames, by matching on (and retaining) supplied columns, but overwriting all the rest?

For example:

import pandas as pd

df1 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df1.loc[0] = ["Bob", "Male", "21", "2023-01-01", "2023-01-01"]
df1.loc[1] = ["Frank", "Male", "22", "2023-02-01", "2023-02-01"]
df1.loc[2] = ["Steve", "Male", "23", "2023-03-01", "2023-03-01"]
df1.loc[3] = ["John", "Male", "24", "2023-04-01", "2023-04-01"]

df2 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df2.loc[0] = ["Steve", "Male", "23", "2022-11-01", "2022-11-02"]
df2.loc[1] = ["Simon", "Male", "23", "2023-03-01", "2023-03-02"]
df2.loc[2] = ["Gary", "Male", "24", "2023-04-01", "2023-04-02"]
df2.loc[3] = ["Bob", "Male", "21", "2022-12-01", "2022-12-01"]

>>> df1
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2023-01-01   2023-01-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2023-03-01   2023-03-01
3   John   Male  24  2023-04-01   2023-04-01

>>> df2
    Name Gender Age   LastLogin LastPurchase
0  Steve   Male  23  2022-11-01   2022-11-02
1  Simon   Male  23  2023-03-01   2023-03-02
2   Gary   Male  24  2023-04-01   2023-04-02
3    Bob   Male  21  2022-12-01   2022-12-01

What I’d like is to end up with is df1 updated with values from df2, if the "Name", "Gender" and "Age" columns match. But without caring what the other columns are, so I’d end up with this:

>>> df1
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01 # Updated last two columns from df2
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02 # Updated last two columns from df2
3   John   Male  24  2023-04-01   2023-04-01

I can do a merge like this:

>>> df3 = df1.merge(df2, on=["Name", "Gender", "Age"], how='left')

But then I have to manually extract data from and drop the new columns created from the merge, using their names:

>>> df3['LastLogin'] = df3['LastLogin_y'].fillna(df3['LastLogin_x'])
>>> df3['LastPurchase'] = df3['LastPurchase_y'].fillna(df3['LastPurchase_x'])
>>> df3.drop(['LastLogin_x', 'LastLogin_y'], axis=1, inplace=True)
>>> df3.drop(['LastPurchase_x', 'LastPurchase_y'], axis=1, inplace=True)
>>> 
>>> df3
    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02
3   John   Male  24  2023-04-01   2023-04-01

I’m trying to avoid this, as I need a generic way to update batches of data, and I don’t know all their column names (just the ones I want to match on).

>Solution :

You can avoid the _x/_y columns by slicing only the the merging keys in df1 for merge, then fillna/combine_first with the original:

cols = ["Name", "Gender", "Age"]

df3 = df1[cols].merge(df2, how='left').fillna(df1)

A more convoluted approach using indexes:

df3 = (df2.set_index(cols)
          .combine_first(df1.set_index(cols))
          .reindex(df1[cols]).reset_index()
       )

Output:

    Name Gender Age   LastLogin LastPurchase
0    Bob   Male  21  2022-12-01   2022-12-01
1  Frank   Male  22  2023-02-01   2023-02-01
2  Steve   Male  23  2022-11-01   2022-11-02
3   John   Male  24  2023-04-01   2023-04-01