Is there a way to merge two Pandas DataFrames, by matching on (and retaining) supplied columns, but overwriting all the rest?
For example:
import pandas as pd
df1 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df1.loc[0] = ["Bob", "Male", "21", "2023-01-01", "2023-01-01"]
df1.loc[1] = ["Frank", "Male", "22", "2023-02-01", "2023-02-01"]
df1.loc[2] = ["Steve", "Male", "23", "2023-03-01", "2023-03-01"]
df1.loc[3] = ["John", "Male", "24", "2023-04-01", "2023-04-01"]
df2 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df2.loc[0] = ["Steve", "Male", "23", "2022-11-01", "2022-11-02"]
df2.loc[1] = ["Simon", "Male", "23", "2023-03-01", "2023-03-02"]
df2.loc[2] = ["Gary", "Male", "24", "2023-04-01", "2023-04-02"]
df2.loc[3] = ["Bob", "Male", "21", "2022-12-01", "2022-12-01"]
>>> df1
Name Gender Age LastLogin LastPurchase
0 Bob Male 21 2023-01-01 2023-01-01
1 Frank Male 22 2023-02-01 2023-02-01
2 Steve Male 23 2023-03-01 2023-03-01
3 John Male 24 2023-04-01 2023-04-01
>>> df2
Name Gender Age LastLogin LastPurchase
0 Steve Male 23 2022-11-01 2022-11-02
1 Simon Male 23 2023-03-01 2023-03-02
2 Gary Male 24 2023-04-01 2023-04-02
3 Bob Male 21 2022-12-01 2022-12-01
What I’d like is to end up with is df1 updated with values from df2, if the "Name", "Gender" and "Age" columns match. But without caring what the other columns are, so I’d end up with this:
>>> df1
Name Gender Age LastLogin LastPurchase
0 Bob Male 21 2022-12-01 2022-12-01 # Updated last two columns from df2
1 Frank Male 22 2023-02-01 2023-02-01
2 Steve Male 23 2022-11-01 2022-11-02 # Updated last two columns from df2
3 John Male 24 2023-04-01 2023-04-01
I can do a merge like this:
>>> df3 = df1.merge(df2, on=["Name", "Gender", "Age"], how='left')
But then I have to manually extract data from and drop the new columns created from the merge, using their names:
>>> df3['LastLogin'] = df3['LastLogin_y'].fillna(df3['LastLogin_x'])
>>> df3['LastPurchase'] = df3['LastPurchase_y'].fillna(df3['LastPurchase_x'])
>>> df3.drop(['LastLogin_x', 'LastLogin_y'], axis=1, inplace=True)
>>> df3.drop(['LastPurchase_x', 'LastPurchase_y'], axis=1, inplace=True)
>>>
>>> df3
Name Gender Age LastLogin LastPurchase
0 Bob Male 21 2022-12-01 2022-12-01
1 Frank Male 22 2023-02-01 2023-02-01
2 Steve Male 23 2022-11-01 2022-11-02
3 John Male 24 2023-04-01 2023-04-01
I’m trying to avoid this, as I need a generic way to update batches of data, and I don’t know all their column names (just the ones I want to match on).
>Solution :
You can avoid the _x/_y columns by slicing only the the merging keys in df1 for merge, then fillna/combine_first with the original:
cols = ["Name", "Gender", "Age"]
df3 = df1[cols].merge(df2, how='left').fillna(df1)
A more convoluted approach using indexes:
df3 = (df2.set_index(cols)
.combine_first(df1.set_index(cols))
.reindex(df1[cols]).reset_index()
)
Output:
Name Gender Age LastLogin LastPurchase
0 Bob Male 21 2022-12-01 2022-12-01
1 Frank Male 22 2023-02-01 2023-02-01
2 Steve Male 23 2022-11-01 2022-11-02
3 John Male 24 2023-04-01 2023-04-01