Is there a way to merge two Pandas DataFrames, by matching on (and retaining) supplied columns, but overwriting all the rest?
For example:
import pandas as pd
df1 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df1.loc[0] = ["Bob", "Male", "21", "2023-01-01", "2023-01-01"]
df1.loc[1] = ["Frank", "Male", "22", "2023-02-01", "2023-02-01"]
df1.loc[2] = ["Steve", "Male", "23", "2023-03-01", "2023-03-01"]
df1.loc[3] = ["John", "Male", "24", "2023-04-01", "2023-04-01"]
df2 = pd.DataFrame(columns=["Name", "Gender", "Age", "LastLogin", "LastPurchase"])
df2.loc[0] = ["Steve", "Male", "23", "2022-11-01", "2022-11-02"]
df2.loc[1] = ["Simon", "Male", "23", "2023-03-01", "2023-03-02"]
df2.loc[2] = ["Gary", "Male", "24", "2023-04-01", "2023-04-02"]
df2.loc[3] = ["Bob", "Male", "21", "2022-12-01", "2022-12-01"]
>>> df1
Name Gender Age LastLogin LastPurchase
0 Bob Male 21 2023-01-01 2023-01-01
1 Frank Male 22 2023-02-01 2023-02-01
2 Steve Male 23 2023-03-01 2023-03-01
3 John Male 24 2023-04-01 2023-04-01
>>> df2
Name Gender Age LastLogin LastPurchase
0 Steve Male 23 2022-11-01 2022-11-02
1 Simon Male 23 2023-03-01 2023-03-02
2 Gary Male 24 2023-04-01 2023-04-02
3 Bob Male 21 2022-12-01 2022-12-01
What I’d like is to end up with is df1
updated with values from df2
, if the "Name"
, "Gender"
and "Age"
columns match. But without caring what the other columns are, so I’d end up with this:
>>> df1
Name Gender Age LastLogin LastPurchase
0 Bob Male 21 2022-12-01 2022-12-01 # Updated last two columns from df2
1 Frank Male 22 2023-02-01 2023-02-01
2 Steve Male 23 2022-11-01 2022-11-02 # Updated last two columns from df2
3 John Male 24 2023-04-01 2023-04-01
I can do a merge like this:
>>> df3 = df1.merge(df2, on=["Name", "Gender", "Age"], how='left')
But then I have to manually extract data from and drop the new columns created from the merge, using their names:
>>> df3['LastLogin'] = df3['LastLogin_y'].fillna(df3['LastLogin_x'])
>>> df3['LastPurchase'] = df3['LastPurchase_y'].fillna(df3['LastPurchase_x'])
>>> df3.drop(['LastLogin_x', 'LastLogin_y'], axis=1, inplace=True)
>>> df3.drop(['LastPurchase_x', 'LastPurchase_y'], axis=1, inplace=True)
>>>
>>> df3
Name Gender Age LastLogin LastPurchase
0 Bob Male 21 2022-12-01 2022-12-01
1 Frank Male 22 2023-02-01 2023-02-01
2 Steve Male 23 2022-11-01 2022-11-02
3 John Male 24 2023-04-01 2023-04-01
I’m trying to avoid this, as I need a generic way to update batches of data, and I don’t know all their column names (just the ones I want to match on).
>Solution :
You can avoid the _x
/_y
columns by slicing only the the merging keys in df1
for merge
, then fillna
/combine_first
with the original:
cols = ["Name", "Gender", "Age"]
df3 = df1[cols].merge(df2, how='left').fillna(df1)
A more convoluted approach using indexes:
df3 = (df2.set_index(cols)
.combine_first(df1.set_index(cols))
.reindex(df1[cols]).reset_index()
)
Output:
Name Gender Age LastLogin LastPurchase
0 Bob Male 21 2022-12-01 2022-12-01
1 Frank Male 22 2023-02-01 2023-02-01
2 Steve Male 23 2022-11-01 2022-11-02
3 John Male 24 2023-04-01 2023-04-01