I’m fairly sure I’m missing a version of this question elsewhere on the site, but cannot for the life of me find it. I can’t be the first person to have this issue.
In essence, I’m looking for a function that would merge pandas Dataframes in the equivalent way that a nested version of dict.merge would. If A and B are dictionaries of dictionaries, then we’d get C = {k: A.get(k, {}) | B.get(k, {}) for k in set().union(*[A, B])}
Concretely, I have two Pandas datasets A and B, with different sets of indices and columns (but not necessarily disjoint).
I would like a dataset C, whose indices are the union of A and B’s indices and whose columns are the union of A and B’s columns. The values are taken from B, if B has a non-NaN value for that particular index/column, otherwise it uses A.
I want a function SuperMerge: C = A.SuperMerge(B). Assume the dataframes have some method df.get(i, c, default=pd.NA) that returns df[i, c] if it exists otherwise it returns default.
SuperMerge works like this:
C[i,c] = B.get([i, c]) if not B.get([i, c]).isna() else A.get([i, c])
Based on the lack of questions that address this, and the lack of Pandas functionality, I assume this is extremely rare use-case, but it’s effectively how Python’s dict update works: the updated dictionary now has the union of keys from both dicts, taking values from the right dict if available else it takes the one from the left dict.
For example:
df1 = pd.DataFrame({'col1': ['a', 'b'], 'col2': [1, 2], 'col4': ['col4_x', 'col4_y']}, index=['x', 'y'])
col1 col2 col4
x a 1 col4_x
y b 2 col4_y
df2 = pd.DataFrame({'col1': ['ALT', 'c'], 'col3': ['ONE', 'THREE'], 'col4': [pd.NA, 'col4_z']}, index=['x', 'z'])
col1 col3 col4
x ALT ONE <NA>
z c THREE col4_z
df1.SuperMerge(df2)
col1 col2 col3 col4
x ALT 1 ONE col4_x
y b 2 <NA> col4_y
z c <NA> THREE col4_z
>Solution :
Supermerge is essentially combine_first.
From the docs:
Combine two DataFrame objects by filling null values in one DataFrame
with non-null values from other DataFrame. The row and column indexes
of the resulting DataFrame will be the union of the two. The resulting
dataframe contains the ‘first’ dataframe values and overrides the
second one values where both first.loc[index, col] and
second.loc[index, col] are not missing values, upon calling
first.combine_first(second).
df2.combine_first(df1)
col1 col2 col3 col4
x ALT 1.0 ONE col4_x
y b 2.0 NaN col4_y
z c NaN THREE col4_z