Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Merge pandas datasets with different indices and columns

I’m fairly sure I’m missing a version of this question elsewhere on the site, but cannot for the life of me find it. I can’t be the first person to have this issue.

In essence, I’m looking for a function that would merge pandas Dataframes in the equivalent way that a nested version of dict.merge would. If A and B are dictionaries of dictionaries, then we’d get C = {k: A.get(k, {}) | B.get(k, {}) for k in set().union(*[A, B])}

Concretely, I have two Pandas datasets A and B, with different sets of indices and columns (but not necessarily disjoint).

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

I would like a dataset C, whose indices are the union of A and B’s indices and whose columns are the union of A and B’s columns. The values are taken from B, if B has a non-NaN value for that particular index/column, otherwise it uses A.

I want a function SuperMerge: C = A.SuperMerge(B). Assume the dataframes have some method df.get(i, c, default=pd.NA) that returns df[i, c] if it exists otherwise it returns default.

SuperMerge works like this:

C[i,c] = B.get([i, c]) if not B.get([i, c]).isna() else A.get([i, c])

Based on the lack of questions that address this, and the lack of Pandas functionality, I assume this is extremely rare use-case, but it’s effectively how Python’s dict update works: the updated dictionary now has the union of keys from both dicts, taking values from the right dict if available else it takes the one from the left dict.

For example:

df1 = pd.DataFrame({'col1': ['a', 'b'], 'col2': [1, 2], 'col4': ['col4_x', 'col4_y']}, index=['x', 'y'])

  col1  col2    col4
x    a     1  col4_x
y    b     2  col4_y
df2 = pd.DataFrame({'col1': ['ALT', 'c'], 'col3': ['ONE', 'THREE'], 'col4': [pd.NA, 'col4_z']}, index=['x', 'z'])

  col1   col3    col4
x  ALT    ONE    <NA>
z    c  THREE  col4_z
df1.SuperMerge(df2)

  col1  col2   col3    col4
x  ALT     1    ONE  col4_x
y    b     2   <NA>  col4_y
z    c  <NA>  THREE  col4_z

>Solution :

Supermerge is essentially combine_first.

From the docs:

Combine two DataFrame objects by filling null values in one DataFrame
with non-null values from other DataFrame. The row and column indexes
of the resulting DataFrame will be the union of the two. The resulting
dataframe contains the ‘first’ dataframe values and overrides the
second one values where both first.loc[index, col] and
second.loc[index, col] are not missing values, upon calling
first.combine_first(second).

df2.combine_first(df1)

  col1  col2   col3    col4
x  ALT   1.0    ONE  col4_x
y    b   2.0    NaN  col4_y
z    c   NaN  THREE  col4_z
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading