How to merge duplicate rows of a dataframe containing np.nan in different columns to one row with just the not nan values?

by MR

February 28, 2023

I have a pandas DataFrame with duplicate values in the Symbol column. It looks like this:

df = 

| Index |  Symbol  | Column A | Column B |  Status  |
|-------|----------|----------|----------|----------|
|   0   |    x     |     a    |  np.nan  |  Default |
|   1   |    y     |     d    |     c    |  Default |
|   2   |    x     |  np.nan  |     e    |  Default |

I want to merge the duplicate rows into one row per Symbol, with the np.nan values in Column A and Column B replaced by the values from the other duplicate rows, and the Status of each merged row set to Merged.
I want the output to look like this:

| Index |  Symbol  | Column A | Column B |  Status  |
|-------|----------|----------|----------|----------|
|   0   |    x     |     a    |     e    |  Merged  |
|   1   |    y     |     d    |     c    |  Default |

I tried splitting the duplicated rows off into a separate DataFrame and concatenating them back:

df1 = df.loc[df['Symbol'].duplicated()]
then
df = pd.concat([df, df1], join = 'inner', axis = 1)

Other than this I have no idea how to proceed.

Solution:

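You can group by Symbol and use bfill to back-fill each missing value from the next row in the same group. If the group contains any NaN values, set the Status of its first row to Merged. Finally, return the first row of the group, which by then holds the first non-missing value of every column.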
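To see why the first row is enough after back-filling, here is a minimal sketch on a single group. The two-row frame is a hypothetical stand-in for the rows with Symbol x from the question, and the names `group`, `has_gaps`, `filled` and `first_row` are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the two rows that share Symbol 'x'.
group = pd.DataFrame({
    'Column A': ['a', np.nan],
    'Column B': [np.nan, 'e'],
})

# True if any cell in the group is missing.
has_gaps = group.isna().any().any()

# bfill copies the next non-missing value upward within each column,
# so gaps in the first row are filled from the rows below it.
filled = group.bfill()

# The first row now carries the first non-missing value of every column.
first_row = filled.iloc[0]
print(has_gaps)
print(first_row.to_dict())
```

Note that bfill only looks downward: the missing value in the last row of Column A stays missing, which does not matter here because only the first row is kept.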

Code:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Symbol': ['x', 'y', 'x'],
    'Column A': ['a', 'd', np.nan],
    'Column B': [np.nan, 'c', 'e'],
    'Status': ['Default', 'Default', 'Default']
})

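def merge_duplicates(group):
    # Work on a copy so the original frame is not modified through the group.
    group = group.copy()
    # Flag the group as merged if any of its cells is missing.
    if group.isna().any().any():
        group.loc[group.index[0], 'Status'] = 'Merged'
    # fillna(method='bfill') is deprecated in recent pandas; bfill() is equivalent.
    group = group.bfill()
    return group.iloc[0]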


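# Select the non-key columns explicitly so the result does not depend on
# whether the installed pandas version passes the grouping column to apply().
merged_df = df.groupby('Symbol')[['Column A', 'Column B', 'Status']].apply(merge_duplicates)
merged_df = merged_df.reset_index()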

merged_df

Output: