Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to merge duplicate rows of a dataframe containing np.nan in different columns to one row with just the not nan values?

I have a pandas dataframe which has duplicate rows of the column symbol and looks like this:

df = 

| Index |  Symbol  | Column A | Column B |  Status  |
|-------|----------|----------|----------|----------|
|   0   |    x     |     a    |  np.nan  |  Default |
|   1   |    y     |     d    |     c    |  Default |
|   2   |    x     |  np.nan  |     e    |  Default |


I want to have all the duplicate rows merged to unique rows, with the np.nan values in Column A and Column B replaced by values from the other duplicate rows, and the Status value of the row that was merged/changed to be set to Merged.
I want the output to look like this

| Index |  Symbol  | Column A | Column B |  Status  |
|-------|----------|----------|----------|----------|
|   0   |    x     |     a    |     e    |  Merged  |
|   1   |    y     |     d    |     c    |  Default |



I have tried to separate the dataframes into duplicate and merge them again

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

df1 = df.loc[df['Symbol'].duplicated()]
then
df = pd.concat([df, df1], join = 'inner', axis = 1)

Other than this I have no idea how to proceed.

>Solution :

You can group by symbol and use bfill to replace the missing values. Check if the group contains NaN’s and change the status to Merged. Finally, return the first row of the group.

Code:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Symbol': ['x', 'y', 'x'],
    'Column A': ['a', 'd', np.nan],
    'Column B': [np.nan, 'c', 'e'],
    'Status': ['Default', 'Default', 'Default']
})

def merge_duplicates(group):
    if group.isna().sum(axis=1).sum() > 0:
        group.loc[group.index[0], 'Status'] = 'Merged'
    group = group.fillna(method='bfill')
    return group.iloc[0]


merged_df = df.groupby('Symbol').apply(merge_duplicates)
merged_df = merged_df.reset_index(drop=True)

merged_df

Output:

Symbol Column A Column B Status
0 x a e Merged
1 y d c Default
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading