I have a pandas dataframe which has duplicate rows of the column symbol and looks like this:
df =
| Index | Symbol | Column A | Column B | Status |
|-------|----------|----------|----------|----------|
| 0 | x | a | np.nan | Default |
| 1 | y | d | c | Default |
| 2 | x | np.nan | e | Default |
I want to have all the duplicate rows merged to unique rows, with the np.nan values in Column A and Column B replaced by values from the other duplicate rows, and the Status value of the row that was merged/changed to be set to Merged.
I want the output to look like this
| Index | Symbol | Column A | Column B | Status |
|-------|----------|----------|----------|----------|
| 0 | x | a | e | Merged |
| 1 | y | d | c | Default |
I have tried to separate the dataframes into duplicate and merge them again
df1 = df.loc[df['Symbol'].duplicated()]
then
df = pd.concat([df, df1], join = 'inner', axis = 1)
Other than this I have no idea how to proceed.
>Solution :
You can group by symbol and use bfill to replace the missing values. Check if the group contains NaN’s and change the status to Merged. Finally, return the first row of the group.
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'Symbol': ['x', 'y', 'x'],
'Column A': ['a', 'd', np.nan],
'Column B': [np.nan, 'c', 'e'],
'Status': ['Default', 'Default', 'Default']
})
def merge_duplicates(group):
if group.isna().sum(axis=1).sum() > 0:
group.loc[group.index[0], 'Status'] = 'Merged'
group = group.fillna(method='bfill')
return group.iloc[0]
merged_df = df.groupby('Symbol').apply(merge_duplicates)
merged_df = merged_df.reset_index(drop=True)
merged_df
Output:
| Symbol | Column A | Column B | Status | |
|---|---|---|---|---|
| 0 | x | a | e | Merged |
| 1 | y | d | c | Default |