I am trying to convert datafarame to desired output format with requirements mentioned below.
Provided requirements:
- Each row can only keep one not Nan value (except Trh1 and Trh2)
- I want to avoid methods that iterate over each row for performance reasons.
- I have only included four columns, for example, in a real scenario there are many more columns to share
Example:
Input:
| Index | Schema | Column | Trh1 | Trh2 | Trh3 | Trh4 |
|---|---|---|---|---|---|---|
| 0 | schema_1 | col_1 | NaN | 0.01 | NaN | NaN |
| 1 | schema_2 | col_2 | 0.02 | 0.03 | NaN | NaN |
| 2 | schema_3 | col_3 | 0.03 | 0.04 | 0.05 | NaN |
| 3 | schema_4 | col_4 | NaN | NaN | 0.06 | 0.07 |
Expected output:
| Index | Schema | Column | Trh1 | Trh2 | Trh3 | Trh4 |
|---|---|---|---|---|---|---|
| 0 | schema_1 | col_1 | NaN | 0.01 | NaN | NaN |
| 1 | schema_2 | col_2 | 0.02 | 0.03 | NaN | NaN |
| 2 | schema_3 | col_3 | 0.03 | 0.04 | NaN | NaN |
| 3 | schema_3 | col_3 | NaN | NaN | 0.05 | NaN |
| 4 | schema_4 | col_4 | NaN | NaN | 0.06 | NaN |
| 5 | schema_4 | col_4 | NaN | NaN | NaN | 0.07 |
I explored following approach: Split row into 2 based on condition pandas. However, this approach is only suitable for splitting a row if there are no Nan values in the two columns.
>Solution :
handling a jump
cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']
special = ['Trh1', 'Trh2']
others = list(df.columns.difference(cols))
out = (df
.assign(init=lambda d: d[others].isna().all(axis=1))
[cols+['init']+others]
.set_index(cols).stack().to_frame()
.assign(n=lambda d: d.groupby(level=range(df.index.ndim)).cumcount())
.set_index('n', append=True)[0]
.unstack(-2)
.reset_index()
)
out.loc[out['init'].isna(), special] = np.nan
out = out.drop(columns=['n', 'init'])
out = out.dropna(subset=special+others, how='all')
Output:
Index Schema Column Trh1 Trh2 Trh3 Trh4
0 0 schema_1 col_1 NaN 0.01 NaN NaN
1 1 schema_2 col_2 0.02 0.03 NaN NaN
2 2 schema_3 col_3 0.03 0.04 NaN NaN
3 2 schema_3 col_3 NaN NaN 0.05 NaN
5 3 schema_4 col_4 NaN NaN 0.06 NaN
6 3 schema_4 col_4 NaN NaN NaN 0.07
original answer
You can use reshaping with de-duplication, with stack/unstack:
cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']
out = (df
# stack and remove NaNs
.set_index(cols).stack().to_frame()
# deduplicate
.assign(n=lambda d: d.groupby(level=range(df.index.ndim)).cumcount())
# reshape to original shape
.set_index('n', append=True)[0]
.unstack(-2)
# cleanup
.reset_index()
.drop(columns='n')
)
# add rows that were dropped because having no value
out = pd.concat([df[df[df.columns.difference(cols)].isna().all(axis=1)], out],
ignore_index=True).sort_values(by='Index') # optional
NB. this requires no duplicates in the initial cols.
Or with melt, which might be more memory intensive but also more robust if you have duplicates:
cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']
out = (df.melt(cols)
# drop NAs, except first row per group
.loc[lambda d: d['value'].notna() | ~d[cols].duplicated()]
# de-duplicate
.assign(n=lambda d: d.groupby(cols, dropna=False).cumcount())
# reshape
.pivot(index=cols+['n'], columns='variable', values='value')
# cleanup
.reset_index().rename_axis(index=None, columns=None)
)
Output:
Index Schema Column Trh1 Trh2 Trh3 Trh4
0 0 schema_1 col_1 NaN 0.01 NaN NaN
1 1 schema_2 col_2 0.02 0.03 NaN NaN
2 2 schema_3 col_3 0.03 0.04 0.05 NaN
3 3 schema_4 col_4 NaN NaN 0.06 NaN
4 3 schema_4 col_4 NaN NaN NaN 0.07