Home If a row contains at least two not NaN values, split the row into two separate ones

Questions

If a row contains at least two not NaN values, split the row into two separate ones

byMR

October 13, 2023

I am trying to convert datafarame to desired output format with requirements mentioned below.

Provided requirements:

Each row can only keep one not Nan value (except Trh1 and Trh2)
I want to avoid methods that iterate over each row for performance reasons.
I have only included four columns, for example, in a real scenario there are many more columns to share

Example:

Input:

Index	Schema	Column	Trh1	Trh2	Trh3	Trh4
0	schema_1	col_1	NaN	0.01	NaN	NaN
1	schema_2	col_2	0.02	0.03	NaN	NaN
2	schema_3	col_3	0.03	0.04	0.05	NaN
3	schema_4	col_4	NaN	NaN	0.06	0.07

Expected output:

Index	Schema	Column	Trh1	Trh2	Trh3	Trh4
0	schema_1	col_1	NaN	0.01	NaN	NaN
1	schema_2	col_2	0.02	0.03	NaN	NaN
2	schema_3	col_3	0.03	0.04	NaN	NaN
3	schema_3	col_3	NaN	NaN	0.05	NaN
4	schema_4	col_4	NaN	NaN	0.06	NaN
5	schema_4	col_4	NaN	NaN	NaN	0.07

I explored following approach: Split row into 2 based on condition pandas. However, this approach is only suitable for splitting a row if there are no Nan values in the two columns.

>Solution :

handling a jump

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']
special = ['Trh1', 'Trh2']
others = list(df.columns.difference(cols))


out = (df
   .assign(init=lambda d: d[others].isna().all(axis=1))
   [cols+['init']+others]
   .set_index(cols).stack().to_frame()
   .assign(n=lambda d: d.groupby(level=range(df.index.ndim)).cumcount())
   .set_index('n', append=True)[0]
   .unstack(-2)
   .reset_index()
)

out.loc[out['init'].isna(), special] = np.nan

out = out.drop(columns=['n', 'init'])

out = out.dropna(subset=special+others, how='all')

Output:

 Index    Schema Column  Trh1  Trh2  Trh3  Trh4
0      0  schema_1  col_1   NaN  0.01   NaN   NaN
1      1  schema_2  col_2  0.02  0.03   NaN   NaN
2      2  schema_3  col_3  0.03  0.04   NaN   NaN
3      2  schema_3  col_3   NaN   NaN  0.05   NaN
5      3  schema_4  col_4   NaN   NaN  0.06   NaN
6      3  schema_4  col_4   NaN   NaN   NaN  0.07

original answer

You can use reshaping with de-duplication, with stack/unstack:

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']

out = (df
   # stack and remove NaNs
   .set_index(cols).stack().to_frame()
   # deduplicate
   .assign(n=lambda d: d.groupby(level=range(df.index.ndim)).cumcount())
   # reshape to original shape
   .set_index('n', append=True)[0]
   .unstack(-2)
   # cleanup
   .reset_index()
   .drop(columns='n')
)

# add rows that were dropped because having no value
out = pd.concat([df[df[df.columns.difference(cols)].isna().all(axis=1)], out],
                ignore_index=True).sort_values(by='Index') # optional

NB. this requires no duplicates in the initial cols.

Or with melt, which might be more memory intensive but also more robust if you have duplicates:

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']

out = (df.melt(cols)
       # drop NAs, except first row per group
       .loc[lambda d: d['value'].notna() | ~d[cols].duplicated()]
       # de-duplicate
       .assign(n=lambda d: d.groupby(cols, dropna=False).cumcount())
       # reshape
       .pivot(index=cols+['n'], columns='variable', values='value')
       # cleanup
       .reset_index().rename_axis(index=None, columns=None)
      )

Output:

   Index    Schema Column  Trh1  Trh2  Trh3  Trh4
0      0  schema_1  col_1   NaN  0.01   NaN   NaN
1      1  schema_2  col_2  0.02  0.03   NaN   NaN
2      2  schema_3  col_3  0.03  0.04  0.05   NaN
3      3  schema_4  col_4   NaN   NaN  0.06   NaN
4      3  schema_4  col_4   NaN   NaN   NaN  0.07