Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

If a row contains at least two not NaN values, split the row into two separate ones

I am trying to convert datafarame to desired output format with requirements mentioned below.


Provided requirements:

  • Each row can only keep one not Nan value (except Trh1 and Trh2)
  • I want to avoid methods that iterate over each row for performance reasons.
  • I have only included four columns, for example, in a real scenario there are many more columns to share

Example:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Input:

Index Schema Column Trh1 Trh2 Trh3 Trh4
0 schema_1 col_1 NaN 0.01 NaN NaN
1 schema_2 col_2 0.02 0.03 NaN NaN
2 schema_3 col_3 0.03 0.04 0.05 NaN
3 schema_4 col_4 NaN NaN 0.06 0.07

Expected output:

Index Schema Column Trh1 Trh2 Trh3 Trh4
0 schema_1 col_1 NaN 0.01 NaN NaN
1 schema_2 col_2 0.02 0.03 NaN NaN
2 schema_3 col_3 0.03 0.04 NaN NaN
3 schema_3 col_3 NaN NaN 0.05 NaN
4 schema_4 col_4 NaN NaN 0.06 NaN
5 schema_4 col_4 NaN NaN NaN 0.07

I explored following approach: Split row into 2 based on condition pandas. However, this approach is only suitable for splitting a row if there are no Nan values in the two columns.

>Solution :

handling a jump

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']
special = ['Trh1', 'Trh2']
others = list(df.columns.difference(cols))


out = (df
   .assign(init=lambda d: d[others].isna().all(axis=1))
   [cols+['init']+others]
   .set_index(cols).stack().to_frame()
   .assign(n=lambda d: d.groupby(level=range(df.index.ndim)).cumcount())
   .set_index('n', append=True)[0]
   .unstack(-2)
   .reset_index()
)

out.loc[out['init'].isna(), special] = np.nan

out = out.drop(columns=['n', 'init'])

out = out.dropna(subset=special+others, how='all')

Output:

 Index    Schema Column  Trh1  Trh2  Trh3  Trh4
0      0  schema_1  col_1   NaN  0.01   NaN   NaN
1      1  schema_2  col_2  0.02  0.03   NaN   NaN
2      2  schema_3  col_3  0.03  0.04   NaN   NaN
3      2  schema_3  col_3   NaN   NaN  0.05   NaN
5      3  schema_4  col_4   NaN   NaN  0.06   NaN
6      3  schema_4  col_4   NaN   NaN   NaN  0.07

original answer

You can use reshaping with de-duplication, with stack/unstack:

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']

out = (df
   # stack and remove NaNs
   .set_index(cols).stack().to_frame()
   # deduplicate
   .assign(n=lambda d: d.groupby(level=range(df.index.ndim)).cumcount())
   # reshape to original shape
   .set_index('n', append=True)[0]
   .unstack(-2)
   # cleanup
   .reset_index()
   .drop(columns='n')
)

# add rows that were dropped because having no value
out = pd.concat([df[df[df.columns.difference(cols)].isna().all(axis=1)], out],
                ignore_index=True).sort_values(by='Index') # optional

NB. this requires no duplicates in the initial cols.

Or with melt, which might be more memory intensive but also more robust if you have duplicates:

cols = ['Index', 'Schema', 'Column', 'Trh1', 'Trh2']

out = (df.melt(cols)
       # drop NAs, except first row per group
       .loc[lambda d: d['value'].notna() | ~d[cols].duplicated()]
       # de-duplicate
       .assign(n=lambda d: d.groupby(cols, dropna=False).cumcount())
       # reshape
       .pivot(index=cols+['n'], columns='variable', values='value')
       # cleanup
       .reset_index().rename_axis(index=None, columns=None)
      )

Output:

   Index    Schema Column  Trh1  Trh2  Trh3  Trh4
0      0  schema_1  col_1   NaN  0.01   NaN   NaN
1      1  schema_2  col_2  0.02  0.03   NaN   NaN
2      2  schema_3  col_3  0.03  0.04  0.05   NaN
3      3  schema_4  col_4   NaN   NaN  0.06   NaN
4      3  schema_4  col_4   NaN   NaN   NaN  0.07
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading