Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to intersect values over multiple columns in pandas

Please find below my input/output (desired) :

INPTUT :

  Col1 Col2 Col3 Col4 Col5
0    A  NaN  NaN  NaN  NaN
1  NaN    B    C  NaN  NaN
2  NaN  NaN  NaN    D  NaN
3  NaN    E  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN    F

OUTPUT (desired) :

  Col1 Col2 Col3 Col4 Col5 Intersection
0    A  NaN  NaN  NaN  NaN            A
1  NaN    B    C  NaN  NaN       [B, C]
2  NaN  NaN  NaN    D  NaN            D
3  NaN    E  NaN  NaN  NaN            E
4  NaN  NaN  NaN  NaN    F            F

I’m trying to add a column Intersection that will hold the intersection of all the columns.

My dataset looks exactly like this dataframe (except that he has thousands of rows and dozens of columns) :

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

import pandas as pd
import numpy as np

data = {'Col1': ['A', np.nan, np.nan, np.nan, np.nan],
        'Col2': [np.nan, 'B', np.nan, 'E', np.nan],
        'Col3': [np.nan, 'C', np.nan, np.nan, np.nan],
        'Col4': [np.nan, np.nan, 'D', np.nan, np.nan],
        'Col5': [np.nan, np.nan, np.nan, np.nan, 'F']
       }

df = pd.DataFrame(data)

Do you have any suggestions, please ?

>Solution :

Use DataFrame.stack for reshape with remove missing values and aggregate custom function for avoid one element lists:

f = lambda x: list(x) if len(x) > 1 else x
df['Intersection'] = df.stack().groupby(level=0).agg(f)

print (df)
  Col1 Col2 Col3 Col4 Col5 Intersection
0    A  NaN  NaN  NaN  NaN            A
1  NaN    B    C  NaN  NaN       [B, C]
2  NaN  NaN  NaN    D  NaN            D
3  NaN    E  NaN  NaN  NaN            E
4  NaN  NaN  NaN  NaN    F            F

Or if one element list should be instead scalars use:

df['Intersection'] = df.stack().groupby(level=0).agg(list)

print (df)
  Col1 Col2 Col3 Col4 Col5 Intersection
0    A  NaN  NaN  NaN  NaN          [A]
1  NaN    B    C  NaN  NaN       [B, C]
2  NaN  NaN  NaN    D  NaN          [D]
3  NaN    E  NaN  NaN  NaN          [E]
4  NaN  NaN  NaN  NaN    F          [F]
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading