Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas: Find the left-most value in a pandas dataframe followed by all 1s

I have the following dataset

data = {'ID': ['A', 'B', 'C', 'D'], 
        '2012': [0, 1, 1, 1], 
        '2013': [0, 0, 1, 1], 
        '2014': [0, 0, 0, 1], 
        '2015': [0, 0, 1, 1], 
        '2016': [0, 0, 1, 0], 
        '2017': [1, 0, 1,1]}

df  = pd.DataFrame(data)

For each row I want to generate a new column – Baseline_Year – which assumes the name of the column with all values to the right that are equal to 1. In case there is not column with all the values equal to 1, I would like the Baseline_Year to be equal to missing.

See the expected results

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

data = {'ID': ['A', 'B', 'C', 'D', 'E'], 
        '2012': [0, 1, 1, 1, 1], 
        '2013': [0, 0, 1, 1, 1], 
        '2014': [0, 0, 0, 1, 1], 
        '2015': [0, 0, 1, 1, 1], 
        '2016': [0, 0, 1, 0, 1], 
        '2017': [1, 0, 1,1, 1],
        'Baseline_Year': [np.nan, np.nan, '2015','2017', '2012'],
       }

df_results  = pd.DataFrame(data)

df_results

>Solution :

I would use a boolean mask and idxmax:

# get year columns, identify rightmost 1s
m = (df.filter(regex=r'\d+')
       .loc[:, ::-1]
       .eq(1).cummin(axis=1)
       .loc[:, ::-1]
     )

df['Baseline_Year'] = m.idxmax(axis=1).where(m.any(axis=1))

Output:


  ID  2012  2013  2014  2015  2016  2017 Baseline_Year
0  A     0     0     0     0     0     1          2017
1  B     1     0     0     0     0     0           NaN
2  C     1     1     0     1     1     1          2015
3  D     1     1     1     1     0     1          2017

If you want a minimum number of 1s on the right:

N = 2

df['Baseline_Year'] = m.idxmax(axis=1).where(m.sum(axis=1).ge(N))

Output:


  ID  2012  2013  2014  2015  2016  2017 Baseline_Year
0  A     0     0     0     0     0     1           NaN
1  B     1     0     0     0     0     0           NaN
2  C     1     1     0     1     1     1          2015
3  D     1     1     1     1     0     1           NaN

Intermediate m:


    2012   2013   2014   2015   2016   2017
0  False  False  False  False  False   True
1  False  False  False  False  False  False
2  False  False  False   True   True   True
3  False  False  False  False  False   True
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading