Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to find first occurence of a specified integer over multiple columns using Pandas?

I have this dataset:

        2010            2011            2012
0   NaN         NaN         505303.0
1   542225.0    NaN         210530.0
2   123210.0    429439.0    543964.0
3   434304.0    540325.0    NaN
4   750450.0    143430.0    540425.0
5   543015.0    549320.0    104365.0

and I want first to find the first digit for each cell like this (see MWE):

    2010    2011    2012
0   -   -   5
1   5   -   2
2   1   4   5
3   4   5   -
4   7   1   5
5   5   5   1

but finally I want to count the first occurence of 5 in each row, and which year it occured. If 5 occurs several places, I only want to know the first one. How do I accomplish this?

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Below you will find the MWE:

import numpy as np

data = {"2010": [np.nan, 542225, 123210, 434304, 750450, 543015],
        "2011": [np.nan, np.nan, 429439, 540325, 143430, 549320],
        "2012": [505303, 210530, 543964, np.nan, 540425, 104365]
       }

df_t = pd.DataFrame(data)

for col in df_t.columns:
    df_t[col] = (df_t[col]
           .fillna(-1)
           .astype(str)
           .str[0]
           )

>Solution :

Your solution should be used with DataFrame.apply:

df = df_t.fillna(-1).astype(str).apply(lambda x: x.str[0])
print (df)
  2010 2011 2012
0    -    -    5
1    5    -    2
2    1    4    5
3    4    5    -
4    7    1    5
5    5    5    1

Then compare by string '5' and get first matched year by DataFrame.idxmax, if no match get None:

m = df.eq('5')
df['Year'] = m.idxmax(axis=1).where(m.any(axis=1), None)
print (df)
  2010 2011 2012  Year
0    -    -    5  2012
1    5    -    2  2010
2    1    4    5  2012
3    4    5    -  2011
4    7    1    5  2012
5    5    5    1  2010

Another idea with numeric only values:

df = df_t // (10 ** np.log10(df_t).fillna(1).astype(int))
print (df)
   2010  2011  2012
0   NaN   NaN   5.0
1   5.0   NaN   2.0
2   1.0   4.0   5.0
3   4.0   5.0   NaN
4   7.0   1.0   5.0
5   5.0   5.0   1.0

m = df.eq(5)
df['Year'] = m.idxmax(axis=1).where(m.any(axis=1), None)
print (df)
   2010  2011  2012  Year
0   NaN   NaN   5.0  2012
1   5.0   NaN   2.0  2010
2   1.0   4.0   5.0  2012
3   4.0   5.0   NaN  2011
4   7.0   1.0   5.0  2012
5   5.0   5.0   1.0  2010
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading