Home How to find first occurence of a specified integer over multiple columns using Pandas?

Questions

How to find first occurence of a specified integer over multiple columns using Pandas?

December 7, 2022

I have this dataset:

        2010            2011            2012
0   NaN         NaN         505303.0
1   542225.0    NaN         210530.0
2   123210.0    429439.0    543964.0
3   434304.0    540325.0    NaN
4   750450.0    143430.0    540425.0
5   543015.0    549320.0    104365.0

and I want first to find the first digit for each cell like this (see MWE):

    2010    2011    2012
0   -   -   5
1   5   -   2
2   1   4   5
3   4   5   -
4   7   1   5
5   5   5   1

but finally I want to count the first occurence of 5 in each row, and which year it occured. If 5 occurs several places, I only want to know the first one. How do I accomplish this?

Below you will find the MWE:

import numpy as np

data = {"2010": [np.nan, 542225, 123210, 434304, 750450, 543015],
        "2011": [np.nan, np.nan, 429439, 540325, 143430, 549320],
        "2012": [505303, 210530, 543964, np.nan, 540425, 104365]
       }

df_t = pd.DataFrame(data)

for col in df_t.columns:
    df_t[col] = (df_t[col]
           .fillna(-1)
           .astype(str)
           .str[0]
           )

>Solution :

Your solution should be used with DataFrame.apply:

df = df_t.fillna(-1).astype(str).apply(lambda x: x.str[0])
print (df)
  2010 2011 2012
0    -    -    5
1    5    -    2
2    1    4    5
3    4    5    -
4    7    1    5
5    5    5    1

Then compare by string '5' and get first matched year by DataFrame.idxmax, if no match get None:

m = df.eq('5')
df['Year'] = m.idxmax(axis=1).where(m.any(axis=1), None)
print (df)
  2010 2011 2012  Year
0    -    -    5  2012
1    5    -    2  2010
2    1    4    5  2012
3    4    5    -  2011
4    7    1    5  2012
5    5    5    1  2010

Another idea with numeric only values:

df = df_t // (10 ** np.log10(df_t).fillna(1).astype(int))
print (df)
   2010  2011  2012
0   NaN   NaN   5.0
1   5.0   NaN   2.0
2   1.0   4.0   5.0
3   4.0   5.0   NaN
4   7.0   1.0   5.0
5   5.0   5.0   1.0

m = df.eq(5)
df['Year'] = m.idxmax(axis=1).where(m.any(axis=1), None)
print (df)
   2010  2011  2012  Year
0   NaN   NaN   5.0  2012
1   5.0   NaN   2.0  2010
2   1.0   4.0   5.0  2012
3   4.0   5.0   NaN  2011
4   7.0   1.0   5.0  2012
5   5.0   5.0   1.0  2010