Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Year over Year difference and selecting maximum row in pandas

I have a dataframe given as below:

ID YEAR     NPS
500 2020     0
500 2021     0
500 2022     0
501 2020    32      
501 2021    52      
501 2022    99      
503 2021    1       
503 2022    4       
504 2020    45      
504 2021    55      
504 2022    50      

I have to calculate year over year difference as given below:

ID      YEAR    NPS  nps_gain_yoy
500     2020     0      0
500     2021     0      0
500     2022     0      0
501     2020    32      0
501     2021    52      20
501     2022    99      47
503     2021    1       0
503     2022    4       3
504     2020    45      0
504     2021    55      10
504     2022    50      -5

In above output for starting year 2020 or first occurance of Id nps_gain_yoy needs to be zero then for 2021 nps_gain_yoy is difference between nps of 2021 and 2020 i.e 52-32 = 20 as shown in output for ID 501 for year 2021 and so on.
After this I need to pick the maximum difference or maximum nps_gain_yoy for each ID as given in below output:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

ID  YEAR    NPS NPS_gain_yoy
501 2022     0    0
501 2022    99   47
503 2022    4    3
504 2021    55   10

Here 47 is the maximum nps gain for ID 501 in year 2022 similarly 3 for ID 503 and 4 for Id 504.

Please help I need it urgently.

>Solution :

If years are consecutive per ID first use DataFrameGroupBy.diff:

df = df.sort_values(['ID','YEAR'])

df['nps_gain_yoy'] = df.groupby('ID')['NPS'].diff().fillna(0)
print (df)
     ID  YEAR  NPS  nps_gain_yoy
0   500  2020    0           0.0
1   500  2021    0           0.0
2   500  2022    0           0.0
3   501  2020   32           0.0
4   501  2021   52          20.0
5   501  2022   99          47.0
6   503  2021    1           0.0
7   503  2022    4           3.0
8   504  2020   45           0.0
9   504  2021   55          10.0
10  504  2022   50          -5.0

And then DataFrameGroupBy.idxmax with DataFrame.loc:

df1 = df.loc[df.iloc[::-1].groupby('ID')['nps_gain_yoy'].idxmax()]
#alternative solution
#df1 = df.sort_values(['ID','nps_gain_yoy']).drop_duplicates('ID', keep='last')
print (df1)
    ID  YEAR  NPS  nps_gain_yoy
2  500  2022    0           0.0
5  501  2022   99          47.0
7  503  2022    4           3.0
9  504  2021   55          10.0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading