Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

get a count of vaues that are on or before a certain date from a pandas dataframe

I have a date 2020-05-31 and the following dataframe, where the column names are statuses:

     rejected revocation    decision         rfe   interview premium    received rfe_response  biometrics withdrawal appeal
196      None       None  2020-01-28        None        None    None  2020-01-16         None        None       None   None
203      None       None  2020-06-20  2020-04-01        None    None  2020-01-03   2020-08-08        None       None   None
209      None       None  2020-12-03  2020-06-03        None    None  2020-01-03         None        None       None   None
213      None       None  2020-06-23        None        None    None  2020-01-27         None  2020-02-19       None   None
1449     None       None  2020-05-12        None        None    None  2020-01-06         None        None       None   None
1660     None       None  2021-09-23  2021-05-27        None    None  2020-01-21   2021-08-17        None       None   None

I want to get the latest step each row is in, such that the latest steap is on or before the date mentioed above 2020-05-31

So the output for this woud be:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

196: decision
203: rfe
209: received
213: biometrics
1449: decision
1660: received

or even a count works:

{
rejected = 0,
revocation = 0,
decision = 2,
rfe = 1,
interview = 0,
premium = 0,
received = 2,
rfe_response = 0,
biometrics 0 0,
withdrawal = 0,
appeal = 0 
}

Currently i am looping through each row, where i create a dict of {status: date}, then i sort by date, and take the key of the last value (which is a status)

This is very slow and takes forever

Is there a simpler or cleaner way of doing it?

NOTE: Each row will have atleast one date, in decision column

>Solution :

you can mask where the date is bigger than the chosen date, then use idxmax along the columns.

dt_max = '2020-05-31'
res = df.where(df.le(dt_max)).astype('datetime64[ns]').idxmax(axis=1)
print(res)
# 196       decision
# 203            rfe
# 209       received
# 213     biometrics
# 1449      decision
# 1660      received
# dtype: object

And for the count, per status, then you can do with value_counts like

dict_res = res.value_counts().reindex(df.columns, fill_value=0).to_dict()
print(dict_res)
#{'rejected': 0, 'revocation': 0, 'decision': 2, 'rfe': 1, 'interview': 0, 'premium': 0, 
# 'received': 2, 'rfe_response': 0, 'biometrics': 1, 'withdrawal': 0, 'appeal': 0}
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading