I have a date 2020-05-31 and the following dataframe, where the column names are statuses:
rejected revocation decision rfe interview premium received rfe_response biometrics withdrawal appeal
196 None None 2020-01-28 None None None 2020-01-16 None None None None
203 None None 2020-06-20 2020-04-01 None None 2020-01-03 2020-08-08 None None None
209 None None 2020-12-03 2020-06-03 None None 2020-01-03 None None None None
213 None None 2020-06-23 None None None 2020-01-27 None 2020-02-19 None None
1449 None None 2020-05-12 None None None 2020-01-06 None None None None
1660 None None 2021-09-23 2021-05-27 None None 2020-01-21 2021-08-17 None None None
I want to get the latest step each row is in, such that the latest steap is on or before the date mentioed above 2020-05-31
So the output for this woud be:
196: decision
203: rfe
209: received
213: biometrics
1449: decision
1660: received
or even a count works:
{
rejected = 0,
revocation = 0,
decision = 2,
rfe = 1,
interview = 0,
premium = 0,
received = 2,
rfe_response = 0,
biometrics 0 0,
withdrawal = 0,
appeal = 0
}
Currently i am looping through each row, where i create a dict of {status: date}, then i sort by date, and take the key of the last value (which is a status)
This is very slow and takes forever
Is there a simpler or cleaner way of doing it?
NOTE: Each row will have atleast one date, in decision column
>Solution :
you can mask where the date is bigger than the chosen date, then use idxmax along the columns.
dt_max = '2020-05-31'
res = df.where(df.le(dt_max)).astype('datetime64[ns]').idxmax(axis=1)
print(res)
# 196 decision
# 203 rfe
# 209 received
# 213 biometrics
# 1449 decision
# 1660 received
# dtype: object
And for the count, per status, then you can do with value_counts like
dict_res = res.value_counts().reindex(df.columns, fill_value=0).to_dict()
print(dict_res)
#{'rejected': 0, 'revocation': 0, 'decision': 2, 'rfe': 1, 'interview': 0, 'premium': 0,
# 'received': 2, 'rfe_response': 0, 'biometrics': 1, 'withdrawal': 0, 'appeal': 0}