Optimizing a map applied to a groupby object

January 30, 2023

I have the following dataframe:

import pandas as pd

test_df = pd.DataFrame({'Category': {0: 'product-availability address-confirmation input',
  1: 'registration register-data-confirmation options',
  2: 'onboarding return-start input',
  3: 'registration register-data-confirmation input',
  4: 'decision-tree first-interaction-validation options'},
 'Original_UserId': {0: '5511949551865@wa.gw.msging.net',
  1: '5511949551865@wa.gw.msging.net',
  2: '5511949551865@wa.gw.msging.net',
  3: '5511949551865@wa.gw.msging.net',
  4: '5511949551865@wa.gw.msging.net'}})

Thanks to jezrael, I am applying the following map, which follows the logic given in the question "After certain string is found mark every after string as true, pandas":

test_df.groupby('Original_UserId',observed=True)['Category'].apply(lambda s : s.eq('onboarding return-start input').cummax())

This returns the following Series:

pd.Series({0: False, 1: False, 2: True, 3: True, 4: True})

The problem is that when I apply this condition to a larger dataset, the code takes quite a while to run. Any clues on how to optimize it?

>Solution :

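First compare the Category column with the target string, then call GroupBy.cummax grouped by the Original_UserId column. This avoids running a Python lambda once per group, which is what makes the apply version slow on large data: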

s = (test_df['Category'].eq('onboarding return-start input')
                        .groupby(test_df['Original_UserId'],observed=True)
                        .cummax())
print (s)
0    False
1    False
2     True
3     True
4     True
Name: Category, dtype: bool
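
To check that both approaches agree and to get a rough feel for the difference, something like the following sketch can be used. The synthetic data, the sizes, and the helper names slow_mask and fast_mask are assumptions made for illustration and are not part of the question. group_keys=False and sort_index() are used only so the apply result lines up with the original index across pandas versions.

```python
import time

import numpy as np
import pandas as pd

TARGET = "onboarding return-start input"

# Synthetic data: sizes and user ids are made up for this sketch.
rng = np.random.default_rng(0)
n_rows, n_users = 50_000, 2_000
categories = [
    TARGET,
    "registration register-data-confirmation input",
    "decision-tree first-interaction-validation options",
]
big_df = pd.DataFrame({
    "Category": rng.choice(categories, size=n_rows, p=[0.01, 0.5, 0.49]),
    "Original_UserId": rng.integers(0, n_users, size=n_rows).astype(str),
})


def slow_mask(df):
    # Approach from the question: one Python lambda call per group.
    return (df.groupby("Original_UserId", group_keys=False)["Category"]
              .apply(lambda s: s.eq(TARGET).cummax())
              .sort_index())


def fast_mask(df):
    # Approach from the answer: compare once, then grouped cummax.
    return (df["Category"].eq(TARGET)
              .groupby(df["Original_UserId"])
              .cummax())


t0 = time.perf_counter()
slow = slow_mask(big_df)
t1 = time.perf_counter()
fast = fast_mask(big_df)
t2 = time.perf_counter()

# Both masks should mark exactly the same rows.
same = bool((slow.to_numpy().astype(bool) == fast.to_numpy().astype(bool)).all())
print("identical:", same)
print(f"apply + lambda: {t1 - t0:.3f}s, grouped cummax: {t2 - t1:.3f}s")
```

The actual timings depend on the number of groups and on the pandas version, so they are worth measuring on your own data.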

Another idea is create helper column:

s = (test_df.assign(tmp = test_df['Category'].eq('onboarding return-start input'))
            .groupby('Original_UserId',observed=True)['tmp']
            .cummax())
print (s)
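
As a small illustration of how the resulting mask behaves with more than one user, here is a sketch on made-up data. The user ids, the short category names, and the column name after_return_start are hypothetical. It assumes the goal is to flag, or keep, every row from the first occurrence of the target string onward within each user.

```python
import pandas as pd

TARGET = "onboarding return-start input"

# Two hypothetical users; the flag restarts at False for each user.
df = pd.DataFrame({
    "Category": ["a", TARGET, "b", "c", "d", TARGET],
    "Original_UserId": ["u1", "u1", "u1", "u2", "u2", "u2"],
})

mask = (df["Category"].eq(TARGET)
          .groupby(df["Original_UserId"])
          .cummax())

# Store the flag as a column (hypothetical name) ...
df["after_return_start"] = mask

# ... or keep only the rows from the marker onward.
from_marker = df[mask]
print(from_marker)
```

Because the mask keeps the original index, it can be assigned back or used for boolean indexing without any extra alignment step.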