Optimizing a map applied to a groupby object

I have the following dataframe:

import pandas as pd

test_df = pd.DataFrame({'Category': {0: 'product-availability address-confirmation input',
  1: 'registration register-data-confirmation options',
  2: 'onboarding return-start input',
  3: 'registration register-data-confirmation input',
  4: 'decision-tree first-interaction-validation options'},
 'Original_UserId': {0: '5511949551865@wa.gw.msging.net',
  1: '5511949551865@wa.gw.msging.net',
  2: '5511949551865@wa.gw.msging.net',
  3: '5511949551865@wa.gw.msging.net',
  4: '5511949551865@wa.gw.msging.net'}})

Thanks to jezrael, I am applying the following map, which follows the logic given in the question "After certain string is found mark every after string as true, pandas":

test_df.groupby('Original_UserId',observed=True)['Category'].apply(lambda s : s.eq('onboarding return-start input').cummax())

This returns the following series:

pd.Series({0: False, 1: False, 2: True, 3: True, 4: True})

The problem is that when I apply this condition to a larger dataset, the code takes quite a while to run. Any clues on how to optimize it?

Solution:

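First compare the Category column with the target string, then use GroupBy.cummax grouped by the Original_UserId column. This avoids calling a Python lambda once per group, which is typically what slows down the apply version when there are many users: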

s = (test_df['Category'].eq('onboarding return-start input')
                        .groupby(test_df['Original_UserId'],observed=True)
                        .cummax())
print (s)
0    False
1    False
2     True
3     True
4     True
Name: Category, dtype: bool
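
To check on your own data that both approaches agree, and to see the difference in run time, a rough sketch like the following may help. The synthetic frame, the names bench_df, TARGET, slow and fast, and the sizes are assumptions made for illustration only. group_keys=False is passed to the apply version because, as far as I know, recent pandas versions would otherwise prepend the group key to the result index. Actual timings depend on the machine and the pandas version.

```python
import time

import pandas as pd

TARGET = 'onboarding return-start input'

# Synthetic data (an assumption for illustration): 2,000 users, 10 rows each,
# with the target category placed at position 4 for every user.
n_users, rows_per_user = 2_000, 10
bench_df = pd.DataFrame({
    'Original_UserId': [f'user-{u}' for u in range(n_users) for _ in range(rows_per_user)],
    'Category': [TARGET if r == 4 else 'other input'
                 for _ in range(n_users) for r in range(rows_per_user)],
})

# Approach from the question: a Python lambda is called once per group.
start = time.perf_counter()
slow = (bench_df.groupby('Original_UserId', group_keys=False)['Category']
                .apply(lambda s: s.eq(TARGET).cummax()))
slow_seconds = time.perf_counter() - start

# Approach from the answer: one vectorized comparison, then GroupBy.cummax.
start = time.perf_counter()
fast = (bench_df['Category'].eq(TARGET)
                            .groupby(bench_df['Original_UserId'])
                            .cummax())
fast_seconds = time.perf_counter() - start

# Both approaches should flag exactly the same rows.
same = slow.sort_index().astype(bool).tolist() == fast.astype(bool).tolist()
print(f'apply + lambda: {slow_seconds:.3f}s, '
      f'GroupBy.cummax: {fast_seconds:.3f}s, identical: {same}')
```

The gap between the two timings usually grows with the number of distinct users, since the apply version pays the per-group Python overhead each time.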

Another idea is to create a helper column:

s = (test_df.assign(tmp = test_df['Category'].eq('onboarding return-start input'))
            .groupby('Original_UserId',observed=True)['tmp']
            .cummax())
print (s)
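
The sample data contains a single user, so it does not show that the flag is tracked separately per Original_UserId. A small sketch with two interleaved users makes this visible. The frame demo_df and the column name after_target are hypothetical, chosen only for this example.

```python
import pandas as pd

TARGET = 'onboarding return-start input'

# Two hypothetical users, 'a' and 'b', with their rows interleaved.
demo_df = pd.DataFrame({
    'Original_UserId': ['a', 'b', 'a', 'b', 'a', 'b'],
    'Category': ['x', 'x', TARGET, 'y', 'y', TARGET],
})

# Flag every row at or after the first occurrence of TARGET, per user,
# and store the result as a new column aligned on the original index.
demo_df['after_target'] = (demo_df['Category'].eq(TARGET)
                                              .groupby(demo_df['Original_UserId'])
                                              .cummax())
print(demo_df)
```

User a is flagged from its second row onward, while user b is flagged only on its last row, and the rows do not need to be sorted by user first.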