Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Order each group by one column and get top elements depending on each group

I have a pandas dataframe with the following columns:

month userID probability top_users_that_month
9 A 5.231 3
9 B 0.963 3
9 E 7.445 3
9 D 9.343 3
9 C 3.231 3
10 X 0.963 1
10 B 7.445 1
10 A 2.343 1
13 Y 1.963 2
13 B 9.445 2
13 D 0.343 2

I would like to create a flag column, which has values 1 or 0, depending on if the userID was among the top_users (according to their probability column) for each month. Note that the number of top users in each month is indicated by the top_users_that_month column.

The desired output would be the following:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

month userID probability top_users_that_month flag_among_top_users
9 A 5.231 3 1
9 B 0.963 3 0
9 E 7.445 3 1
9 D 9.343 3 1
9 C 3.231 3 0
10 X 0.963 1 0
10 B 7.445 1 1
10 A 2.343 1 0
11 Y 1.963 2 1
11 B 9.445 2 1
11 D 0.343 2 0

I have tried to create the column using groupby and apply but didn’t get the result I wanted.

>Solution :

Since the number of users to be marked differ for each group, the easiest approach would be to use a custom function using groupby and apply.

For example:

def get_top_users(df):
    top_users = df['top_users_that_month'].iloc[0]
    idx = df.sort_values('probability', ascending=False).head(top_users).index
    df['flag_among_top_users'] = 0
    df.loc[idx, 'flag_among_top_users'] = 1
    return df
    
df.groupby('month').apply(get_top_users)

Result:

    month userID  probability  top_users_that_month  flag_among_top_users
0       9      A        5.231                     3                     1
1       9      B        0.963                     3                     0
2       9      E        7.445                     3                     1
3       9      D        9.343                     3                     1
4       9      C        3.231                     3                     0
5      10      X        0.963                     1                     0
6      10      B        7.445                     1                     1
7      10      A        2.343                     1                     0
8      13      Y        1.963                     2                     1
9      13      B        9.445                     2                     1
10     13      D        0.343                     2                     0

It is also possible to use nlargest instead of sorting inside get_top_users, but there could be issues if there are two identical probability values.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading