Order each group by one column and get top elements depending on each group

November 17, 2021

I have a pandas dataframe with the following columns:

month	userID	probability	top_users_that_month
9	A	5.231	3
9	B	0.963	3
9	E	7.445	3
9	D	9.343	3
9	C	3.231	3
10	X	0.963	1
10	B	7.445	1
10	A	2.343	1
13	Y	1.963	2
13	B	9.445	2
13	D	0.343	2

I would like to create a flag column, which has values 1 or 0, depending on if the userID was among the top_users (according to their probability column) for each month. Note that the number of top users in each month is indicated by the top_users_that_month column.

The desired output would be the following:

month	userID	probability	top_users_that_month	flag_among_top_users
9	A	5.231	3	1
9	B	0.963	3	0
9	E	7.445	3	1
9	D	9.343	3	1
9	C	3.231	3	0
10	X	0.963	1	0
10	B	7.445	1	1
10	A	2.343	1	0
11	Y	1.963	2	1
11	B	9.445	2	1
11	D	0.343	2	0

I have tried to create the column using groupby and apply but didn’t get the result I wanted.

>Solution :

Since the number of users to be marked differ for each group, the easiest approach would be to use a custom function using groupby and apply.

For example:

def get_top_users(df):
    top_users = df['top_users_that_month'].iloc[0]
    idx = df.sort_values('probability', ascending=False).head(top_users).index
    df['flag_among_top_users'] = 0
    df.loc[idx, 'flag_among_top_users'] = 1
    return df
    
df.groupby('month').apply(get_top_users)

Result:

    month userID  probability  top_users_that_month  flag_among_top_users
0       9      A        5.231                     3                     1
1       9      B        0.963                     3                     0
2       9      E        7.445                     3                     1
3       9      D        9.343                     3                     1
4       9      C        3.231                     3                     0
5      10      X        0.963                     1                     0
6      10      B        7.445                     1                     1
7      10      A        2.343                     1                     0
8      13      Y        1.963                     2                     1
9      13      B        9.445                     2                     1
10     13      D        0.343                     2                     0

It is also possible to use nlargest instead of sorting inside get_top_users, but there could be issues if there are two identical probability values.