How to randomly split a grouped dataframe in Python

February 1, 2023

I have the following dataframe:

import pandas as pd

df = pd.DataFrame({
    "player_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "year":      [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    "overall":   [20, 16, 7, 3, 8, 80, 20, 12, 9, 3, 2, 1]})

What is the easiest way to shuffle it randomly while keeping the rows of each player_id together, e.g.

player_id	year	overall
4	1	20
4	2	12
1	1	20
1	2	16
…	…	…

I then want to split it 80-20 into a training set and a test set that don’t share any player_id.

>Solution :

As Quang Hoang suggested in the comments, you can split the unique ids first and then select the rows based on those ids.

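# Pick 20% of the unique player ids at random (pass random_state=... to make it reproducible)
test_ids = df["player_id"].drop_duplicates().sample(frac=0.2).values
# e.g. array([2]); the result varies from run to run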

train_data = df[~df["player_id"].isin(test_ids)]
"""
    player_id  year  overall
0           1     1       20
1           1     2       16
4           3     1        8
5           3     2       80
6           4     1       20
7           4     2       12
8           5     1        9
9           5     2        3
10          6     1        2
11          6     2        1
"""

test_data = df[df["player_id"].isin(test_ids)]
"""
   player_id  year  overall
2          2     1        7
3          2     2        3
"""
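
The question also asks for the players to be shuffled while each player's rows stay together. A minimal sketch of one way to do both steps at once, assuming pandas only; the helper name `shuffle_and_split_by_group` and its parameters are made up for illustration and are not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "player_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "year":      [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    "overall":   [20, 16, 7, 3, 8, 80, 20, 12, 9, 3, 2, 1],
})


def shuffle_and_split_by_group(data, group_col, test_frac=0.2, seed=None):
    """Shuffle whole groups, then split them into train and test sets.

    Hypothetical helper for illustration; not a pandas function.
    """
    # One entry per group, in random order (seed makes it reproducible).
    ids = (
        data[group_col]
        .drop_duplicates()
        .sample(frac=1, random_state=seed)
        .tolist()
    )
    # Reassemble the rows group by group; the order inside each group is kept.
    shuffled = pd.concat([data[data[group_col] == i] for i in ids])
    # Number of groups that go to the test set (at least one).
    n_test = max(1, round(len(ids) * test_frac))
    test_ids = ids[:n_test]
    in_test = shuffled[group_col].isin(test_ids)
    return shuffled, shuffled[~in_test], shuffled[in_test]


shuffled, train_data, test_data = shuffle_and_split_by_group(
    df, "player_id", test_frac=0.2, seed=0
)
print(shuffled)
print(train_data["player_id"].unique(), test_data["player_id"].unique())
```

Because the split is made on the ids rather than on the rows, no player_id can end up in both sets. The row-level ratio is only approximately 80-20 when players have different numbers of rows. If scikit-learn is available, its GroupShuffleSplit class is another option for the splitting step.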