Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Sum of rolling groupby based on date not returning expected results

I have the following DataFrame:

            match_date  player_id  player__match_count__won
1  2012-09-25 10:00:00      23640                         1
18 2012-09-25 10:00:00      17969                         0
2  2012-09-26 16:00:00      17268                         1
19 2012-09-26 16:00:00      11247                         0
0  2012-09-25 00:00:00      23640                         1
17 2012-09-25 00:00:00      17268                         0
3  2012-09-29 00:00:00      12202                         1
20 2012-09-29 00:00:00      23640                         0
5  2012-10-02 14:20:00       9241                         1
22 2012-10-02 14:20:00      23640                         0
6  2012-10-03 11:20:00      17268                         1
23 2012-10-03 11:20:00      23129                         0
4  2012-10-02 00:00:00      17268                         1
21 2012-10-02 00:00:00       8515                         0
7  2012-10-05 12:20:00      17268                         1
24 2012-10-05 12:20:00       9241                         0
8  2012-10-06 12:00:00      17268                         1
25 2012-10-06 12:00:00        940                         0
9  2012-10-07 11:00:00       6764                         1
26 2012-10-07 11:00:00      17268                         0
10 2012-10-15 00:00:00      23640                         1
27 2012-10-15 00:00:00      26418                         0
11 2012-10-16 00:00:00      23640                         1
28 2012-10-16 00:00:00       6764                         0
12 2012-10-19 14:00:00      14527                         1
29 2012-10-19 14:00:00      23640                         0
13 2012-10-20 13:20:00        469                         1
30 2012-10-20 13:20:00      17268                         0
14 2012-10-26 00:00:00      23640                         1
31 2012-10-26 00:00:00      11247                         0
15 2012-10-27 00:00:00       7507                         1
32 2012-10-27 00:00:00      23640                         0
16 2012-10-29 17:15:00         72                         1
33 2012-10-29 17:15:00      17268                         0

I would like to get the sum of player__match_count__won for a rolling 30 day period grouped by player ID excluding the current row in the rolling window. I’m using the below to return a DataFrame because in the real data I’ll be performing the sum on multiple columns:

df.groupby("player_id").rolling("30d", on="match_date", closed="left").sum().sort_values("match_date")

However, I get back the following:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

    player_id          match_date  player__match_count__won
0       23640 2012-09-25 00:00:00                       5.0 <- incorrect
17      17268 2012-09-25 00:00:00                       5.0 <- incorrect
1       23640 2012-09-25 10:00:00                       NaN
18      17969 2012-09-25 10:00:00                       NaN
2       17268 2012-09-26 16:00:00                       NaN
19      11247 2012-09-26 16:00:00                       NaN
20      23640 2012-09-29 00:00:00                       2.0
3       12202 2012-09-29 00:00:00                       NaN
4       17268 2012-10-02 00:00:00                       1.0
21       8515 2012-10-02 00:00:00                       NaN
22      23640 2012-10-02 14:20:00                       2.0
5        9241 2012-10-02 14:20:00                       NaN
6       17268 2012-10-03 11:20:00                       1.0
23      23129 2012-10-03 11:20:00                       NaN
24       9241 2012-10-05 12:20:00                       1.0
7       17268 2012-10-05 12:20:00                       3.0
8       17268 2012-10-06 12:00:00                       4.0
25        940 2012-10-06 12:00:00                       NaN
9        6764 2012-10-07 11:00:00                       NaN
26      17268 2012-10-07 11:00:00                       5.0
10      23640 2012-10-15 00:00:00                       2.0
27      26418 2012-10-15 00:00:00                       NaN
28       6764 2012-10-16 00:00:00                       1.0
11      23640 2012-10-16 00:00:00                       3.0
12      14527 2012-10-19 14:00:00                       NaN
29      23640 2012-10-19 14:00:00                       4.0
30      17268 2012-10-20 13:20:00                       5.0
13        469 2012-10-20 13:20:00                       NaN
31      11247 2012-10-26 00:00:00                       0.0
14      23640 2012-10-26 00:00:00                       2.0
32      23640 2012-10-27 00:00:00                       3.0
15       7507 2012-10-27 00:00:00                       NaN
33      17268 2012-10-29 17:15:00                       4.0
16         72 2012-10-29 17:15:00                       NaN

If I add in a sort on match_date upfront:

df.sort_values("match_date").groupby("player_id").rolling("30d", on="match_date", closed="left").sum().sort_values("match_date")

Then I get back the correct values:

    player_id          match_date  player__match_count__won
17      17268 2012-09-25 00:00:00                       NaN
0       23640 2012-09-25 00:00:00                       NaN
1       23640 2012-09-25 10:00:00                       1.0
18      17969 2012-09-25 10:00:00                       NaN
19      11247 2012-09-26 16:00:00                       NaN
2       17268 2012-09-26 16:00:00                       0.0
20      23640 2012-09-29 00:00:00                       2.0
3       12202 2012-09-29 00:00:00                       NaN
21       8515 2012-10-02 00:00:00                       NaN
4       17268 2012-10-02 00:00:00                       1.0
22      23640 2012-10-02 14:20:00                       2.0
5        9241 2012-10-02 14:20:00                       NaN
23      23129 2012-10-03 11:20:00                       NaN
6       17268 2012-10-03 11:20:00                       2.0
24       9241 2012-10-05 12:20:00                       1.0
7       17268 2012-10-05 12:20:00                       3.0
8       17268 2012-10-06 12:00:00                       4.0
25        940 2012-10-06 12:00:00                       NaN
9        6764 2012-10-07 11:00:00                       NaN
26      17268 2012-10-07 11:00:00                       5.0
10      23640 2012-10-15 00:00:00                       2.0
27      26418 2012-10-15 00:00:00                       NaN
28       6764 2012-10-16 00:00:00                       1.0
11      23640 2012-10-16 00:00:00                       3.0
12      14527 2012-10-19 14:00:00                       NaN
29      23640 2012-10-19 14:00:00                       4.0
30      17268 2012-10-20 13:20:00                       5.0
13        469 2012-10-20 13:20:00                       NaN
31      11247 2012-10-26 00:00:00                       0.0
14      23640 2012-10-26 00:00:00                       2.0
32      23640 2012-10-27 00:00:00                       3.0
15       7507 2012-10-27 00:00:00                       NaN
33      17268 2012-10-29 17:15:00                       4.0
16         72 2012-10-29 17:15:00                       NaN

However, the DataFrame was already sorted by match_date in the original example.

Why would this be happening?

Here’s the Dataframe in dictionary form for easy replication:

from pandas import Timestamp

df = {
    'match_date': {1: Timestamp('2012-09-25 10:00:00'), 18: Timestamp('2012-09-25 10:00:00'), 2: Timestamp('2012-09-26 16:00:00'), 19: Timestamp('2012-09-26 16:00:00'), 0: Timestamp('2012-09-25 00:00:00'), 17: Timestamp('2012-09-25 00:00:00'), 3: Timestamp('2012-09-29 00:00:00'), 20: Timestamp('2012-09-29 00:00:00'), 5: Timestamp('2012-10-02 14:20:00'), 22: Timestamp('2012-10-02 14:20:00'), 6: Timestamp('2012-10-03 11:20:00'), 23: Timestamp('2012-10-03 11:20:00'), 4: Timestamp('2012-10-02 00:00:00'), 21: Timestamp('2012-10-02 00:00:00'), 7: Timestamp('2012-10-05 12:20:00'), 24: Timestamp('2012-10-05 12:20:00'), 8: Timestamp('2012-10-06 12:00:00'), 25: Timestamp('2012-10-06 12:00:00'), 9: Timestamp('2012-10-07 11:00:00'), 26: Timestamp('2012-10-07 11:00:00'), 10: Timestamp('2012-10-15 00:00:00'), 27: Timestamp('2012-10-15 00:00:00'), 11: Timestamp('2012-10-16 00:00:00'), 28: Timestamp('2012-10-16 00:00:00'), 12: Timestamp('2012-10-19 14:00:00'), 29: Timestamp('2012-10-19 14:00:00'), 13: Timestamp('2012-10-20 13:20:00'), 30: Timestamp('2012-10-20 13:20:00'), 14: Timestamp('2012-10-26 00:00:00'), 31: Timestamp('2012-10-26 00:00:00'), 15: Timestamp('2012-10-27 00:00:00'), 32: Timestamp('2012-10-27 00:00:00'), 16: Timestamp('2012-10-29 17:15:00'), 33: Timestamp('2012-10-29 17:15:00')}, 
    'player_id': {1: 23640, 18: 17969, 2: 17268, 19: 11247, 0: 23640, 17: 17268, 3: 12202, 20: 23640, 5: 9241, 22: 23640, 6: 17268, 23: 23129, 4: 17268, 21: 8515, 7: 17268, 24: 9241, 8: 17268, 25: 940, 9: 6764, 26: 17268, 10: 23640, 27: 26418, 11: 23640, 28: 6764, 12: 14527, 29: 23640, 13: 469, 30: 17268, 14: 23640, 31: 11247, 15: 7507, 32: 23640, 16: 72, 33: 17268}, 
    'player__match_count__won': {1: 1, 18: 0, 2: 1, 19: 0, 0: 1, 17: 0, 3: 1, 20: 0, 5: 1, 22: 0, 6: 1, 23: 0, 4: 1, 21: 0, 7: 1, 24: 0, 8: 1, 25: 0, 9: 1, 26: 0, 10: 1, 27: 0, 11: 1, 28: 0, 12: 1, 29: 0, 13: 1, 30: 0, 14: 1, 31: 0, 15: 1, 32: 0, 16: 1, 33: 0}, 
}
df = pd.DataFrame(df)

>Solution :

You said

However, the DataFrame was already sorted by match_date in the original example

But after examining your data you provided at the end of your post, I found that are some 2012-09-25 dates coming after 2012-09-26 dates, and some 2012-10-02 dates coming after some 2012-10-03 dates.

Your data wasn’t sorted perfectly originally. So df.sort_values("match_date") makes it different, which is why you get different results.

Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading