I am working on developing a machine learning model to predict the score for a given team. I want to create a column tracking each team’s cumulative score for their home games up to but not including the current game (row). I can easily calculate the cumulative total, but I want to offset the cumulative total to show the cumulative up to but not including the current game below is an example of the dataset. I would ideally like to create the cumulative column
| game_id | game_date | home_id | home_score | cumulative |
|---|---|---|---|---|
| 718730 | 2023-04-03 | 145 | 3 | 0 |
| 718695 | 2023-04-05 | 145 | 7 | 3 |
| 718687 | 2023-04-06 | 145 | 6 | 10 |
| 718683 | 2023-04-06 | 109 | 2 | 0 |
| 718671 | 2023-04-07 | 109 | 6 | 2 |
| 718656 | 2023-04-08 | 109 | 12 | 8 |
The code below is what I have done so far to create a cumulative total
import pandas as pd
data = pd.read_csv('game_data.csv')
data['home_cumulative'] = data.groupby('home_id')['home_score'].cumsum()
I have attempted the below with results not matching what I would excpect
data['home_offset'] = data.groupby('home_id')['home_score'].shift(periods = 1).cumsum().fillna(0)
>Solution :
You can use:
data['cumulative'] = (data.groupby('home_id')['home_score']
.transform(lambda x: x.cumsum().shift(1))
.fillna(0)
.astype(int))
to give:
game_id game_date home_id home_score cumulative
0 718730 2023-04-03 145 3 0
1 718695 2023-04-05 145 7 3
2 718687 2023-04-06 145 6 10
3 718683 2023-04-06 109 2 0
4 718671 2023-04-07 109 6 2
5 718656 2023-04-08 109 12 8