Cumulative sum with offset pandas

April 25, 2024

I am working on developing a machine learning model to predict the score for a given team. I want to create a column tracking each team’s cumulative score for their home games up to but not including the current game (row). I can easily calculate the cumulative total, but I want to offset the cumulative total to show the cumulative up to but not including the current game below is an example of the dataset. I would ideally like to create the cumulative column

game_id	game_date	home_id	home_score	cumulative
718730	2023-04-03	145	3	0
718695	2023-04-05	145	7	3
718687	2023-04-06	145	6	10
718683	2023-04-06	109	2	0
718671	2023-04-07	109	6	2
718656	2023-04-08	109	12	8

The code below is what I have done so far to create a cumulative total

import pandas as pd

data = pd.read_csv('game_data.csv')
data['home_cumulative'] = data.groupby('home_id')['home_score'].cumsum()

I have attempted the below with results not matching what I would excpect

data['home_offset'] = data.groupby('home_id')['home_score'].shift(periods = 1).cumsum().fillna(0)

>Solution :

You can use:

data['cumulative'] = (data.groupby('home_id')['home_score']
                    .transform(lambda x: x.cumsum().shift(1))
                    .fillna(0)
                    .astype(int))

to give:

   game_id    game_date  home_id  home_score  cumulative
0   718730  2023-04-03       145           3         0
1   718695  2023-04-05       145           7         3
2   718687  2023-04-06       145           6        10
3   718683  2023-04-06       109           2         0
4   718671  2023-04-07       109           6         2
5   718656  2023-04-08       109          12         8