Rolling sum within 30 non-datetime days

March 22, 2024

I’ve been racking my brain trying to figure out the best way to do this. I want to find the rolling sum of the previous 30 days but my ‘day’ column is not in datetime format.

Sample data

df = pd.DataFrame({'client': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'], 
                   'day': [319, 323, 336, 352, 379, 424, 461, 486, 496, 499, 303, 334, 346, 373, 374, 395, 401, 408, 458, 492],
                   'foo': [5.0, 2.0, np.nan, np.nan, np.nan, np.nan, np.nan, 7.0, np.nan, np.nan, 8.0, 7.0, 22.0, np.nan, 13.0, np.nan, np.nan, 5.0, 11.0, np.nan]}

>>> df
   client  day   foo
0       A  319   5.0
1       A  323   2.0
2       A  336   NaN
3       A  352   NaN
4       A  379   NaN
5       A  424   NaN
6       A  461   NaN
7       A  486   7.0
8       A  496   NaN
9       A  499   NaN
10      B  303   8.0
11      B  334   7.0
12      B  346  22.0
13      B  373   NaN
14      B  374  13.0
15      B  395   NaN
16      B  401   NaN
17      B  408   5.0
18      B  458  11.0
19      B  492   NaN

I want a new column showing the rolling sum of ‘foo’ every 30 days.

So far I’ve tried:

df['foo_30day'] = df.groupby('client').rolling(30, on='day', min_periods=1)['foo'].sum().values

But it looks like it’s taking the rolling sum of the last 30 rows.

I was also thinking of maybe changing the ‘day’ column to a datetime format, then using rolling('30D') but I’m not sure how or even if that’s the best approach. I’ve also tried to use a groupby reindex to stretch the ‘day’ column and doing a simple rolling(30) but it’s not working for me.

Any advice would be greatly appreciated.

>Solution :

You can convert the days to datetime then use window='30D'. The days will be converted to datetime, with a date starting from the first possible date in Pandas (1970-01-01).

You can optionally specify the starting date with an origin attribute in to_datetime, but for your example that wouldn’t matter.

df = df.sort_values("day").reset_index(drop=True)
df["day_datetime"] = pd.to_datetime(df["day"], unit="D")

df["foo_30day"] = (
    df.groupby("client")
    .rolling("30D", on="day_datetime", min_periods=1)["foo"]
    .sum()
    .values
)

   client  day   foo day_datetime  foo_30day
0       B  303   8.0   1970-10-31        5.0
1       A  319   5.0   1970-11-16        7.0
2       A  323   2.0   1970-11-20        7.0
3       B  334   7.0   1970-12-01        2.0
4       A  336   NaN   1970-12-03        NaN
5       B  346  22.0   1970-12-13        NaN
6       A  352   NaN   1970-12-19        NaN
7       B  373   NaN   1971-01-09        7.0
8       B  374  13.0   1971-01-10        7.0
9       A  379   NaN   1971-01-15        7.0
10      B  395   NaN   1971-01-31        8.0
11      B  401   NaN   1971-02-06        7.0
12      B  408   5.0   1971-02-13       29.0
13      A  424   NaN   1971-03-01       22.0
14      B  458  11.0   1971-04-04       35.0
15      A  461   NaN   1971-04-07       13.0
16      A  486   7.0   1971-05-02       13.0
17      B  492   NaN   1971-05-08        5.0
18      A  496   NaN   1971-05-12       11.0
19      A  499   NaN   1971-05-15        NaN