Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Rolling sum within 30 non-datetime days

I’ve been racking my brain trying to figure out the best way to do this. I want to find the rolling sum of the previous 30 days but my ‘day’ column is not in datetime format.

Sample data

df = pd.DataFrame({'client': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'], 
                   'day': [319, 323, 336, 352, 379, 424, 461, 486, 496, 499, 303, 334, 346, 373, 374, 395, 401, 408, 458, 492],
                   'foo': [5.0, 2.0, np.nan, np.nan, np.nan, np.nan, np.nan, 7.0, np.nan, np.nan, 8.0, 7.0, 22.0, np.nan, 13.0, np.nan, np.nan, 5.0, 11.0, np.nan]}

>>> df
   client  day   foo
0       A  319   5.0
1       A  323   2.0
2       A  336   NaN
3       A  352   NaN
4       A  379   NaN
5       A  424   NaN
6       A  461   NaN
7       A  486   7.0
8       A  496   NaN
9       A  499   NaN
10      B  303   8.0
11      B  334   7.0
12      B  346  22.0
13      B  373   NaN
14      B  374  13.0
15      B  395   NaN
16      B  401   NaN
17      B  408   5.0
18      B  458  11.0
19      B  492   NaN

I want a new column showing the rolling sum of ‘foo’ every 30 days.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

So far I’ve tried:

df['foo_30day'] = df.groupby('client').rolling(30, on='day', min_periods=1)['foo'].sum().values

But it looks like it’s taking the rolling sum of the last 30 rows.

I was also thinking of maybe changing the ‘day’ column to a datetime format, then using rolling('30D') but I’m not sure how or even if that’s the best approach. I’ve also tried to use a groupby reindex to stretch the ‘day’ column and doing a simple rolling(30) but it’s not working for me.

Any advice would be greatly appreciated.

>Solution :

You can convert the days to datetime then use window='30D'. The days will be converted to datetime, with a date starting from the first possible date in Pandas (1970-01-01).

You can optionally specify the starting date with an origin attribute in to_datetime, but for your example that wouldn’t matter.

df = df.sort_values("day").reset_index(drop=True)
df["day_datetime"] = pd.to_datetime(df["day"], unit="D")

df["foo_30day"] = (
    df.groupby("client")
    .rolling("30D", on="day_datetime", min_periods=1)["foo"]
    .sum()
    .values
)
   client  day   foo day_datetime  foo_30day
0       B  303   8.0   1970-10-31        5.0
1       A  319   5.0   1970-11-16        7.0
2       A  323   2.0   1970-11-20        7.0
3       B  334   7.0   1970-12-01        2.0
4       A  336   NaN   1970-12-03        NaN
5       B  346  22.0   1970-12-13        NaN
6       A  352   NaN   1970-12-19        NaN
7       B  373   NaN   1971-01-09        7.0
8       B  374  13.0   1971-01-10        7.0
9       A  379   NaN   1971-01-15        7.0
10      B  395   NaN   1971-01-31        8.0
11      B  401   NaN   1971-02-06        7.0
12      B  408   5.0   1971-02-13       29.0
13      A  424   NaN   1971-03-01       22.0
14      B  458  11.0   1971-04-04       35.0
15      A  461   NaN   1971-04-07       13.0
16      A  486   7.0   1971-05-02       13.0
17      B  492   NaN   1971-05-08        5.0
18      A  496   NaN   1971-05-12       11.0
19      A  499   NaN   1971-05-15        NaN
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading