Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

Pandas: Rolling mean using only the last update based on another column

I would like to perform a rolling mean while the mean excludes duplicates found in another column. Let me provide an example dataframe:

Date            Warehose       Value
10-01-1998      London          10
13-01-1998      London          13
15-01-1998      New York        37
12-02-1998      London          21
20-02-1998      New York        39
21-02-1998      New York        17

In this example, let’s say I like to perform 30-day rolling mean of Value but taking into account only the last update of the Warehouse location. The resulting dataframe is expected to be:

 Date         Value     Rolling_Mean
02-01-1998      10           10
05-01-1998      13           13
15-01-1998      37           20
12-02-1998      21           29           
20-02-1998      39           30 
21-02-1998      17           19

Could you help me finding an efficient solution to this problem? The data I have is relatively big so as efficient as possible is appreciated.

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

Thanks in advance!

>Solution :

It’s a bit tricky. As rolling.apply works on Series only and you need both "Wharehose" and "Value" to perform the computation, you need to access the complete dataframe using a function (and a "global" variable, which is not super clean IMO):

df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df2 = df.set_index('Date')

def agg(s):
    return (df2.loc[s.index]
               .drop_duplicates(subset='Warehose', keep='last')
               ['Value'].mean()
           )

df['Rolling_Mean'] = (df.sort_values(by='Date')
                        .rolling('30d', on='Date')
                        ['Value']
                        .apply(agg, raw=False)
                      )

output:

        Date  Warehose  Value  Rolling_Mean
0 1998-01-10    London     10          10.0
1 1998-01-13    London     13          13.0
2 1998-01-15  New York     37          25.0
3 1998-02-12    London     21          29.0
4 1998-02-20  New York     39          30.0
5 1998-02-21  New York     17          19.0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading