Pandas groupby transform mean with date before current row for huge dataframe

I have a Pandas dataframe that looks like this:

df = pd.DataFrame([['John', 'A', '1/1/2017', '10'],
                   ['John', 'A', '2/2/2017', '15'],
                   ['John', 'A', '2/2/2017', '20'],
                   ['John', 'A', '3/3/2017', '30'],
                   ['Sue', 'B', '1/1/2017', '10'],
                   ['Sue', 'B', '2/2/2017', '15'],
                   ['Sue', 'B', '3/2/2017', '20'],
                   ['Sue', 'B', '3/3/2017', '7'],
                   ['Sue', 'B', '4/4/2017', '20']],
                  columns=['Customer', 'Group', 'Deposit_Date', 'DPD'])

I want to create a new column called PreviousMean. This column is the year-to-date average of DPD for that customer and group, i.e. it includes all DPDs up to, but not including, rows that match the current deposit date. If no previous records exist, the value is null or 0.

So the desired outcome looks like

  Customer  Group  Deposit_Date  DPD  PreviousMean
0     John      A    2017-01-01   10           NaN
1     John      A    2017-02-02   15          10.0
2     John      A    2017-02-02   20          10.0
3     John      A    2017-03-03   30          15.0
4      Sue      B    2017-01-01   10           NaN
5      Sue      B    2017-02-02   15          10.0
6      Sue      B    2017-03-02   20          12.5
7      Sue      B    2017-03-03    7          15.0
8      Sue      B    2017-04-04   20          13.0

After some research on this site and elsewhere, I found one solution:

df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) &
                 (df.Group == x.Group) &
                 (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
    axis=1)

It works fine. However, my actual dataframe is much larger (~1 million rows), and the above code is very slow because it filters the whole dataframe once per row.

I have asked a similar question before: Pandas groupby transform mean with date before current row for huge huge dataframe

This time, however, the groupby is done on two columns, so those solutions do not work as written, and I could not generalize them.
Is there a better way to do it? Thanks

>Solution :

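The linked solution still works; you only need to pass every grouping column to groupby and then drop the same levels with droplevel. Note that it relies on the rows being sorted by Deposit_Date within each group, as they are in the sample, and that the include_groups argument requires pandas 2.2 or later: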

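# Parse the dates and make DPD numeric (the sample stores both as strings)
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])
df['DPD'] = pd.to_numeric(df['DPD'])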

groups = ['Customer', 'Group']

df['PreviousMean'] = (df.groupby(groups)
                        .apply(lambda s: s['DPD'].expanding().mean().shift()
                                                 .mask(s['Deposit_Date'].duplicated())
                                                 .ffill(),
                               include_groups=False)
                        .droplevel(groups)
                     )
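To see why the chain inside the lambda gives the mean of strictly earlier dates, it can help to run it step by step on a single group. The following is a minimal sketch using John's four rows from the question; the intermediate variable names (running, previous, masked, result) are only for illustration and do not appear in the original answer.

```python
import pandas as pd

# John's rows from the question, with parsed dates and numeric DPD
s = pd.DataFrame({
    "Deposit_Date": pd.to_datetime(
        ["1/1/2017", "2/2/2017", "2/2/2017", "3/3/2017"], format="%m/%d/%Y"
    ),
    "DPD": [10, 15, 20, 30],
})

# Mean of all rows up to and including the current one
running = s["DPD"].expanding().mean()
# Shift down one row so each row sees the mean of the rows before it
previous = running.shift()
# A repeated date would otherwise include the earlier row of the same date,
# so blank out every row whose date has already appeared
masked = previous.mask(s["Deposit_Date"].duplicated())
# Reuse the value computed for the first row of that date
result = masked.ffill()

print(pd.concat([s, result.rename("PreviousMean")], axis=1))
```

The mask and ffill steps are what handle several deposits on the same date, and they are also why the rows need to be in date order within each group.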

Output of the groupby solution on the sample data:

  Customer Group Deposit_Date  DPD  PreviousMean
0     John     A   2017-01-01   10           NaN
1     John     A   2017-02-02   15          10.0
2     John     A   2017-02-02   20          10.0
3     John     A   2017-03-03   30          15.0
4      Sue     B   2017-01-01   10           NaN
5      Sue     B   2017-02-02   15          10.0
6      Sue     B   2017-03-02   20          12.5
7      Sue     B   2017-03-03    7          15.0
8      Sue     B   2017-04-04   20          13.0
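If groupby.apply is still too slow on around a million rows, one possible alternative is to avoid the Python-level lambda entirely: aggregate to one row per (group, date), take cumulative sums within each group, subtract the current date's own contribution, and join the result back. This is a sketch under assumptions rather than part of the original answer; the helper name previous_mean and its parameters are hypothetical, and its speed has not been measured here. It does not require the input to be sorted.

```python
import pandas as pd

def previous_mean(df, groups, date_col="Deposit_Date", value_col="DPD"):
    """Mean of value_col over strictly earlier dates per group (hypothetical helper)."""
    keys = groups + [date_col]
    # One row per (group, date): total and number of values recorded on that date
    daily = df.groupby(keys)[value_col].agg(["sum", "count"])
    # Running totals within each group; groupby has already sorted the keys
    running = daily.groupby(level=groups).cumsum()
    # Remove the current date's own contribution to keep only earlier dates
    prev_sum = running["sum"] - daily["sum"]
    prev_count = running["count"] - daily["count"]
    # 0 / 0 becomes NaN, which marks "no earlier record"
    prev_mean = (prev_sum / prev_count).rename("PreviousMean")
    # Broadcast the per-date value back onto the original rows
    return df.join(prev_mean, on=keys)

df = pd.DataFrame([['John', 'A', '1/1/2017', 10],
                   ['John', 'A', '2/2/2017', 15],
                   ['John', 'A', '2/2/2017', 20],
                   ['John', 'A', '3/3/2017', 30],
                   ['Sue', 'B', '1/1/2017', 10],
                   ['Sue', 'B', '2/2/2017', 15],
                   ['Sue', 'B', '3/2/2017', 20],
                   ['Sue', 'B', '3/3/2017', 7],
                   ['Sue', 'B', '4/4/2017', 20]],
                  columns=['Customer', 'Group', 'Deposit_Date', 'DPD'])
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'], format='%m/%d/%Y')

out = previous_mean(df, ['Customer', 'Group'])
print(out)
```

On the sample data this yields the same PreviousMean values as the table above. The original row order and index are preserved because the result is joined back onto df rather than rebuilt.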