Python – aggregate groupby multiple columns, skipping nan cells

I’m trying to aggregate a dataframe across multiple columns, grouped by Date. Some cells contain NaN, which I would like to keep. So my df looks like this:

import random
import numpy as np
import pandas as pd

rng = ['2015-02-24','2015-02-24','2015-02-24','2015-02-24','2015-02-24','2015-02-24',
       '2015-02-25','2015-02-25','2015-02-25','2015-02-25','2015-02-25','2015-02-25']
rng = pd.to_datetime(rng)
Predicted = [random.randrange(-1, 50, 1) for i in range(12)]
Actual_data = [5,3,8,2,9,20, np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]
Category = ['A','A','A','B','B','B','A','A','A','B','B','B']

df = pd.DataFrame({ 'Date': rng, 'Predicted' : Predicted, 'Actual': Actual_data, 'Category': Category})

df

          Date  Predicted  Actual Category
0   2015-02-24         0      5.0        A
1   2015-02-24         36     3.0        A
2   2015-02-24         30     8.0        A
3   2015-02-24         33     2.0        B
4   2015-02-24         49     9.0        B
5   2015-02-24         42    20.0        B
6   2015-02-25         25     NaN        A
7   2015-02-25          9     NaN        A
8   2015-02-25         21     NaN        A
9   2015-02-25         39     NaN        B
10  2015-02-25         17     NaN        B
11  2015-02-25         11     NaN        B

I want to group by Date, summing Predicted and Actual and leaving Category out. I also want the groups that contain only NaN to keep the NaN value, rather than have it turned into 0. So in the end I would like to have this:


        Date  Predicted  Actual
0 2015-02-24        190    47.0
1 2015-02-25        122     NaN

I tried things like:

df = df.groupby(['Date'])[['Predicted', 'Actual']].agg('sum').reset_index()

which looked like it worked, but then I saw that it turns NaN into 0, which I don’t want. I tried adding skipna=False, but it isn’t accepted there. I updated to the latest version of pandas, as I read somewhere that the version might be the problem, but it made no difference. Could someone help me?

>Solution :

Perhaps you could write a lambda function that returns the sum if any of the values are not null, and otherwise returns NaN:

df.groupby(['Date'])[['Predicted', 'Actual']].agg(lambda x: x.sum() if x.notna().any() else np.nan).reset_index()

Output

        Date  Predicted  Actual
0 2015-02-24        190    47.0
1 2015-02-25        122     NaN
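As an aside: pandas also has a built-in way to get this behaviour without a lambda. The `min_count` parameter of `GroupBy.sum` makes the sum return NaN when a group has fewer than `min_count` valid (non-NaN) values. A minimal sketch, using fixed Predicted values in place of the question's random ones so the output is reproducible:

```python
import numpy as np
import pandas as pd

# Same frame as in the question, with fixed Predicted values for reproducibility
df = pd.DataFrame({
    'Date': pd.to_datetime(['2015-02-24'] * 6 + ['2015-02-25'] * 6),
    'Predicted': [0, 36, 30, 33, 49, 42, 25, 9, 21, 39, 17, 11],
    'Actual': [5, 3, 8, 2, 9, 20] + [np.nan] * 6,
    'Category': ['A', 'A', 'A', 'B', 'B', 'B'] * 2,
})

# min_count=1 means: a group needs at least one non-NaN value,
# otherwise the sum for that group is NaN instead of 0
out = df.groupby('Date')[['Predicted', 'Actual']].sum(min_count=1).reset_index()
print(out)
#         Date  Predicted  Actual
# 0 2015-02-24        190    47.0
# 1 2015-02-25        122     NaN
```

This avoids iterating each group through a Python lambda, so it also tends to be faster on large frames.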