Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

group by with two operation

I have this dataframe:

date,qq,q_t
1956-01-01,  1,   4
1956-01-02,  2,   5
1956-01-03,  3,   1
1956-01-04,  4,   1
1956-01-05,  5,   1
1956-01-06,  6,   10
1956-01-07,  7,   11
1956-01-08,  8,   12
1956-01-09,  9,   5
1956-01-10,  10,  3
1956-01-11,  11,  3
1956-01-12,  12,  3
1956-01-13,  13,  50
1956-01-14,  14,  51
1956-01-15,  15,  52
1956-01-16,  16,  53
1956-01-17,  17,  1
1956-01-18,  18,  23
1956-01-19,  19,  1

As a first step I compute the difference between ‘qq’ and ‘qt’. Then I compute the number of consecutive days when the difference between ‘qq’ and ‘qt’ is positive as:

dfr   = pd.read_csv('test.csv', sep=',',index_col=0,parse_dates=True)


dfr['diff'] = dfr['q_t'] - dfr['qq']

dfr['test'] = np.where(dfr['diff']>=0, True, False)



dfr['diff'] = np.where(dfr['diff']<0 , np.nan, dfr['diff'])

dfr['group'] = (dfr['test'] != dfr['test'].shift()).cumsum()

dfr.reset_index(inplace=True)

true_values = dfr[dfr['test']]

consecutive_days = true_values.groupby('group')['date'].agg(['min', 'max', 'count'])

As it could be noticed, these are the resulting data-frames:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

dfr as:

    date    qq  q_t diff    test    group
0   1956-01-01  1   4   3.0 True    1
1   1956-01-02  2   5   3.0 True    1
2   1956-01-03  3   1       False   2
3   1956-01-04  4   1       False   2
4   1956-01-05  5   1       False   2
5   1956-01-06  6   10  4.0 True    3
6   1956-01-07  7   11  4.0 True    3
7   1956-01-08  8   12  4.0 True    3
8   1956-01-09  9   5       False   4
9   1956-01-10  10  3       False   4
10  1956-01-11  11  3       False   4
11  1956-01-12  12  3       False   4
12  1956-01-13  13  50  37.0    True    5
13  1956-01-14  14  51  37.0    True    5
14  1956-01-15  15  52  37.0    True    5
15  1956-01-16  16  53  37.0    True    5
16  1956-01-17  17  1       False   6
17  1956-01-18  18  23  5.0 True    7
18  1956-01-19  19  1       False   8

and consecutive_days as

group   min max count
1   1956-01-01  1956-01-02  2
3   1956-01-06  1956-01-08  3
5   1956-01-13  1956-01-16  4
7   1956-01-18  1956-01-18  1

I would like to have another column with the sum of all the True value with that belongs to the same group. In other words. i would like get the following result:

group   min max count  sum
1   1956-01-01  1956-01-02  2    6
3   1956-01-06  1956-01-08  3   12
5   1956-01-13  1956-01-16  4  148
7   1956-01-18  1956-01-18  1    5

My original idea was to work on dfr as

vol = dfr.groupby(by=['group']).sum(min_count=1)

This just before resetting the index. After that I could remove the lines with nan.

However, I would like to take advance of the first "group by". Is it possible to use two operation in the same "group by"?

>Solution :

You can do multiple aggregations in the first groupby operation

consecutive_days = (
    true_values
    .groupby('group')
    .agg({'date': ['min', 'max', 'count'], 'diff': 'sum'})
    .droplevel(0, axis=1)
)

              min         max  count    sum
group                                      
1      1956-01-01  1956-01-02      2    6.0
3      1956-01-06  1956-01-08      3   12.0
5      1956-01-13  1956-01-16      4  148.0
7      1956-01-18  1956-01-18      1    5.0
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading