I have this dataframe:
date,qq,q_t
1956-01-01, 1, 4
1956-01-02, 2, 5
1956-01-03, 3, 1
1956-01-04, 4, 1
1956-01-05, 5, 1
1956-01-06, 6, 10
1956-01-07, 7, 11
1956-01-08, 8, 12
1956-01-09, 9, 5
1956-01-10, 10, 3
1956-01-11, 11, 3
1956-01-12, 12, 3
1956-01-13, 13, 50
1956-01-14, 14, 51
1956-01-15, 15, 52
1956-01-16, 16, 53
1956-01-17, 17, 1
1956-01-18, 18, 23
1956-01-19, 19, 1
As a first step I compute the difference between ‘qq’ and ‘qt’. Then I compute the number of consecutive days when the difference between ‘qq’ and ‘qt’ is positive as:
dfr = pd.read_csv('test.csv', sep=',',index_col=0,parse_dates=True)
dfr['diff'] = dfr['q_t'] - dfr['qq']
dfr['test'] = np.where(dfr['diff']>=0, True, False)
dfr['diff'] = np.where(dfr['diff']<0 , np.nan, dfr['diff'])
dfr['group'] = (dfr['test'] != dfr['test'].shift()).cumsum()
dfr.reset_index(inplace=True)
true_values = dfr[dfr['test']]
consecutive_days = true_values.groupby('group')['date'].agg(['min', 'max', 'count'])
As it could be noticed, these are the resulting data-frames:
dfr as:
date qq q_t diff test group
0 1956-01-01 1 4 3.0 True 1
1 1956-01-02 2 5 3.0 True 1
2 1956-01-03 3 1 False 2
3 1956-01-04 4 1 False 2
4 1956-01-05 5 1 False 2
5 1956-01-06 6 10 4.0 True 3
6 1956-01-07 7 11 4.0 True 3
7 1956-01-08 8 12 4.0 True 3
8 1956-01-09 9 5 False 4
9 1956-01-10 10 3 False 4
10 1956-01-11 11 3 False 4
11 1956-01-12 12 3 False 4
12 1956-01-13 13 50 37.0 True 5
13 1956-01-14 14 51 37.0 True 5
14 1956-01-15 15 52 37.0 True 5
15 1956-01-16 16 53 37.0 True 5
16 1956-01-17 17 1 False 6
17 1956-01-18 18 23 5.0 True 7
18 1956-01-19 19 1 False 8
and consecutive_days as
group min max count
1 1956-01-01 1956-01-02 2
3 1956-01-06 1956-01-08 3
5 1956-01-13 1956-01-16 4
7 1956-01-18 1956-01-18 1
I would like to have another column with the sum of all the True value with that belongs to the same group. In other words. i would like get the following result:
group min max count sum
1 1956-01-01 1956-01-02 2 6
3 1956-01-06 1956-01-08 3 12
5 1956-01-13 1956-01-16 4 148
7 1956-01-18 1956-01-18 1 5
My original idea was to work on dfr as
vol = dfr.groupby(by=['group']).sum(min_count=1)
This just before resetting the index. After that I could remove the lines with nan.
However, I would like to take advance of the first "group by". Is it possible to use two operation in the same "group by"?
>Solution :
You can do multiple aggregations in the first groupby operation
consecutive_days = (
true_values
.groupby('group')
.agg({'date': ['min', 'max', 'count'], 'diff': 'sum'})
.droplevel(0, axis=1)
)
min max count sum
group
1 1956-01-01 1956-01-02 2 6.0
3 1956-01-06 1956-01-08 3 12.0
5 1956-01-13 1956-01-16 4 148.0
7 1956-01-18 1956-01-18 1 5.0