Given a time series data as follow, I need to fill NaNs in column value based on previous value and MoM_pct:
date value MoM_pct
0 2012-1-31 17222.1 0.0019
1 2012-2-29 NaN 0.0101 --> calculated by (1+0.0101) * 17222.1
2 2012-3-31 15650.2 0.0121
3 2012-4-30 15603.1 0.0093
4 2012-5-31 NaN 0.0111 --> calculated by (1+0.0111) * 15603.1
5 2012-6-30 NaN 0.0112 --> calculated by (1+0.0111) * 15603.1 * (1+0.0112)
6 2012-7-31 16314.9 0.0103
7 2012-8-31 16658.9 0.0111
8 2012-9-30 NaN 0.0132 --> calculated by (1+0.0132) * 16658.9
9 2012-10-31 NaN 0.0118 --> calculated by (1+0.0132) * 16658.9 * (1+0.0118)
10 2012-11-30 18476.7 0.0121
11 2012-12-31 20334.2 0.0123
12 2013-1-31 NaN 0.0014 --> calculated by (1+0.0014) * 20334.2
13 2013-2-28 NaN 0.0087 --> calculated by (1+0.0014) * 20334.2 * (1+0.0087)
The expected result will like this:
date value MoM_pct
0 2012-1-31 17222.10000 0.0019
1 2012-2-29 17396.04321 0.0101
2 2012-3-31 15650.20000 0.0121
3 2012-4-30 15603.10000 0.0093
4 2012-5-31 15776.29441 0.0111
5 2012-6-30 15952.98891 0.0112
6 2012-7-31 16314.90000 0.0103
7 2012-8-31 16658.90000 0.0111
8 2012-9-30 16878.79748 0.0132
9 2012-10-31 17077.96729 0.0118
10 2012-11-30 18476.70000 0.0121
11 2012-12-31 20334.20000 0.0123
12 2013-1-31 20362.66788 0.0014
13 2013-2-28 20539.82309 0.0087
14 2013-3-31 17641.20000 0.0150
The example data:
from numpy import nan
import pandas as pd
df = pd.DataFrame({
'date': {0: '2012-1-31',
1: '2012-2-29',
2: '2012-3-31',
3: '2012-4-30',
4: '2012-5-31',
5: '2012-6-30',
6: '2012-7-31',
7: '2012-8-31',
8: '2012-9-30',
9: '2012-10-31',
10: '2012-11-30',
11: '2012-12-31',
12: '2013-1-31',
13: '2013-2-28',
14: '2013-3-31'},
'value': {0: 17222.1,
1: nan,
2: 15650.2,
3: 15603.1,
4: nan,
5: nan,
6: 16314.9,
7: 16658.9,
8: nan,
9: nan,
10: 18476.7,
11: 20334.2,
12: nan,
13: nan,
14: 17641.2},
'MoM_pct': {0: 0.0019,
1: 0.0101,
2: 0.0121,
3: 0.0093,
4: 0.0111,
5: 0.0112,
6: 0.0103,
7: 0.0111,
8: 0.0132,
9: 0.0118,
10: 0.0121,
11: 0.0123,
12: 0.0014,
13: 0.0087,
14: 0.015}
})
My code below works for the example data, but it’s not suitable if sample has more than two consecutive values are NaNs, also not concise:
df['value1'] = df['value'].shift(1)*df['MoM_pct'].add(1)
df['value2'] = df['value1'].shift(1)*df['MoM_pct'].add(1)
df['value'].fillna(df['value1'], inplace=True)
df['value'].fillna(df['value2'], inplace=True)
How could I improve my code above? Thanks.
>Solution :
We can create the multipliers from the MoM_pct column with groupby cumprod then fillna the valid values with the multiplicative identity 1 and multiply with the ffilled value column:
m = df['value'].isna()
df['value'] = (
df['value'].ffill() *
(1 + df['MoM_pct']).groupby((~m).cumsum()[m]).cumprod().fillna(1)
)
df:
date value MoM_pct
0 2012-1-31 17222.100000 0.0019
1 2012-2-29 17396.043210 0.0101
2 2012-3-31 15650.200000 0.0121
3 2012-4-30 15603.100000 0.0093
4 2012-5-31 15776.294410 0.0111
5 2012-6-30 15952.988907 0.0112
6 2012-7-31 16314.900000 0.0103
7 2012-8-31 16658.900000 0.0111
8 2012-9-30 16878.797480 0.0132
9 2012-10-31 17077.967290 0.0118
10 2012-11-30 18476.700000 0.0121
11 2012-12-31 20334.200000 0.0123
12 2013-1-31 20362.667880 0.0014
13 2013-2-28 20539.823091 0.0087
14 2013-3-31 17641.200000 0.0150
Here is the breakdown of steps as a DataFrame:
m = df['value'].isna()
print(pd.DataFrame({
'm': m,
'groups': (~m).cumsum(),
'filtered groups': (~m).cumsum()[m],
'result of cumprod': (1 + df['MoM_pct']).groupby(
(~m).cumsum()[m]
).cumprod(),
'fill missing with 1': (1 + df['MoM_pct']).groupby(
(~m).cumsum()[m]
).cumprod().fillna(1)
}))
m groups filtered groups result of cumprod fill missing with 1
0 False 1 NaN NaN 1.000000
1 True 1 1.0 1.010100 1.010100
2 False 2 NaN NaN 1.000000
3 False 3 NaN NaN 1.000000
4 True 3 3.0 1.011100 1.011100
5 True 3 3.0 1.022424 1.022424
6 False 4 NaN NaN 1.000000
7 False 5 NaN NaN 1.000000
8 True 5 5.0 1.013200 1.013200
9 True 5 5.0 1.025156 1.025156
10 False 6 NaN NaN 1.000000
11 False 7 NaN NaN 1.000000
12 True 7 7.0 1.001400 1.001400
13 True 7 7.0 1.010112 1.010112
14 False 8 NaN NaN 1.000000