I am looking to calculate the percentage increase or decrease between the first and last non-NaN value for the following dataset:
| Year | Company | Data |
|---|---|---|
| 2019 | X | 341976.00 |
| 2020 | X | 1.000 |
| 2021 | X | 282872.00 |
| 2019 | Y | NaN |
| 2020 | Y | NaN |
| 2021 | Y | NaN |
| 2019 | Z | 4394.00 |
| 2020 | Z | 173.70 |
| 2021 | Z | 518478.00 |
As I want the relative change, I would expect the formula to be something like:
(last non-NaN value) / (first non-NaN value) - 1
This should return something like:
| Year | Company | Data | Change |
|---|---|---|---|
| 2019 | X | 341976.00 | NaN |
| 2020 | X | 1.000 | NaN |
| 2021 | X | 282872.00 | -0.17 |
| 2019 | Y | NaN | NaN |
| 2020 | Y | NaN | NaN |
| 2021 | Y | NaN | NaN |
| 2019 | Z | 4394.00 | NaN |
| 2020 | Z | 173.70 | NaN |
| 2021 | Z | 518478.00 | 117.00 |
I have tried to combine groupby on the Company field with first_valid_index, but I haven't had any luck finding a solution. What is the most efficient way of calculating the relative change as above?
>Solution :
The aggregations GroupBy.first and GroupBy.last omit missing values, so you can divide the two and subtract 1:
s = df.groupby('Company')['Data'].agg(['last','first']).eval('last / first').sub(1)
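To see what this intermediate Series holds, here is a minimal runnable sketch (the DataFrame construction just reproduces the question's table). For the all-NaN group Y, `first` and `last` both return NaN, so the division propagates NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Company': ['X'] * 3 + ['Y'] * 3 + ['Z'] * 3,
    'Data': [341976.0, 1.0, 282872.0,
             np.nan, np.nan, np.nan,
             4394.0, 173.7, 518478.0],
})

# 'first'/'last' skip NaN per group; eval divides the two agg columns
s = df.groupby('Company')['Data'].agg(['last', 'first']).eval('last / first').sub(1)
print(s)  # X ~ -0.172831, Y NaN, Z ~ 116.996814
```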
Then find the index of the last non-missing value per Company:
idx = df.dropna(subset=['Data']).drop_duplicates(['Company'], keep='last').index
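A quick check of what that index selects for the sample data (the DataFrame is rebuilt here from the question's table): `dropna` removes all of Y, and `keep='last'` keeps the final surviving row per Company, i.e. positions 2 and 8:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Company': ['X'] * 3 + ['Y'] * 3 + ['Z'] * 3,
    'Data': [341976.0, 1.0, 282872.0,
             np.nan, np.nan, np.nan,
             4394.0, 173.7, 518478.0],
})

# last non-NaN row per Company; Y vanishes because it has no valid rows
idx = df.dropna(subset=['Data']).drop_duplicates(['Company'], keep='last').index
print(idx.tolist())  # [2, 8]
```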
Finally, assign values to only those matched rows with Series.map:
df.loc[idx, 'Change'] = df.loc[idx, 'Company'].map(s)
print (df)
   Year Company      Data      Change
0  2019       X  341976.0         NaN
1  2020       X       1.0         NaN
2  2021       X  282872.0   -0.172831
3  2019       Y       NaN         NaN
4  2020       Y       NaN         NaN
5  2021       Y       NaN         NaN
6  2019       Z    4394.0         NaN
7  2020       Z     173.7         NaN
8  2021       Z  518478.0  116.996814
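Putting the three steps together as one self-contained script (the result column name `Change` is just a choice here; any unused column name works):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Company': ['X'] * 3 + ['Y'] * 3 + ['Z'] * 3,
    'Data': [341976.0, 1.0, 282872.0,
             np.nan, np.nan, np.nan,
             4394.0, 173.7, 518478.0],
})

# 1. relative change between last and first non-NaN value per Company
s = df.groupby('Company')['Data'].agg(['last', 'first']).eval('last / first').sub(1)

# 2. index of the last non-NaN row per Company
idx = df.dropna(subset=['Data']).drop_duplicates(['Company'], keep='last').index

# 3. write the change only onto those rows; all other rows stay NaN
df.loc[idx, 'Change'] = df.loc[idx, 'Company'].map(s)
print(df)
```

Only the last valid row of each Company gets a value, and Company Y (all NaN) never receives one.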