I am looking to calculate the percentage increase or decrease between the first and last non-NaN value for the following dataset:
| Year | Company | Data |
|---|---|---|
| 2019 | X | 341976.00 |
| 2020 | X | 1.000 |
| 2021 | X | 282872.00 |
| 2019 | Y | NaN |
| 2020 | Y | NaN |
| 2021 | Y | NaN |
| 2019 | Z | 4394.00 |
| 2020 | Z | 173.70 |
| 2021 | Z | 518478.00 |
As I want the relative change, I would expect the formula to be something like:
(last non-NaN value) / (first non-NaN value) - 1
This should return something like:
| Year | Company | Data | Change |
|---|---|---|---|
| 2019 | X | 341976.00 | NaN |
| 2020 | X | 1.000 | NaN |
| 2021 | X | 282872.00 | -0.17 |
| 2019 | Y | NaN | NaN |
| 2020 | Y | NaN | NaN |
| 2021 | Y | NaN | NaN |
| 2019 | Z | 4394.00 | NaN |
| 2020 | Z | 173.70 | NaN |
| 2021 | Z | 518478.00 | 117.00 |
I have tried to combine groupby on the Company field with first_valid_index, but I haven't had any luck finding a solution. What is the most efficient way of calculating the relative change as above?
>Solution :
The aggregations GroupBy.first and GroupBy.last omit missing values, so you can divide the two and subtract 1:
s = df.groupby('Company')['Data'].agg(['last','first']).eval('last / first').sub(1)
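To see what this intermediate Series holds, here is a minimal runnable sketch (the DataFrame construction just reproduces the question's table). For the all-NaN group Y, `first` and `last` both return NaN, so the division propagates NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Company': ['X'] * 3 + ['Y'] * 3 + ['Z'] * 3,
    'Data': [341976.0, 1.0, 282872.0,
             np.nan, np.nan, np.nan,
             4394.0, 173.7, 518478.0],
})

# 'first'/'last' skip NaN per group; eval divides the two agg columns
s = df.groupby('Company')['Data'].agg(['last', 'first']).eval('last / first').sub(1)
print(s)  # X ~ -0.172831, Y NaN, Z ~ 116.996814
```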
Then find the index of the last non-missing value per Company:
idx = df.dropna(subset=['Data']).drop_duplicates(['Company'], keep='last').index
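A quick check of what that index selects for the sample data (the DataFrame is rebuilt here from the question's table): `dropna` removes all of Y, and `keep='last'` keeps the final surviving row per Company, i.e. positions 2 and 8:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Company': ['X'] * 3 + ['Y'] * 3 + ['Z'] * 3,
    'Data': [341976.0, 1.0, 282872.0,
             np.nan, np.nan, np.nan,
             4394.0, 173.7, 518478.0],
})

# last non-NaN row per Company; Y vanishes because it has no valid rows
idx = df.dropna(subset=['Data']).drop_duplicates(['Company'], keep='last').index
print(idx.tolist())  # [2, 8]
```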
Finally, assign values to only those matched rows with Series.map:
df.loc[idx, 'Change'] = df.loc[idx, 'Company'].map(s)
print (df)
   Year Company      Data      Change
0  2019       X  341976.0         NaN
1  2020       X       1.0         NaN
2  2021       X  282872.0   -0.172831
3  2019       Y       NaN         NaN
4  2020       Y       NaN         NaN
5  2021       Y       NaN         NaN
6  2019       Z    4394.0         NaN
7  2020       Z     173.7         NaN
8  2021       Z  518478.0  116.996814
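Putting the three steps together as one self-contained script (the result column name `Change` is just a choice here; any unused column name works):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Company': ['X'] * 3 + ['Y'] * 3 + ['Z'] * 3,
    'Data': [341976.0, 1.0, 282872.0,
             np.nan, np.nan, np.nan,
             4394.0, 173.7, 518478.0],
})

# 1. relative change between last and first non-NaN value per Company
s = df.groupby('Company')['Data'].agg(['last', 'first']).eval('last / first').sub(1)

# 2. index of the last non-NaN row per Company
idx = df.dropna(subset=['Data']).drop_duplicates(['Company'], keep='last').index

# 3. write the change only onto those rows; all other rows stay NaN
df.loc[idx, 'Change'] = df.loc[idx, 'Company'].map(s)
print(df)
```

Only the last valid row of each Company gets a value, and Company Y (all NaN) never receives one.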