I’m trying to calculate the standard deviation of values from other columns, but also based on another column value, like this:
Consider this sample dataframe:
| product | january | february | march | april | sales_months |
|---|---|---|---|---|---|
| prod1 | 3 | 6 | 7 | 1 | 4 |
| prod2 | 0 | 0 | 5 | 14 | 2 |
| prod3 | 2 | 4 | 67 | 1 | 4 |
The sales_months column indicates how many months of sales the product has had. So, if a product has only 2 months of sales (because it’s a new product, as is the case with the second row) we only use the data for the last 2 months to get the standard deviation. If it has a 4 in the sales_months column, it indicates it has been sold for the whole 4 month period, so we can use the data of all 4 columns to get the standar deviation.
So, I would like to get something like this:
| product | january | february | march | april | sales_months | std_dev |
|---|---|---|---|---|---|---|
| prod1 | 3 | 6 | 7 | 1 | 4 | 2.38 |
| prod2 | 0 | 0 | 5 | 14 | 2 | 4.5 |
| prod3 | 2 | 4 | 67 | 1 | 4 | 28.02 |
How can I do this in for all rows of the dataframe. I tried building a function and applying it the dataframe, with my limited pandas knowledge, but I just caused the kernel to die:
def get_std_dev(row):
std_dev = 0.0
months = row.SALES_MONTHS
if months < 4:
m_tmp = int(months)
std_dev = df_co_info.iloc[:, 1:m_tmp].std(axis=1)
else:
std_dev = df_co_info.iloc[:, 1:4].std(axis=1)
return std_dev
df_co_info['stdDev'] = df_co_info.apply(get_std_dev, axis = 1)
>Solution :
Using numpy and a mask:
df2 = df.drop(columns=['product', 'sales_months'])
a = df2.to_numpy()
mask = np.arange(a.shape[1]) >= a.shape[1]-df['sales_months'].to_numpy()[:,None]
df['std_dev'] = df2.where(mask).std(axis=1)
output:
product january february march april sales_months std_dev
0 prod1 3 6 7 1 4 2.753785
1 prod2 0 0 5 14 2 6.363961
2 prod3 2 4 67 1 4 32.357379
masked df2:
january february march april
0 3.0 6.0 7 1
1 NaN NaN 5 14
2 2.0 4.0 67 1