Follow

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use
Contact

How to calculate new column based on values from multiple columns

I’m trying to calculate the standard deviation of values from other columns, but also based on another column value, like this:

Consider this sample dataframe:

product january february march april sales_months
prod1 3 6 7 1 4
prod2 0 0 5 14 2
prod3 2 4 67 1 4

The sales_months column indicates how many months of sales the product has had. So, if a product has only 2 months of sales (because it’s a new product, as is the case with the second row) we only use the data for the last 2 months to get the standard deviation. If it has a 4 in the sales_months column, it indicates it has been sold for the whole 4 month period, so we can use the data of all 4 columns to get the standar deviation.
So, I would like to get something like this:

MEDevel.com: Open-source for Healthcare and Education

Collecting and validating open-source software for healthcare, education, enterprise, development, medical imaging, medical records, and digital pathology.

Visit Medevel

product january february march april sales_months std_dev
prod1 3 6 7 1 4 2.38
prod2 0 0 5 14 2 4.5
prod3 2 4 67 1 4 28.02

How can I do this in for all rows of the dataframe. I tried building a function and applying it the dataframe, with my limited pandas knowledge, but I just caused the kernel to die:

def get_std_dev(row):
    std_dev = 0.0
    months = row.SALES_MONTHS
    if months < 4:
        m_tmp = int(months)
        std_dev = df_co_info.iloc[:, 1:m_tmp].std(axis=1)
    else:
        std_dev = df_co_info.iloc[:, 1:4].std(axis=1)
    return std_dev

df_co_info['stdDev'] = df_co_info.apply(get_std_dev, axis = 1)

>Solution :

Using numpy and a mask:

df2 = df.drop(columns=['product', 'sales_months'])
a = df2.to_numpy()
mask = np.arange(a.shape[1]) >= a.shape[1]-df['sales_months'].to_numpy()[:,None]

df['std_dev'] = df2.where(mask).std(axis=1)

output:

  product  january  february  march  april  sales_months    std_dev
0   prod1        3         6      7      1             4   2.753785
1   prod2        0         0      5     14             2   6.363961
2   prod3        2         4     67      1             4  32.357379

masked df2:

   january  february  march  april
0      3.0       6.0      7      1
1      NaN       NaN      5     14
2      2.0       4.0     67      1
Add a comment

Leave a Reply

Keep Up to Date with the Most Important News

By pressing the Subscribe button, you confirm that you have read and are agreeing to our Privacy Policy and Terms of Use

Discover more from Dev solutions

Subscribe now to keep reading and get access to the full archive.

Continue reading