Home Select certain columns, fill missing values in these columns with the average of columns with not-na values

Questions

Select certain columns, fill missing values in these columns with the average of columns with not-na values

August 26, 2022

Given a dataframe df as follows:

         date      value        pred1        pred2        pred3
0   2021-9-19  3613.9663          NaN          NaN          NaN
1   2021-9-26  3613.0673          NaN          NaN          NaN
2   2021-10-3  3568.1668          NaN          NaN          NaN
3  2021-10-10  3592.1666  3631.411799          NaN          NaN
4  2021-10-17  3572.3662  3637.792491  3677.656329          NaN
5  2021-10-24  3582.6036  3678.800911  3707.926324  3712.669694
6  2021-10-31  3547.3361  3731.336899  3735.695071  3733.021293
7   2021-11-7  3491.5677  3749.422639  3743.507496          NaN
8  2021-11-14  3539.1002  3737.851787          NaN          NaN
9  2021-11-21  3560.3734          NaN          NaN          NaN

Suppose for columns pred1, pred2 and pred3, except that all these columns are NaNs, if it has one non-null value, I will populate the other columns with that value, and if it has two or more non-null values, I will use the average of non-null values to fill null values of other columns.

For example, for the row with date 2021-10-10, I will populate pred2 and pred3 with the value of pred1; for the row with date 2021-10-17, I will populate pred3 with the mean of pred1 and pred2 on that day; The same logic will apply to 2021-11-7 and 2021-11-14.

The expected result:

         date      value        pred1        pred2        pred3
0   2021-9-19  3613.9663          NaN          NaN          NaN
1   2021-9-26  3613.0673          NaN          NaN          NaN
2   2021-10-3  3568.1668          NaN          NaN          NaN
3  2021-10-10  3592.1666  3631.411799  3631.411799  3631.411799
4  2021-10-17  3572.3662  3637.792491  3677.656329  3657.724410
5  2021-10-24  3582.6036  3678.800911  3707.926324  3712.669694
6  2021-10-31  3547.3361  3731.336899  3735.695071  3733.021293
7   2021-11-7  3491.5677  3749.422639  3743.507496     2.957571
8  2021-11-14  3539.1002  3737.851787  3737.851787  3737.851787
9  2021-11-21  3560.3734          NaN          NaN          NaN

Anyone could help to deal this issue? Thanks.

>Solution :

df.iloc[:,2:].apply(lambda x: x.fillna(x.mean()), axis = 1)
 
         pred1        pred2        pred3
0          NaN          NaN          NaN
1          NaN          NaN          NaN
2          NaN          NaN          NaN
3  3631.411799  3631.411799  3631.411799
4  3637.792491  3677.656329  3657.724410
5  3678.800911  3707.926324  3712.669694
6  3731.336899  3735.695071  3733.021293
7  3749.422639  3743.507496  3746.465068
8  3737.851787  3737.851787  3737.851787
9          NaN          NaN          NaN