I have a dataset of car attributes with missing values in some columns. In the Distance column for example, there are missing values and I want to replace them with the mean. There is a second column however, Car Type it shows whether the car is brand new or used. A brand new car would not have that many miles driven compared to a used car. I want to replace the NaN values in Distance with the mean of Distance values where the Car Type == 'Brand New'
Minimal setup:
df = pd.DataFrame({'Car type': ['New','Used','New','New','New','Used','New','New'],
'Distance':[20,2222,34,np.nan,np.nan,np.nan,50,10]})
print(df)
Car type Distance
0 New 20.0
1 Used 2222.0
2 New 34.0
3 New NaN
4 New NaN
5 Used NaN
6 New 50.0
7 New 10.0
>Solution :
Compute the mean for each Car Type and broadcast the values (with transform) to all rows then use fillna to replace NaN by the mean value:
df['Distance'] = (df['Distance'].fillna(df.groupby('Car type')['Distance']
.transform('mean')))
print(df)
# Output
Car type Distance
0 New 20.0
1 Used 2222.0
2 New 34.0
3 New 28.5 # mean of New car
4 New 28.5 # mean of New car
5 Used 2222.0 # mean of Used car
6 New 50.0
7 New 10.0