I am trying to use GridSearchCV to select the best imputer strategy but I am having trouble doing that.
First, I have a data preparation pipeline for numerical and categorical columns-
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
num_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'),
OneHotEncoder(sparse=False, handle_unknown='ignore'))
preprocessing = ColumnTransformer([
("num", num_pipe, num_cols),
("cat", cat_pipe, cat_cols)
])
Next, I have created a pipeline to train a support vector machine model with feature selection.
from sklearn.feature_selection import SelectFromModel
model = Pipeline([
("preprocess", preprocessing),
("feature_select", SelectFromModel(RandomForestRegressor(random_state=42))),
("regressor", SVR(kernel='rbf', C=30000.0, gamma=0.3))
])
Now, I am trying to see which imputer strategy is best for imputing missing values for numerical columns using a GridSearchCV
grid = {"model.named_steps.preprocess.transformers[0][1].named_steps['simpleimputer'].strategy":
['mean','median','most_frequent']}
grid_search = GridSearchCV(model, param_grid = grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)
This is where I am getting the error. The full pipeline looks like this –
Pipeline(steps=[('preprocess',
ColumnTransformer(transformers=[('num',
Pipeline(steps=[('simpleimputer',
SimpleImputer(strategy='median')),
('standardscaler',
StandardScaler())]),
['longitude', 'latitude',
'housing_median_age',
'total_rooms',
'total_bedrooms',
'population', 'households',
'median_income']),
('cat',
Pipeline(steps=[('simpleimputer',
SimpleImputer(fill_value='NA',
strategy='constant')),
('onehotencoder',
OneHotEncoder(handle_unknown='ignore',
sparse=False))]),
['ocean_proximity'])])),
('feature_select',
SelectFromModel(estimator=RandomForestRegressor(random_state=42))),
('regressor', SVR(C=30000.0, gamma=0.3))])
Can anyone tell me what I need to change in the grid search to make it work?
>Solution :
The way you specify the parameter is via a dictionary that maps the name of the estimator/transformer and name of the parameter you want to change to the parameters you want to try. If you have a pipeline or a pipeline of pipelines, the name is the names of all its parents combined with a double underscore. So for your case, it looks like
gird = {
"preprocess__num__simpleimputer__strategy":['median']
}
simpleimputer is simply the name that was automatically assigned by make_pipeline.
However, I think there are other issues in your code like fill_value=’NA’ being incorrect and actually not needed as it is not the falues to be filled but the value needed to filling missing values.