Trouble changing imputer strategy in scikit-learn pipeline

I am trying to use GridSearchCV to select the best imputer strategy but I am having trouble doing that.

First, I have a data preparation pipeline for numerical and categorical columns:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline

num_pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipe = make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'), 
                         OneHotEncoder(sparse=False, handle_unknown='ignore'))

preprocessing = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

Next, I have created a pipeline to train a support vector machine model with feature selection.

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import SVR

model = Pipeline([
    ("preprocess", preprocessing),
    ("feature_select", SelectFromModel(RandomForestRegressor(random_state=42))),
    ("regressor", SVR(kernel='rbf', C=30000.0, gamma=0.3))
])

Now I am trying to find out which imputer strategy works best for the missing values in the numerical columns, using GridSearchCV:

grid = {"model.named_steps.preprocess.transformers[0][1].named_steps['simpleimputer'].strategy": 
        ['mean','median','most_frequent']}
grid_search = GridSearchCV(model, param_grid = grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

This is where I am getting the error. The full pipeline looks like this:

Pipeline(steps=[('preprocess',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['longitude', 'latitude',
                                                   'housing_median_age',
                                                   'total_rooms',
                                                   'total_bedrooms',
                                                   'population', 'households',
                                                   'median_income']),
                                                 ('cat',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='NA',
                                                                                 strategy='constant')),
                                                                  ('onehotencoder',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['ocean_proximity'])])),
                ('feature_select',
                 SelectFromModel(estimator=RandomForestRegressor(random_state=42))),
                ('regressor', SVR(C=30000.0, gamma=0.3))])

Can anyone tell me what I need to change in the grid search to make it work?

Solution:

You specify the parameter through a dictionary: each key names the parameter you want to change, and each value is the list of settings you want to try. The key is not a Python attribute path. For a pipeline, or a pipeline of pipelines, it is the names of all the parent steps joined with double underscores, followed by the parameter name. So for your case, it looks like

grid = {
    "preprocess__num__simpleimputer__strategy": ['mean', 'median', 'most_frequent']
}

simpleimputer is simply the step name that make_pipeline assigned automatically (the lowercased class name).
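If you are unsure what a key should look like, get_params() on the full pipeline lists every name GridSearchCV will accept. The following is only a minimal sketch of that check plus the grid search itself. It assumes synthetic data with made-up column names (x1, x2, kind) and swaps in LinearRegression for the SVR and feature-selection steps to keep it short; it is not the asker's dataset or model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (hypothetical columns, not the original dataset).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "x1": rng.normal(size=60),
    "x2": rng.normal(size=60),
    "kind": rng.choice(["a", "b", "c"], size=60),
})
y = 2.0 * X["x1"] + rng.normal(scale=0.1, size=60)
X.loc[::7, "x1"] = np.nan  # remove some numeric values so there is something to impute

num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipe = make_pipeline(OneHotEncoder(handle_unknown="ignore"))
preprocessing = ColumnTransformer([
    ("num", num_pipe, ["x1", "x2"]),
    ("cat", cat_pipe, ["kind"]),
])
model = Pipeline([
    ("preprocess", preprocessing),
    ("regressor", LinearRegression()),  # simple stand-in for the SVR step
])

# get_params() returns every tunable name, nested steps included.
strategy_keys = [k for k in model.get_params() if k.endswith("imputer__strategy")]
print(strategy_keys)

# Use one of those names as the grid key.
grid = {"preprocess__num__simpleimputer__strategy": ["mean", "median", "most_frequent"]}
grid_search = GridSearchCV(model, param_grid=grid, cv=3,
                           scoring="neg_mean_squared_error")
grid_search.fit(X, y)
print(grid_search.best_params_)
```

Had the steps been given explicit names through Pipeline([...]) instead of make_pipeline, those names would appear in the key in place of simpleimputer.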

However, I think there are other issues in your code. For example, fill_value='NA' may be incorrect and is probably not needed: fill_value is not the marker for the values to be filled (that is missing_values) but the constant that is written in place of the missing values.
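To make the distinction between the two parameters concrete, here is a small sketch on a made-up single-column array (the category labels are only illustrative). missing_values tells SimpleImputer what to look for, and fill_value is what it writes when strategy='constant'.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry.
X_cat = np.array([["INLAND"], [np.nan], ["NEAR BAY"]], dtype=object)

imputer = SimpleImputer(
    missing_values=np.nan,   # what counts as missing (this is the default)
    strategy="constant",
    fill_value="NA",         # what gets written in place of missing entries
)
filled = imputer.fit_transform(X_cat)
print(filled.ravel().tolist())
```

If the raw data already contains the literal string 'NA' as its missing marker, that would be expressed through missing_values rather than fill_value.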
