With a large dataset, it takes me about a day to run GridSearchCV() to train an SVM with the best parameters. How can I save the best estimator so that I can use the trained model directly the next time I start my computer?
GridSearchCV only keeps the best model instance if refitting is enabled; otherwise it only reports the parameter set that led to the highest score. Refitting is controlled by the refit parameter, which defaults to True in scikit-learn, so a single-metric search already refits. If you are using multiple metrics, you have to set refit=name-of-your-decider-metric yourself. Refitting runs a final training step using the full dataset and the best parameters found. This extra step is needed because, while searching for the optimal parameters, GridSearchCV does not train on the entire dataset: each candidate is fitted on only part of the data so that a hold-out validation set can be split off.
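To see what the refit setting changes in practice, here is a minimal sketch. The iris dataset, the small parameter grid, cv=3, and f1_macro as the decider metric are assumptions chosen only to keep the example quick; none of them come from the question.

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

# Assumed toy data and grid, only to keep the run short
X, y = datasets.load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# With refit=False only the search results are kept, no fitted model
search_no_refit = GridSearchCV(svm.SVC(), param_grid, cv=3, refit=False)
search_no_refit.fit(X, y)
has_params = hasattr(search_no_refit, "best_params_")
has_model_without_refit = hasattr(search_no_refit, "best_estimator_")

# With several metrics, refit must name the metric that picks the winner
search_multi = GridSearchCV(
    svm.SVC(),
    param_grid,
    cv=3,
    scoring=["accuracy", "f1_macro"],
    refit="f1_macro",
)
search_multi.fit(X, y)
has_model_with_refit = hasattr(search_multi, "best_estimator_")

print(has_params, has_model_without_refit, has_model_with_refit)
```

The first search still tells you which parameters won, but there is no fitted model to save; the second one retrains on all of X and y with the parameters that scored best on f1_macro.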
Now, when you do that, you can get the model via the best_estimator_ attribute. With that in hand, you can pickle the model using joblib and reload it the next day to make your predictions. In a mix of pseudo and real code, that would read like:
    from joblib import dump, load
    from sklearn import svm
    from sklearn.model_selection import GridSearchCV

    svc = svm.SVC()  # Probably not what you are using, but just as an example
    # parameters is your parameter grid; X, y are your training data
    gcv = GridSearchCV(svc, parameters, refit=True)
    gcv.fit(X, y)
    estimator = gcv.best_estimator_
    dump(estimator, "your-model.joblib")

    # Somewhere else
    estimator = load("your-model.joblib")
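If you want to check the round trip end to end before committing a full day of compute, a small self-contained sketch could look like this. The iris dataset, the parameter grid, cv=3, the temporary directory, and the file name best-svc.joblib are all assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

# Assumed toy data and grid so the search finishes in seconds
X, y = datasets.load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(svm.SVC(), param_grid, cv=3, refit=True)
search.fit(X, y)

# Persist only the refitted best model, not the whole search object
model_path = os.path.join(tempfile.mkdtemp(), "best-svc.joblib")
dump(search.best_estimator_, model_path)

# Later, e.g. in a fresh session: load the model and predict without retraining
restored = load(model_path)
same_predictions = np.array_equal(
    restored.predict(X), search.best_estimator_.predict(X)
)
print(search.best_params_, same_predictions)
```

Two practical points: joblib files are pickles, so only load files you created or trust, and it is safest to reload the model with the same scikit-learn version that was used to train it.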