Fine-tuning models with Grid Search
After you have prepared your data and selected and trained your model, the next step is fine-tuning its parameters to get better results.
Unfortunately, it's hard to know in advance which parameters will lead to better results: nobody can say for sure which learning rate or number of features will work best for your model on your dataset. The best way to pick the right parameters is to experiment with different values, but doing that manually would be extremely slow and tedious.
Fortunately, there's a very convenient way to try a whole bunch of different parameters and their combinations, and see which ones work best.
Grid Search
Scikit-Learn's GridSearchCV class automatically searches through many combinations of parameters, trains your model on each, and reports the results.
All you need to do is pass it a dictionary of the parameters and values you want to try out, and it will evaluate every possible combination using cross-validation.
Let's try it out on our familiar housing prices dataset and RandomForest model!
Load the data
As always, first we'll load our housing prices dataset and split it into features and labels:
import pandas as pd
import numpy as np
data = pd.read_csv('../data/housing.csv')
data.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
# All the columns except the one we want to predict
features = data.drop(['median_house_value'], axis=1)
# Get rid of incomplete and non-numerical features so we don't have to deal
# with data preparation in this post
features = features.drop(['ocean_proximity', 'total_bedrooms'], axis=1)
# Only the column we want to predict
labels = data['median_house_value']
Now let's import our model and evaluate it to establish a baseline:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
# Shuffle the data before splitting so the folds are representative
# (random_state only has an effect when shuffle=True)
kfold = KFold(n_splits=5, shuffle=True, random_state=12)
cv_scores = cross_val_score(model, features, labels,
                            cv=kfold,
                            scoring='neg_mean_squared_error')
print("Root mean squared error", np.sqrt(-cv_scores.mean()))
Root mean squared error 74518.08192158747
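By the way, cross_val_score only returns the validation scores. If you also want the training scores, for example to check for overfitting, you can use cross_validate instead; a minimal sketch:
from sklearn.model_selection import cross_validate
# cross_validate supports return_train_score, which adds the scores
# the model achieved on the training folds to the results dictionary
results = cross_validate(model, features, labels,
                         cv=kfold,
                         return_train_score=True,
                         scoring='neg_mean_squared_error')
print("Train RMSE", np.sqrt(-results['train_score'].mean()))
print("Test RMSE", np.sqrt(-results['test_score'].mean()))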
Now we're ready to use grid search to fine-tune our model! We will pass it two dictionaries of parameters to try out. First it'll try 2×3 combinations of n_estimators and max_features; then it'll set bootstrap to False and try 2×3 more combinations of n_estimators and max_features, 12 combinations in total.
Since cross-validation will train each model 5 times, we'll have 60 rounds of training.
from sklearn.model_selection import GridSearchCV
param_grid = [
{'n_estimators': [3, 10], 'max_features': [2, 4, 6]},
{'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
# refit=True means grid search will retrain the best model it could find
# on the whole training set (without cross validation)
grid_search = GridSearchCV(model, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           refit=True)
grid_search.fit(features, labels)
After our search is done, we can get the best combination of parameters like so:
grid_search.best_params_
{'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
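Note that best_params_ is a plain dictionary, so if you want to build a fresh model with these parameters yourself, you can unpack it straight into the constructor:
# best_params_ is a regular dict, so it can be unpacked
# directly into the model's constructor
best_model = RandomForestRegressor(**grid_search.best_params_)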
You don't have to do that by hand, though: grid search has also saved the best model it trained:
grid_search.best_estimator_
RandomForestRegressor(bootstrap=False, criterion='mse', max_depth=None,
max_features=2, max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)
And we can also check out the evaluation scores for each set of parameters:
grid_search.cv_results_
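cv_results_ is also just a dictionary, so a convenient way to inspect it is to load it into a DataFrame and sort the candidates by rank; a quick sketch:
# Load the search results into a DataFrame and sort by the
# search's own ranking of the candidates
results_df = pd.DataFrame(grid_search.cv_results_)
results_df = results_df.sort_values('rank_test_score')
print(results_df[['params', 'mean_test_score', 'rank_test_score']])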
Randomized Search
As you saw above, grid search makes the number of training rounds explode very quickly. When you have a lot of parameters to test, it might be a good idea to use randomized search instead. It works a lot like GridSearchCV, but instead of trying out all possible combinations it evaluates a preset number of random combinations. This gives you more control over how much time and processing power the search will take.
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
# Tell randomized search to sample parameter values from a uniform
# distribution of integers (scipy's randint excludes the upper bound)
param_distributions = {'n_estimators': randint(2, 10),
                       'max_features': randint(2, 8)}
# Try random combinations of parameters for 10 iterations
randomized_search = RandomizedSearchCV(model, param_distributions, n_iter=10,
                                       scoring='neg_mean_squared_error',
                                       refit=True)
randomized_search.fit(features, labels)
RandomizedSearchCV(cv=None, error_score='raise',
estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0, warm_start=False),
fit_params=None, iid=True, n_iter=10, n_jobs=1,
param_distributions={'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd01bfa2f28>, 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd01bfa2780>},
pre_dispatch='2*n_jobs', random_state=None, refit=True,
return_train_score='warn', scoring='neg_mean_squared_error',
verbose=0)
Excellent! Just as before, we can retrieve the resulting values like so:
print(randomized_search.best_score_)
print(randomized_search.best_estimator_.n_estimators)
print(randomized_search.best_estimator_.max_features)
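Keep in mind that best_score_ is a negative mean squared error (because of the scoring we chose), so to compare it with our baseline we flip the sign and take the square root:
# best_score_ is a negative MSE (sklearn scorers treat higher as better),
# so convert it back into an RMSE comparable with our baseline above
print("Best RMSE", np.sqrt(-randomized_search.best_score_))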
Excellent! We've learned to fine-tune our models using grid search and randomized search, and we can now use them to find the best parameters for our models!