Fine-tuning models with Grid Search

After you have prepared your data and selected and trained your model, the next step is fine-tuning its parameters to get better results.

Unfortunately, it's really hard to know in advance which parameters will lead to better results. Nobody knows for sure which learning rate or number of features will work best for your model on your dataset. So the best way to pick the right parameters is to experiment with different values, but doing it manually would be extremely slow and tedious.

Fortunately, there's a very convenient way to try a whole bunch of different parameters and their combinations, and see which work best.

Grid Search

Scikit-Learn's GridSearchCV class automatically searches through many combinations of parameters, trains your model on each, and reports the results.

All you need to do is pass it a dictionary of parameters and the values you want to try out, and it will evaluate all the possible combinations using cross-validation.

Let's try it out on our familiar housing prices dataset and RandomForest model!

Load the data

As always, first we'll load our housing prices dataset and split it into features and labels:

import pandas as pd
import numpy as np

data = pd.read_csv('../data/housing.csv')
data.head()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity
0 -122.23 37.88 41.0 880.0 129.0 322.0 126.0 8.3252 452600.0 NEAR BAY
1 -122.22 37.86 21.0 7099.0 1106.0 2401.0 1138.0 8.3014 358500.0 NEAR BAY
2 -122.24 37.85 52.0 1467.0 190.0 496.0 177.0 7.2574 352100.0 NEAR BAY
3 -122.25 37.85 52.0 1274.0 235.0 558.0 219.0 5.6431 341300.0 NEAR BAY
4 -122.25 37.85 52.0 1627.0 280.0 565.0 259.0 3.8462 342200.0 NEAR BAY
# All the columns except the one we want to predict
features = data.drop(['median_house_value'], axis=1)
# Get rid of incomplete and non-numerical features so we don't have to deal
# with data preparation in this post
features = features.drop(['ocean_proximity', 'total_bedrooms'], axis=1)

# Only the column we want to predict
labels = data['median_house_value']

Now let's import our model and cross-validate it to get our baseline results:

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
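# random_state only has an effect when shuffle=True, and recent Scikit-Learn
# versions raise an error if it is set without shuffling, so we leave it out
kfold = KFold(n_splits=5)
# cross_val_score has no return_train_score argument (that belongs to
# cross_validate), so we only pass the splitter and the scoring metric
cv_scores = cross_val_score(model, features, labels,
                            cv=kfold,
                            scoring='neg_mean_squared_error')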
print("Root mean squared error", np.sqrt(-cv_scores.mean()))
Root mean squared error 74518.08192158747
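Scikit-Learn scorers follow a "higher is better" convention, so 'neg_mean_squared_error' returns negated MSE values, which is why the score is negated before taking the square root. Here is a minimal sketch with made-up fold scores (the numbers are purely illustrative, not from the housing data) showing that the square root of the mean MSE is not quite the same thing as the average of the per-fold RMSEs:

```python
import numpy as np

# Hypothetical per-fold scores, shaped like the output of
# cross_val_score(..., scoring='neg_mean_squared_error')
cv_scores = np.array([-4.0, -9.0, -16.0, -25.0, -36.0])

mse_per_fold = -cv_scores                      # flip the sign back to plain MSE
rmse_of_mean = np.sqrt(mse_per_fold.mean())    # square root of the average MSE
mean_of_rmse = np.sqrt(mse_per_fold).mean()    # average of the per-fold RMSEs

print("RMSE of mean MSE:", rmse_of_mean)
print("Mean of fold RMSEs:", mean_of_rmse)
```

Either number is a reasonable summary, as long as you compare models using the same one.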

Now we're ready to use grid search to fine-tune our model! We will pass it two dictionaries of parameters to try out. First it'll try 2*3 combinations of n_estimators and max_features, then it'll set bootstrap to False and try out 2*3 more combinations of n_estimators and max_features, for 12 combinations in total.

Since the search below uses cv=2, cross-validation will train each combination 2 times, so we'll have 24 rounds of training, plus one final refit of the best model. With 5 folds it would be 60.
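If you want to check the arithmetic before starting a long search, Scikit-Learn's ParameterGrid class expands a grid into its individual combinations. This is a small sketch; the total_fits helper is a hypothetical name introduced only for illustration, and the total depends on how many folds you pass as cv:

```python
from sklearn.model_selection import ParameterGrid

# Same shape of grid as in this post: a list of two dictionaries
param_grid = [
    {'n_estimators': [3, 10], 'max_features': [2, 4, 6]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

candidates = list(ParameterGrid(param_grid))
n_candidates = len(candidates)  # 6 from the first dict + 6 from the second


def total_fits(n_candidates, n_folds):
    # One fit per candidate per fold (refit=True adds one final fit on top)
    return n_candidates * n_folds


print(n_candidates)                  # 12
print(total_fits(n_candidates, 2))   # 24
print(total_fits(n_candidates, 5))   # 60
```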

from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10], 'max_features': [2, 4, 6]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# refit=True means grid search will retrain the best model it could find
# on the whole training set (without cross validation)
grid_search = GridSearchCV(model, param_grid, cv=2,
                           scoring='neg_mean_squared_error',
                           refit=True)

grid_search.fit(features, labels)

After our search is done, we can get the best combination of parameters like so:

grid_search.best_params_
{'bootstrap': False, 'max_features': 2, 'n_estimators': 10}

Grid search has also saved the best model it has trained:

grid_search.best_estimator_
RandomForestRegressor(bootstrap=False, criterion='mse', max_depth=None,
           max_features=2, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)

And we can also check out the evaluation scores for each set of parameters:

grid_search.cv_results_
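The cv_results_ dictionary is easier to read as a table. Below is a minimal, self-contained sketch that loads it into a pandas DataFrame and converts the negated MSE scores into RMSE. It uses synthetic data from make_regression and a smaller grid as stand-ins (both are assumptions made to keep the example quick, not the housing setup above):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the housing dataset)
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {'n_estimators': [3, 10], 'max_features': [2, 4]},
    cv=2,
    scoring='neg_mean_squared_error',
)
search.fit(X, y)

results = pd.DataFrame(search.cv_results_)
# Scores are negated MSE, so flip the sign and take the square root
results['rmse'] = np.sqrt(-results['mean_test_score'])
summary = results[['params', 'rmse', 'rank_test_score']].sort_values('rank_test_score')
print(summary)
```

The row with rank_test_score equal to 1 corresponds to best_params_.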

Randomized search

As you saw above, grid search makes the number of training rounds explode very quickly. When you have a lot of parameters to test, it might be a good idea to use randomized search instead. It works a lot like GridSearchCV, but instead of trying out all the possible combinations it evaluates a preset number of random combinations. This gives you more control over the time and processing power the search will take.

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

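# Tell random search to sample integers from a discrete uniform distribution.
# randint(low, high) excludes high, so n_estimators is drawn from 2..9
# and max_features from 2..7
param_grid = {'n_estimators': randint(2,10), 
              'max_features': randint(2,8)}

# Search through random parameters for 10 iterations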
randomized_search = RandomizedSearchCV(model, param_grid, n_iter=10,
                                 scoring='neg_mean_squared_error',
                                 refit=True)

randomized_search.fit(features, labels)
RandomizedSearchCV(cv=None, error_score='raise',
          estimator=RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False),
          fit_params=None, iid=True, n_iter=10, n_jobs=1,
          param_distributions={'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd01bfa2f28>, 'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7fd01bfa2780>},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score='warn', scoring='neg_mean_squared_error',
          verbose=0)
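To get a feel for what kind of candidates randomized search draws, you can sample from the same distributions yourself with Scikit-Learn's ParameterSampler. This is only a sketch for inspection; the random_state value is an arbitrary choice to make the draw repeatable, and the candidates in a real search will differ unless you pass the same seed to RandomizedSearchCV:

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_distributions = {'n_estimators': randint(2, 10),
                       'max_features': randint(2, 8)}

# Draw 10 candidates, mirroring n_iter=10
samples = list(ParameterSampler(param_distributions, n_iter=10, random_state=0))

for candidate in samples:
    print(candidate)
```

Because the distributions are sampled independently, the same combination can show up more than once.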

Excellent! Just as before, we can read off the best score and parameters like so:

print(randomized_search.best_score_)
print(randomized_search.best_estimator_.n_estimators)
print(randomized_search.best_estimator_.max_features)
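Keep in mind that best_score_ is a negated MSE here, so it will print as a large negative number. The sketch below shows one way to turn it into an RMSE and to predict with the refitted model. It runs on synthetic data from make_regression with a small n_iter (assumptions made so the example is quick and self-contained, not the housing setup above):

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; not the housing dataset)
X, y = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    {'n_estimators': randint(2, 10), 'max_features': randint(2, 8)},
    n_iter=5,
    cv=2,
    scoring='neg_mean_squared_error',
    random_state=0,  # makes the sampled candidates repeatable
)
search.fit(X, y)

best_rmse = np.sqrt(-search.best_score_)
print("Best cross-validated RMSE:", best_rmse)
print("Best parameters:", search.best_params_)

# With refit=True (the default) the search object can predict directly,
# using the best model retrained on all the data it was given
predictions = search.predict(X[:3])
print(predictions)
```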

Excellent, so we have learned to fine-tune our models using Grid Search and Randomized Search, and now we can use them to find the best-performing parameters among the values we choose to try!
