Train-test split and cross-validation

Before training any ML model, you need to set aside some of the data so you can test how the model performs on data it hasn't seen. This is a part of any machine learning project, and in this post you will learn the fundamentals of the process.

Import libraries and load data

First of all, as always, let's import our basic libraries and load the data we'll use.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('../data/pima-indians-diabetes.csv')
data.head()
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1

And let's also separate the data into features and labels:

# All the columns except the one we want to predict
features = data.drop(['Outcome'], axis=1)
# Only the column we want to predict
labels = data['Outcome']

Simple train-test split

The most basic thing you can do is split your data into train and test datasets. We will train our model on the train dataset, and then use the test dataset to evaluate the predictions our model makes. It's common to set aside about one third of the data for testing.

This can easily be done using the train_test_split function:

from sklearn.model_selection import train_test_split
test_size = 0.33
seed = 12
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=test_size,
random_state=seed)

We set the test size to 33%, and we specify a random seed so that the results are reproducible: each time we run the code we get the same random numbers, so the data is split in the same way. That way, if we want to train another model later, we can compare it fairly to this one, because both will be trained and tested on the same data points.
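To see the effect of the seed for yourself, here's a minimal sketch on a tiny made-up dataset (the toy_features and toy_labels lists are stand-ins I invented for illustration, not the diabetes data). It calls train_test_split twice with the same random_state and checks that both calls return the same split.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for `features` and `labels` (hypothetical data, not the diabetes set)
toy_features = list(range(12))
toy_labels = [i % 2 for i in toy_features]

# Two calls with the same random_state produce identical splits
first = train_test_split(toy_features, toy_labels, test_size=0.33, random_state=12)
second = train_test_split(toy_features, toy_labels, test_size=0.33, random_state=12)

print(first == second)               # True: same seed, same split
print(len(first[0]), len(first[1]))  # sizes of the train and test features
```

If you change random_state in one of the calls, the rows will usually end up distributed differently between train and test.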

Now we can import a simple model, train it, and use the test dataset to evaluate its results:

from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, Y_train)
result = model.score(X_test, Y_test)
print(("Accuracy: %.3f%%") % (result*100.0))
Accuracy: 75.591%

So thanks to the test dataset, we know that our model is about 75.6% accurate on data it hasn't seen.
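For a classifier, model.score returns the mean accuracy: the fraction of test rows where the prediction matches the true label. Here's a rough sketch of that calculation in plain Python. The accuracy helper and the toy label lists are hypothetical names and values made up for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels and predictions: 6 of the 8 predictions are right
toy_true = [1, 0, 1, 0, 1, 0, 0, 1]
toy_pred = [1, 0, 0, 0, 1, 1, 0, 1]

print("Accuracy: %.3f%%" % (accuracy(toy_true, toy_pred) * 100.0))  # Accuracy: 75.000%
```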

K-fold Cross Validation

One big problem with a simple train-test split is that you're setting aside a chunk of your data, so you can't use it to train your algorithm. And since the test set is sampled at random, it may be skewed in some way and not represent the whole dataset properly.
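To get a feel for how much a random test set can drift from the full dataset, here's a small sketch on made-up labels (35 positives out of 100, a proportion I picked for illustration). It splits the same data with several seeds and records the share of positive labels that lands in each test set.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 35 positive labels out of 100 rows
toy_labels = [1] * 35 + [0] * 65
toy_features = list(range(100))

test_rates = []
for seed in range(5):
    _, _, _, y_test = train_test_split(
        toy_features, toy_labels, test_size=0.33, random_state=seed
    )
    # Share of positive labels that ended up in this test set
    test_rates.append(sum(y_test) / len(y_test))

print("Positive rate in the full data: 0.35")
print("Positive rate in each test set:", [round(r, 2) for r in test_rates])
```

The rates will typically scatter around 0.35 rather than hit it exactly, and a model evaluated on an unusually easy or hard test set gets a misleading score.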

K-fold cross validation addresses these problems. First you split the data into k subsets, called folds (10 of them if k = 10). Then you train and evaluate your model k times, setting aside each fold in turn for evaluation and training the model on the remaining k - 1 folds (9 in this case).
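To make the procedure concrete, here's a from-scratch sketch of how the rows can be divided into consecutive folds. The kfold_indices function is a hypothetical helper written for this post, not part of any library, and the 768 rows are an assumed dataset size. I believe it mirrors what an unshuffled k-fold split does, but treat it as an illustration rather than a reference implementation.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most one
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        # Everything outside the current fold is used for training
        train_idx = list(range(0, start)) + list(range(stop, n_samples))
        yield train_idx, test_idx
        start = stop

# Assumed dataset size of 768 rows, split into 10 folds
folds = list(kfold_indices(768, 10))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Every row lands in exactly one test fold, so after all k rounds each data point has been used for evaluation once and for training k - 1 times.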

Here's how to do it with scikit-learn:

from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

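num_folds = 10

# Without shuffle=True the folds are consecutive blocks of rows, so no seed is needed
# (recent scikit-learn versions raise an error if random_state is set without shuffle=True)
kfold = KFold(n_splits=num_folds)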
model = LogisticRegression()
scores = cross_val_score(model, features, labels, cv=kfold)
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard deviation:", scores.std())
Scores: [0.7012987  0.81818182 0.74025974 0.71428571 0.77922078 0.75324675
 0.85714286 0.80519481 0.72368421 0.80263158]
Mean: 0.7695146958304853
Standard deviation: 0.04841051924567195

cross_val_score returns an array of 10 performance scores, one per fold, and you can summarize them by calculating their mean and standard deviation. The mean is a more reliable estimate of performance than the score from a single train-test split, and the standard deviation tells you how much the score varies from fold to fold.
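If you want to check the summary numbers by hand, here's a short sketch using only the standard library. It uses the ten scores printed above (rounded to 8 decimals) and the population standard deviation, which is what NumPy's std() computes by default. The "mean (+/- std)" report format is just one common convention I chose for the example.

```python
import statistics

# The ten fold scores printed above
scores = [0.7012987, 0.81818182, 0.74025974, 0.71428571, 0.77922078,
          0.75324675, 0.85714286, 0.80519481, 0.72368421, 0.80263158]

mean_score = statistics.mean(scores)
# Population standard deviation (divides by n, not n - 1)
std_score = statistics.pstdev(scores)

print("Accuracy: %.1f%% (+/- %.1f%%)" % (mean_score * 100, std_score * 100))
# Accuracy: 77.0% (+/- 4.8%)
```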

The obvious downside of cross-validation is that you have to train your model multiple times (10 in this case), which can be very slow if your dataset is large.

Conclusion

Now you know how to split your data into training and test sets and evaluate the results.

K-fold cross validation is widely considered the gold standard for evaluating the performance of ML algorithms. Reasonable numbers of folds are 3, 5, or 10.
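If you're not sure which number of folds to pick, you can simply try a few. Here's a minimal sketch that runs 3, 5, and 10 folds on a synthetic dataset generated with make_classification (made-up data, so the scores say nothing about the diabetes dataset). The max_iter=1000 setting is my own choice to give the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: 300 rows, 8 features (hypothetical, for illustration only)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

summary = {}
for k in (3, 5, 10):
    # More folds mean more model fits, each trained on a larger share of the data
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=KFold(n_splits=k))
    summary[k] = (scores.mean(), scores.std())
    print("k=%d  mean=%.3f  std=%.3f" % (k, scores.mean(), scores.std()))
```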

If your dataset is very large and training your model becomes very slow, you can fall back on a simple train-test split (the more data you have, the likelier the training set is to represent the whole dataset anyway).
