Train-test split and cross-validation
Before training any ML model you need to set aside some of the data to be able to test how your model performs on data it hasn't seen. Doing this is a part of any machine learning project, and in this post you will learn the fundamentals of this process.
Import libraries and load data
First of all, as always, let's import our basic libraries and load the data we'll use.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
data = pd.read_csv('../data/pima-indians-diabetes.csv')
data.head()
And let's also separate the data into features and labels:
# All the columns except the one we want to predict
features = data.drop(['Outcome'], axis=1)
# Only the column we want to predict
labels = data['Outcome']
Simple train-test split
The most basic thing you can do is split your data into train and test datasets. We will train our model on the train dataset, and then use the test dataset to evaluate the predictions our model makes. It's common to set aside one third of the data for testing.
This can easily be done using the train_test_split function from scikit-learn:
from sklearn.model_selection import train_test_split

test_size = 0.33
seed = 12
X_train, X_test, Y_train, Y_test = train_test_split(
    features, labels, test_size=test_size, random_state=seed)
We set the test size to 33% and specify a random seed so that the results are reproducible: each time we run the code, the data is randomly split in the same way. That way, if we later want to train another model, we can compare it fairly to this one, because both will be trained and tested on the same data points.
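To make the effect of the seed concrete, here is a minimal sketch. It uses a small made-up array (the names X and y and the data itself are assumptions for illustration, not the diabetes dataset) and checks that two splits with the same random_state pick the same rows, while a different seed picks different ones.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 100 rows, 2 feature columns (not the diabetes dataset)
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Two splits with the same seed select exactly the same rows
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.33, random_state=12)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.33, random_state=12)
same_seed_identical = np.array_equal(X_test_a, X_test_b)

# A different seed selects a different set of test rows
_, X_test_c, _, _ = train_test_split(X, y, test_size=0.33, random_state=13)
different_seed_identical = np.array_equal(X_test_a, X_test_c)

print("Train/test shapes:", X_train_a.shape, X_test_a.shape)
print("Same seed, same split:", same_seed_identical)
print("Different seed, same split:", different_seed_identical)
```

The seed value itself carries no meaning; what matters is that it stays fixed between the runs you want to compare.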
Now we can import a simple model, train it, and use the test dataset to evaluate its results:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, Y_train)
# score() returns the accuracy on the held-out test data
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result * 100.0))
Thanks to the test dataset, we know that the model we trained reaches about 75% accuracy on data it hasn't seen.
K-fold Cross Validation
One big problem with a simple train-test split is that you're setting aside a chunk of your data, so you can't use it to train your algorithm. And since the test set is sampled at random, it may happen to be skewed in some way and not represent the whole dataset properly.
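A quick way to see this sampling effect is to repeat the split with several seeds and compare the resulting scores. The sketch below is only an illustration: it uses synthetic data from make_classification as a stand-in for a real dataset, and the sample size, the seeds, and max_iter=1000 are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration, not the diabetes dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

split_scores = []
for seed in range(5):
    # Same data and same model, only the random split changes
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    split_scores.append(model.score(X_test, y_test))

print("Accuracy per split:", [round(s, 3) for s in split_scores])
print("Spread (max - min):", round(max(split_scores) - min(split_scores), 3))
```

Any difference between these scores comes from the split alone, which is why a single train-test score should be read as one sample rather than as the final word on the model.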
K-fold cross validation addresses these problems. First you split the data into k subsets, called folds (10 of them if k = 10). Then you train and evaluate your model k times, each time holding out a different fold for evaluation and training the model on the remaining k - 1 folds.
Here's how it works:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

num_folds = 10
seed = 7

# shuffle=True is needed for random_state to take effect; recent
# scikit-learn versions raise an error if it is left out
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)
model = LogisticRegression()

# Train and evaluate the model once per fold
scores = cross_val_score(model, features, labels, cv=kfold)
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard deviation:", scores.std())
Scores: [0.7012987  0.81818182 0.74025974 0.71428571 0.77922078 0.75324675
 0.85714286 0.80519481 0.72368421 0.80263158]
Mean: 0.7695146958304853
Standard deviation: 0.04841051924567195
cross_val_score returns an array of 10 performance scores, one per fold, and you can summarize them by calculating their mean and standard deviation. That way you'll know the average score (a more reliable estimate than the score from a single split) and the spread of the scores.
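To show what happens inside that call, here is a sketch that writes the k-fold loop out by hand and compares it with cross_val_score. It again relies on synthetic data from make_classification as a stand-in, and max_iter=1000 is an arbitrary choice made to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration, not the diabetes dataset)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)

manual_scores = []
for train_idx, test_idx in kfold.split(X):
    # Train a fresh model on 9 folds, evaluate on the fold that was held out
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The library helper runs the same loop for us
library_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=kfold)

print("Manual scores: ", np.round(manual_scores, 3))
print("Library scores:", np.round(library_scores, 3))
print("Mean:", manual_scores.mean())
print("Standard deviation:", manual_scores.std())
```

Every sample ends up in the held-out fold exactly once, so all of the data contributes to both training and evaluation.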
The obvious downside of cross-validation is that you have to train your model multiple times (10 in this case), which can be very slow if your dataset is large.
Now you know how to split your data into training and test sets and evaluate the results.
K-fold cross validation is considered a gold standard for evaluating the performance of ML algorithms. You can use 3, 5, or 10 as a reasonable number of folds.
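If you are unsure which of these fold counts to use, one option is to try all three and look at both the scores and the number of model fits each one costs. This is a rough sketch on synthetic data (make_classification is a stand-in for a real dataset, and the seed and max_iter values are arbitrary).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (an assumption for illustration, not the diabetes dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

results = {}
for k in (3, 5, 10):
    kfold = KFold(n_splits=k, shuffle=True, random_state=7)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kfold)
    # Store the summary for each k: (mean score, standard deviation)
    results[k] = (scores.mean(), scores.std())
    print(f"k={k}: {k} model fits, mean={scores.mean():.3f}, std={scores.std():.3f}")
```

More folds mean more training data in each fit but also more fits, so the right choice depends on how expensive your model is to train.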
If your dataset is very large and training your model becomes very slow, you can fall back on a simple train-test split (the more data you have, the likelier the training set is to represent the whole dataset anyway).