First Machine Learning Project
This is a beginner-friendly, step-by-step tutorial where we will put all the pieces of the Machine Learning process together and complete an end-to-end Machine Learning project.
Our task will be to create a model capable of determining which of three species a flower belongs to based on its description. It is a classic Machine Learning challenge that's perfect for beginners. The Iris Flowers Dataset is a small, well-researched dataset that consists of 4 features describing each flower (the length and width of its sepal and petal) and the flower's species (one of three: Iris Setosa, Iris Versicolor, or Iris Virginica).
Here's our plan:
- Install and import the necessary libraries
- Download and load the dataset
- Analyze it using statistical tools and visualisation
- Train and evaluate a few ML models, comparing their performance
- Tweak the best one to improve our results
As a result we will have a complete model capable of making accurate predictions. Let's get started!
Install and import the libraries
First we need to make sure you have all the tools necessary for this project. We will need:
- Jupyter Lab - an interactive environment Machine Learning researchers use for their projects.
- Pandas - a data analysis library (kind of like Excel in Python).
- Numpy - a library for dealing with all sorts of mathy stuff, specifically arrays and matrices.
- Matplotlib - a powerful plotting library.
- Seaborn - a data visualisation library built on top of Matplotlib (it will help us easily draw all sorts of beautiful plots and charts)
- Scikit-learn - a powerful Machine Learning library making it easy to train and analyze all sorts of ML algorithms.
The easiest way to make sure you have all of this properly installed and set up is to download and install Anaconda - a popular Python Data Science platform that comes with all the necessary tools and makes it very easy to install new ones.
Whenever you're missing a tool, you'll be able to install it very easily like this:
conda install -c anaconda seaborn
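If you already have a Python environment you like and would rather skip Anaconda, installing everything with pip works just as well:
pip install jupyterlab pandas numpy matplotlib seaborn scikit-learn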
Once your setup is complete, create a new Jupyter notebook and import all the basic libraries:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# This is just a bit of customization for seaborn that will make our graphs look prettier
sns.set_style('darkgrid')
# If you're using Jupyter Notebook, this will automatically display all the plots inside the notebook
%matplotlib inline
If you did not receive any errors, that means everything is installed, your setup is ready, and you're good to go!
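If you'd like an extra sanity check, you can also print the version of each library (this is optional, and the exact versions will vary depending on when you installed them):
# Optional: print library versions to confirm everything is importable
import sklearn
print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("seaborn:", sns.__version__)
print("scikit-learn:", sklearn.__version__)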
Downloading and loading the dataset
Inside my projects directory I prefer to have a notebooks folder containing notebooks with my experiments, and a data folder right next to it containing the datasets. Download the Iris dataset and put the CSV file into the data folder. Now you can load it using Pandas:
# Load the data, use the first column as index.
data = pd.read_csv('../data/Iris.csv', index_col=0)
data.head(5)
| Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|---|---|---|---|---|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Excellent! The data has loaded successfully. data.head(5) is a Pandas function that shows the first 5 rows of the dataset. As you can see, it contains the sepal/petal lengths and widths of many flowers, and their corresponding species.
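As a side note, if you ever don't want to download the CSV manually, Scikit-learn ships with its own copy of this dataset. Here's a small sketch (note that the bundled version uses different column names and encodes the species as the numbers 0-2; the as_frame option requires scikit-learn 0.23 or newer):
# Alternative: load the copy of the Iris dataset bundled with Scikit-learn
from sklearn.datasets import load_iris
iris = load_iris(as_frame=True)
iris.frame.head(5)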
Let's analyze the dataset to see what else we can learn just by looking at the data!
Data Analysis and Visualisation
To find out the size of our dataset we can use data.info():
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 1 to 150
Data columns (total 5 columns):
SepalLengthCm 150 non-null float64
SepalWidthCm 150 non-null float64
PetalLengthCm 150 non-null float64
PetalWidthCm 150 non-null float64
Species 150 non-null object
dtypes: float64(4), object(1)
memory usage: 7.0+ KB
As you can see, it contains 5 columns (4 features describing the flower's measurements, and 1 label describing its species). We can also see that every column has the same number of values (150), which means none of the data is missing. Great!
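Real-world datasets are rarely this clean, so it's worth knowing how to check for missing values explicitly. One quick way with Pandas:
# Count the missing values in each column (all zeros for this dataset)
data.isnull().sum()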
Now let's look at the statistical summary of our data using the .describe() function:
data.describe()
|  | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
As you can see, it shows us:
- count - the number of datapoints in the column
- mean - the average value
- std - the standard deviation, showing how dispersed the values are (roughly, how far data points typically lie from the mean)
- min and max - self-explanatory
- 25%, 50%, and 75% - the corresponding percentiles
A percentile shows the value below which a given percentage of observations falls. For example, 25% of sepal lengths are smaller than 5.1cm, 50% are smaller than 5.8cm, and 75% are smaller than 6.4cm.
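You can verify these numbers yourself with Pandas:
# The 25th percentile (first quartile) of the sepal lengths - should print 5.1
print(data['SepalLengthCm'].quantile(0.25))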
Now let's look at the so-called "class distribution" - how many flowers of each species are there?
data.groupby('Species').size()
Species
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
dtype: int64
Looks like we have 50 flowers of each species.
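By the way, an equivalent (and slightly shorter) way to get the same counts:
# Count how many rows belong to each species
data['Species'].value_counts()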
Now it's time to try some data visualisation! The simplest plot we can use is a histogram, showing the distribution of each value:
# Setting figsize makes the image a little larger
# Semicolon at the end of the command prevents Jupyter Notebook from
# displaying text output before the image, making it a little prettier
data.hist(figsize=(12,8));
We can also create a boxplot:
sns.boxplot(data=data);
Box plots conveniently summarize the distribution of each variable. The line in the middle of the box represents the median value (the data point in the middle), and the box around it spans the 25th to 75th percentiles (the middle 50% of the data). The "whiskers" show the rest of the distribution, and points outside of them are outliers.
We can also compare, let's say, sepal lengths of all species like so:
sns.boxplot(x='Species',y='SepalLengthCm', data=data);
One more extremely useful tool for understanding the data is the pair plot. It plots every feature in the dataset against every other feature:
sns.pairplot(data=data, hue='Species');
We have colored all the datapoints according to their species, and we can see that similarly colored points are grouped together into little clusters - that's great news, because it means that our algorithms are likely to be able to easily separate them.
To avoid comparing each variable to itself (which would be pointless - the main diagonal would just be full of straight lines), the diagonal shows the histogram of each variable instead.
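Depending on your Seaborn version, the diagonal may show histograms or smooth density curves. If you want to choose explicitly, pairplot accepts a diag_kind argument:
# Same pair plot, but with kernel density estimates on the diagonal
sns.pairplot(data=data, hue='Species', diag_kind='kde');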
Excellent, so that's that for our data analysis! It's time to select and train a few algorithms!
Training and evaluating algorithms
Before we train anything, we first need to set aside some of our data to use as a test set. It will allow us to measure how well our algorithm performs on unseen data. Doing this is easy: Scikit-learn has a train_test_split function that takes features, labels, and how much data you want to set aside, and returns training and testing datasets:
# Separate our data into features and labels
# axis=1 tells Pandas to drop a column (the one containing our labels) rather than a row
features = data.drop('Species', axis=1)
# Only the column containing our labels
labels = data['Species']
from sklearn.model_selection import train_test_split
# Random seed we can use to make our results reproducible
seed = 10
# We want to set aside one third of the data
test_size = 0.33
# X stands for features, Y stands for labels
X_train, X_test, Y_train, Y_test = train_test_split(features, labels, test_size=test_size, random_state=seed)
print("Training features size", X_train.shape)
print("Training labels size", Y_train.shape)
print("Testing features size", X_test.shape)
print("Testing labels size", Y_test.shape)
Training features size (100, 4)
Training labels size (100,)
Testing features size (50, 4)
Testing labels size (50,)
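Our classes are perfectly balanced (50 of each), so a plain random split works fine here. On imbalanced datasets you may want to add train_test_split's stratify parameter, which keeps the class proportions the same in both sets. A sketch of the same split, stratified:
# Same split, but preserving the proportion of each species
# in both the training and the testing sets
X_train, X_test, Y_train, Y_test = train_test_split(
    features, labels, test_size=test_size, random_state=seed, stratify=labels)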
Now it's time to select our models. There are a lot of models that could be applied, and we can't know in advance which one will perform best, so we will try several of them:
- Logistic Regression.
- Linear Discriminant Analysis.
- k-Nearest Neighbors.
- Decision Trees.
- Gaussian Naive Bayes.
- Support Vector Machines.
When you need help deciding what algorithms to try for your problem, consult this guide.
We will import and create a list of models we want to try:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
models = []
models.append(('LogisticRegression', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DecisionTree', DecisionTreeClassifier()))
models.append(('NaiveBayes', GaussianNB()))
models.append(('SVM', SVC()))
Now we loop through them, training each one separately and evaluating the results:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# evaluate each model in turn
scores = []
names = []
for name, model in models:
    # KFold without shuffling produces deterministic splits, so the results
    # are reproducible (newer Scikit-learn versions raise an error if you
    # pass random_state without shuffle=True)
    kfold = KFold(n_splits=5)
    cv_score = cross_val_score(model, X_train, Y_train,
                               cv=kfold, scoring='accuracy')
    scores.append(cv_score)
    names.append(name)
    print(name)
    print("Mean:", cv_score.mean(), "Standard Deviation:", cv_score.std())
LogisticRegression
Mean: 0.95 Standard Deviation: 0.05477225575051661
LDA
Mean: 0.97 Standard Deviation: 0.024494897427831803
KNN
Mean: 0.95 Standard Deviation: 0.05477225575051661
DecisionTree
Mean: 0.95 Standard Deviation: 0.04472135954999579
NaiveBayes
Mean: 0.9399999999999998 Standard Deviation: 0.037416573867739396
SVM
Mean: 0.97 Standard Deviation: 0.039999999999999994
We have evaluated each model using so-called K-Fold cross-validation. Basically, it splits our data into several parts (five in our case) called "folds", and trains the algorithm 5 times, each time setting aside a different fold for evaluation and training on the rest.
This allows us to evaluate our algorithm on all the available training data, without touching the chunk we've set aside for final testing. And if any single random split happens to be skewed in some way, this process averages that out.
As a result, we have a mean (average) accuracy score and its standard deviation (how spread out, how noisy, the scores are).
Looks like the Linear Discriminant Analysis and SVM models produced the best results in our case. 97% accuracy - not bad!
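Since we kept all the per-fold scores, one optional finishing touch is to compare the models visually with the boxplot we've already met (note: on Matplotlib 3.9+ the labels argument of boxplot was renamed to tick_labels):
# Compare the spread of cross-validation scores across the models
plt.figure(figsize=(12, 8))
plt.boxplot(scores, labels=names)
plt.title('Algorithm Comparison');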
Make predictions and evaluate our results
Now it's time for our final test - checking the accuracy of our model on the test set we put aside earlier. This will let us know how our algorithm will perform on real data it has never seen before.
Let's train our model on the whole training dataset, and evaluate the results:
from sklearn.metrics import accuracy_score
# Fit our model to training data
model = SVC()
model.fit(X_train, Y_train)
# Generate our predictions
predictions = model.predict(X_test)
print("Accuracy:",accuracy_score(Y_test, predictions))
Accuracy: 0.98
Wow, that's fantastic! Our model has performed with 98% accuracy.
Let's create a confusion matrix. It will show us how many instances of each class were classified correctly and how many weren't, so we can see exactly which mistakes our model made:
from sklearn.metrics import confusion_matrix
cf = confusion_matrix(Y_test, predictions)
species = ['Iris Setosa', 'Iris Versicolor', 'Iris Virginica']
# In Scikit-learn's confusion matrix, rows correspond to the actual classes
# and columns to the predicted classes
df = pd.DataFrame(cf, index=species, columns=species)
pd.concat(
    [pd.concat([df], keys=['Predicted Class'], axis=1)],
    keys=['Actual Class']
)
|  | Predicted: Iris Setosa | Predicted: Iris Versicolor | Predicted: Iris Virginica |
|---|---|---|---|
| Actual: Iris Setosa | 15 | 0 | 0 |
| Actual: Iris Versicolor | 0 | 18 | 1 |
| Actual: Iris Virginica | 0 | 0 | 16 |
The diagonal values show us how many instances were classified correctly; the off-diagonal values show us which instances were misclassified. In our case, only one flower of the Iris Versicolor species was mistakenly classified as Iris Virginica.
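If you prefer a visual version, Seaborn can render the same matrix as an annotated heatmap:
# Draw the confusion matrix as a heatmap, with the count in each cell
sns.heatmap(df, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted Class')
plt.ylabel('Actual Class');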
That's it! You have completed your first Machine Learning project with pretty impressive results! I hope this tutorial was useful to you. Subscribe to my blog for more posts like this, and check out the other articles I have written!