Linear Regression

Linear regression is one of the simplest Machine Learning algorithms: it fits a line through the data, which you can then use to predict new data points. In this project we will solve a classic introductory Machine Learning problem. We will take a dataset that describes houses and their prices, and use it to train a model that can predict the price of a house based on a few parameters (like its size, location, average number of rooms, etc.).

Let's get started!

Importing dependencies

To complete this project you'll need to have several libraries installed. By far the easiest way to do that is to install Anaconda. Let's import them right away to make sure everything works.

# Pandas is a data analysis library, it's like Excel in Python
import pandas as pd
# Numpy is a library for dealing with all sorts of mathy stuff, specifically arrays and matrices.
import numpy as np

# Matplotlib is a powerful plotting library
import matplotlib.pyplot as plt
# Seaborn is a data visualisation library built on top of matplotlib,
# it will help us easily draw all sorts of sexy plots and graphs and charts
import seaborn as sns

# This is just a bit of customization for seaborn that will make our graphs look prettier
sns.set_style('darkgrid')

# If you're using Jupyter Notebook, this will automatically display all the plots inside the notebook
%matplotlib inline

Loading the data

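First we need to load the data. We're going to use one of the default datasets that comes with scikit-learn. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the import below only works with older versions.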

# Import the dataset
from sklearn.datasets import load_boston
# This will return a dictionary containing the data and a few more things
boston = load_boston()
boston.keys()
# print(boston['DESCR'])
dict_keys(['data', 'target', 'feature_names', 'DESCR'])

This gives us a dictionary containing all the information we need. boston['data'] is an array of values containing all the features, boston['target'] is a list of the corresponding house prices (what we're actually trying to predict), and boston['feature_names'] contains a list of, well, feature names that will help us understand what the values in boston['data'] actually mean.

You can run print(boston['DESCR']) to learn more about this dataset. The most interesting part of this description is the explanation of what each of the feature names means:

  • CRIM - per capita crime rate by town
  • ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS - proportion of non-retail business acres per town
  • CHAS - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX - nitric oxides concentration (parts per 10 million)
  • RM - average number of rooms per dwelling
  • AGE - proportion of owner-occupied units built prior to 1940
  • DIS - weighted distances to five Boston employment centres
  • RAD - index of accessibility to radial highways
  • TAX - full-value property-tax rate per $10,000
  • PTRATIO - pupil-teacher ratio by town
  • B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT - % lower status of the population
  • MEDV - Median value of owner-occupied homes in $1000's

Now that we have all that information, let's use pandas to put it together into a convenient table.

# DataFrame() creates a table that will contain the data values in its cells,
# and use boston['feature_names'] to name its columns.
df = pd.DataFrame(boston['data'], columns=boston['feature_names'])
# We can use .head(5) to look at the first 5 rows of the data frame
df.head(5)
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33

Let's add one more column to our table containing the house prices.

df['PRICE'] = boston['target']
df.head(5)
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  PRICE
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98   24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14   21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03   34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94   33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33   36.2

Exploring the dataset

A good first step for exploring the data is to use pandas' .describe() method to find out things like the lowest, highest, and average values in each column, the standard deviations, etc.

df.describe()
             CRIM          ZN       INDUS        CHAS         NOX          RM         AGE         DIS         RAD         TAX     PTRATIO           B       LSTAT       PRICE
count  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000  506.000000
mean     3.593761   11.363636   11.136779    0.069170    0.554695    6.284634   68.574901    3.795043    9.549407  408.237154   18.455534  356.674032   12.653063   22.532806
std      8.596783   23.322453    6.860353    0.253994    0.115878    0.702617   28.148861    2.105710    8.707259  168.537116    2.164946   91.294864    7.141062    9.197104
min      0.006320    0.000000    0.460000    0.000000    0.385000    3.561000    2.900000    1.129600    1.000000  187.000000   12.600000    0.320000    1.730000    5.000000
25%      0.082045    0.000000    5.190000    0.000000    0.449000    5.885500   45.025000    2.100175    4.000000  279.000000   17.400000  375.377500    6.950000   17.025000
50%      0.256510    0.000000    9.690000    0.000000    0.538000    6.208500   77.500000    3.207450    5.000000  330.000000   19.050000  391.440000   11.360000   21.200000
75%      3.647423   12.500000   18.100000    0.000000    0.624000    6.623500   94.075000    5.188425   24.000000  666.000000   20.200000  396.225000   16.955000   25.000000
max     88.976200  100.000000   27.740000    1.000000    0.871000    8.780000  100.000000   12.126500   24.000000  711.000000   22.000000  396.900000   37.970000   50.000000
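
To make these summary rows less of a black box, here is a small sketch that recomputes a few of them by hand. The ROOMS column is a made-up toy example (an assumption for illustration, not taken from the Boston dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not part of the Boston dataset
toy = pd.DataFrame({'ROOMS': [4.0, 5.0, 6.0, 7.0, 8.0]})

# Let pandas compute the summary table
summary = toy.describe()

# Recompute a few of the rows by hand with numpy
rooms = toy['ROOMS'].to_numpy()
manual_mean = rooms.mean()
manual_std = rooms.std(ddof=1)            # describe() reports the sample standard deviation
manual_median = np.percentile(rooms, 50)  # the '50%' row is the median

print(summary)
print(manual_mean, manual_std, manual_median)
```

The one detail that is easy to miss is ddof=1: numpy's default is the population standard deviation, while pandas reports the sample one.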

Now let's use Seaborn's pairplot() function, which plots the features against each other. It's an easy way to get a bird's-eye overview of all the data.

Because there are a lot of variables and plotting them all against each other would create a pretty huge grid (you can try it if you want), I'm going to set the y_vars parameter to the price and x_vars to the rest of the features. That way I'll only plot the price (the variable I'm interested in) against all the other features, to see how they relate to each other.

# Plot PRICE against every other column (dropping PRICE itself from the x axis)
sns.pairplot(df, x_vars=df.columns.drop('PRICE'), y_vars=['PRICE'])

This is pretty cool: we can immediately see how the price decreases with increasing crime rate (first image), and how it increases almost linearly with the number of rooms in the house (RM).

Now we can look at these relationships more closely by plotting two variables against each other. Here we'll see how the price grows with the number of rooms:

sns.jointplot(x='RM',y='PRICE',data=df)
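
Plots give a visual impression of these relationships, and a correlation coefficient can put a number on them. The sketch below uses synthetic data with made-up coefficients (an assumption for illustration, not the Boston values), so only the signs of the results are meaningful.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing table; every number here is invented
rng = np.random.default_rng(0)
n = 200
rooms = rng.normal(6.0, 0.7, size=n)
crime = rng.exponential(3.0, size=n)
price = 9.0 * rooms - 1.0 * crime - 30.0 + rng.normal(0.0, 3.0, size=n)

toy = pd.DataFrame({'RM': rooms, 'CRIM': crime, 'PRICE': price})

# Pearson correlation of every feature with the price
correlations = toy.corr()['PRICE'].drop('PRICE')
print(correlations.sort_values())
```

A value close to +1 or -1 means a strong linear relationship, and a value near 0 means the feature tells a straight line very little about the price.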

Preparing the data for training

Now that we've explored the data a bit, let's prepare it for training. We will want to create separate variables for features (the data we'll use to predict prices) and labels (the price of the house).

# By convention, X means features and y stands for labels.
# For features, we want to use all the data except for the price
# axis = 1 is a parameter we need to pass so that pandas knows we want to delete the column.
X = df.drop('PRICE', axis=1)
# For labels, we want to use just the price column
y = df['PRICE']

Then we will want to set aside a third of our data for testing, so we can use it later to test the quality of our model. Scikit-learn already has a function that does this automatically.

from sklearn.model_selection import train_test_split

# This will split data into training and test sets.
# test_size sets how much of the data to allocate for testing (33%),
# and random state is just a seed to make results easier to reproduce
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
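
The idea behind the split can be sketched in a few lines of numpy: shuffle the row indices, then slice off a share of them for testing. The function simple_train_test_split below is a hypothetical helper written for illustration; it is not scikit-learn's implementation and will not produce the same rows as train_test_split.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.33, seed=42):
    """Illustrative shuffle-and-slice split (not scikit-learn's implementation)."""
    n_samples = len(X)
    n_test = int(np.ceil(test_size * n_samples))
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)   # random order of the row indices
    test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]
    # Index X and y with the same indices so features and labels stay paired
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy arrays with as many rows as the housing table (506)
X_toy = np.arange(506 * 2).reshape(506, 2)
y_toy = np.arange(506)

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_toy, y_toy)
print(X_tr.shape, X_te.shape)
```

The important property is that every row ends up in exactly one of the two sets, and that each feature row keeps its own label.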

Training the model

Now we're ready to import and train the model.

# import the model class for LinearRegression
from sklearn.linear_model import LinearRegression
# create an object - a specific model we can actually train
model = LinearRegression()

To train the model we can simply call .fit() and pass it the data. Different models require different parameters, but thanks to the magic of scikit-learn, all the models can be trained using this method.

Training the model will change the model object itself, saving everything the model has learned inside it.

# train the model
model.fit(X_train,y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
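
For the curious, LinearRegression fits an ordinary least squares model, and the core of that computation can be sketched with numpy alone. The data below is synthetic, and true_coef and true_intercept are made-up values chosen so we can check that the fit recovers them. This is a sketch of the idea, not of scikit-learn's internals.

```python
import numpy as np

# Synthetic data generated from a known linear rule plus a little noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
true_coef = np.array([3.0, -2.0])   # invented for this example
true_intercept = 5.0                # invented for this example
y_toy = X_toy @ true_coef + true_intercept + rng.normal(scale=0.1, size=100)

# Append a column of ones so the intercept is learned as one more weight
X_design = np.column_stack([X_toy, np.ones(len(X_toy))])

# Find the weights that minimise the sum of squared errors
weights, *_ = np.linalg.lstsq(X_design, y_toy, rcond=None)
coef, intercept = weights[:-1], weights[-1]

print(coef, intercept)
```

The learned weights play the same role as the coef_ and intercept_ attributes that a fitted LinearRegression object stores.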

Using the model to predict house prices

Now that we have a trained model, we can pass it the testing data we have set aside and use it to generate predictions.

# Creating a list of predictions for every datapoint in testing data
predictions = model.predict(X_test)

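Now we can look at our first prediction. The predictions come back in the same order as the rows of X_test, so predictions[0] belongs to the first row by position, X_test.iloc[0].

# .iloc[0] selects the first row by position; .loc[0] would look up the row labelled 0 instead
print('A house with these parameters: \n', X_test.iloc[0])
print('\n Will cost this much:  \n',predictions[0])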
A house with these parameters: 
 CRIM         0.00632
ZN          18.00000
INDUS        2.31000
CHAS         0.00000
NOX          0.53800
RM           6.57500
AGE         65.20000
DIS          4.09000
RAD          1.00000
TAX        296.00000
PTRATIO     15.30000
B          396.90000
LSTAT        4.98000
Name: 0, dtype: float64

 Will cost this much:  
 28.5408021214125

Awesome, we're now using our model to predict new data points!
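
One thing worth keeping in mind when pairing a prediction with its house: the predictions array is ordered like the rows of X_test, and after a shuffled split the row labels no longer match the row positions. The toy table below (made-up values, with an invented row order standing in for a shuffled test set) shows the difference between looking a row up by position and by label.

```python
import pandas as pd

# Hypothetical four-row table with the default labels 0..3
toy = pd.DataFrame({'RM': [6.5, 6.4, 7.1, 6.9]}, index=[0, 1, 2, 3])

# Pretend this is a shuffled test set: the rows keep their original labels
shuffled = toy.loc[[2, 0, 3, 1]]

first_by_position = shuffled.iloc[0]   # first row in the table (labelled 2)
row_labelled_zero = shuffled.loc[0]    # the row labelled 0 (second in the table)

print(first_by_position.name, first_by_position['RM'])
print(row_labelled_zero.name, row_labelled_zero['RM'])
```

So the i-th prediction should be compared with X_test.iloc[i] and y_test.iloc[i], whatever labels those rows happen to carry.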

Testing the model

To check the accuracy of our model, we want to use all of the testing data to find out how close its predictions are to the actual results.

There are various ways to measure the error of a model; one of the most common is the root mean squared error (RMSE). An error is the difference between what the model predicted and the correct value. We square each error to make it positive (otherwise predictions that are too low would cancel out predictions that are too high), and then average the squares. Finally we take the square root of the result, so that the error is measured in the same units as the price (in this case thousands of dollars, since MEDV is given in $1000's).

from sklearn import metrics
# These functions will compare the predictions with the correct results (y_test)
metrics.mean_absolute_error(y_test, predictions)
metrics.mean_squared_error(y_test, predictions)
# Take a square root to get the value in the units of PRICE (thousands of dollars)
np.sqrt(metrics.mean_squared_error(y_test, predictions))
4.5549032218378915
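
To make these definitions concrete, here is the same arithmetic done by hand on four made-up prices (the numbers are invented for illustration). Two predictions are too low and two are too high, so the plain mean error cancels out to zero, which is exactly why we square the errors or take their absolute values.

```python
import numpy as np

# Invented prices and predictions, in thousands of dollars
y_true = np.array([20.0, 30.0, 25.0, 15.0])
y_pred = np.array([18.0, 32.0, 24.0, 16.0])

errors = y_pred - y_true        # [-2, 2, -1, 1]
mean_error = errors.mean()      # positive and negative errors cancel out
mae = np.abs(errors).mean()     # mean absolute error
mse = (errors ** 2).mean()      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, back in price units

print(mean_error, mae, mse, rmse)
```

The RMSE comes out a little higher than the mean absolute error because squaring gives large errors more weight.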

We can also create a scatter plot to compare predictions with the correct results.

plt.scatter(y_test, predictions)

If our model were perfect, this plot would look like a straight line (our predictions would perfectly match the correct answers). Because the model is imperfect, the result is a bit noisy, but still good enough.

Another helpful way of evaluating our results is plotting the distribution of errors:

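# Subtract our predictions from the correct results to get the errors (residuals)
# Note: distplot is deprecated in seaborn 0.11+; sns.histplot(..., kde=True) is the newer equivalent
sns.distplot((y_test - predictions))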

When a linear model fits the data well, the errors are roughly normally distributed and centered around zero, which is what we have here. Basically this means that most of our predictions are close to the correct value (an error of 0), give or take some noise. This suggests that a linear model is a reasonable choice for this data, and the spread of the distribution shows how far off the predictions typically are.
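
The same check can be done with a few numbers instead of a plot. The sketch below uses synthetic prices and hypothetical predictions (both invented here, with noise levels chosen only to loosely resemble this tutorial's scale). It computes the mean of the errors, their standard deviation, and the share of errors within one standard deviation of the mean, which is roughly 68% for normally distributed data.

```python
import numpy as np

# Synthetic stand-ins for y_test and predictions; all values are invented
rng = np.random.default_rng(1)
y_true = rng.normal(22.0, 9.0, size=500)
y_pred = y_true + rng.normal(0.0, 4.5, size=500)

residuals = y_true - y_pred

mean_resid = residuals.mean()        # should be close to zero
std_resid = residuals.std(ddof=1)    # typical size of an error
share_within_one_std = np.mean(np.abs(residuals - mean_resid) <= std_resid)

print(mean_resid, std_resid, share_within_one_std)
```

A mean far from zero would suggest the model is systematically predicting too high or too low, and a share far from 68% would hint that the errors are not normally distributed.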

Conclusion

Congratulations, you have learned how to load the data, explore it, train your first model, and evaluate the results! I hope this tutorial was useful to you. You can download this post in Jupyter Notebook format here.
