Linear regression is one of the simplest machine learning algorithms: it fits a line through the data, which you can then use to predict future data points. In this project we will solve a classic introductory machine learning problem. We will take a dataset that describes houses and their prices, and use it to train a model that can predict the price of a house based on a few parameters (like its size, location, average number of rooms, etc.).
Let's get started!
To complete this project you'll need to have several libraries installed. By far the easiest way to do that is to install Anaconda. Let's import them right away to make sure everything works.
# Pandas is a data analysis library, it's like Excel in Python
import pandas as pd

# Numpy is a library for dealing with all sorts of mathy stuff, specifically arrays and matrices.
import numpy as np

# Matplotlib is a powerful plotting library
import matplotlib.pyplot as plt

# Seaborn is a data visualisation library built on top of matplotlib,
# it will help us easily draw all sorts of sexy plots and graphs and charts
import seaborn as sns

# This is just a bit of customization for seaborn that will make our graphs look prettier
sns.set_style('darkgrid')

# If you're using Jupyter Notebook, this will automatically display all the plots inside the notebook
%matplotlib inline
First we need to load the data. We're going to use one of the default datasets that come with scikit-learn.
# Import the dataset
# (load_boston was removed in scikit-learn 1.2, so this needs an older version)
from sklearn.datasets import load_boston

# This will return a dictionary-like object containing the data and a few more things
boston = load_boston()
boston.keys()
# print(boston['DESCR'])
dict_keys(['data', 'target', 'feature_names', 'DESCR'])
This gives us a dictionary containing all the information we need:
- boston['data'] is an array of values containing all the features,
- boston['target'] is an array of the corresponding house prices (what we're actually trying to predict),
- boston['feature_names'] contains a list of, well, feature names that will help us understand what the values in boston['data'] actually mean.
You can run print(boston['DESCR']) to learn more about this dataset. The most interesting part of this description is the explanation of what each of the feature names means:
- CRIM - per capita crime rate by town
- ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS - proportion of non-retail business acres per town
- CHAS - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX - nitric oxides concentration (parts per 10 million)
- RM - average number of rooms per dwelling
- AGE - proportion of owner-occupied units built prior to 1940
- DIS - weighted distances to five Boston employment centres
- RAD - index of accessibility to radial highways
- TAX - full-value property-tax rate per $10,000
- PTRATIO - pupil-teacher ratio by town
- B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT - % lower status of the population
- MEDV - Median value of owner-occupied homes in $1000's
Now that we have all that information, let's use pandas to put it together into a convenient table.
# DataFrame() creates a table that will contain the data values in its cells,
# and use boston['feature_names'] to name its columns.
df = pd.DataFrame(boston['data'], columns=boston['feature_names'])

# We can use .head(5) to look at the first 5 rows of the data frame
df.head(5)
Let's add one more column to our table containing the house prices.
df['PRICE'] = boston['target']
df.head(5)
A good first step for exploring the data is to use the pandas .describe() method to find out things like the lowest, highest, and average values in each column, standard deviations, etc.
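The call itself is not shown here; in this tutorial it would presumably be df.describe(). As a minimal sketch of what the method returns, here is the same call on a tiny made-up table (the name toy and its values are placeholders, not the Boston data):

```python
import pandas as pd

# Tiny stand-in table with made-up values (NOT the Boston data)
toy = pd.DataFrame({
    "RM": [5.0, 6.0, 7.0, 8.0],
    "PRICE": [15.0, 20.0, 30.0, 45.0],
})

# describe() returns a new table with one row per statistic
# (count, mean, std, min, 25%, 50%, 75%, max) and one column per numeric column
summary = toy.describe()
print(summary)

# Individual statistics can be read out of the summary by row and column label
print("Average price:", summary.loc["mean", "PRICE"])
```

Because the result is itself a DataFrame, you can slice it like any other table, which is handy when you only care about a couple of columns.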
Now let's use Seaborn's .pairplot() function, which plots the features against each other. It's an easy way to get a bird's-eye overview of all the data.
Because there are a lot of variables, plotting them all against each other would create a pretty huge grid (you can try it if you want). Instead, I'm going to set the parameter y_vars to the price and x_vars to the rest of the features. That way I'll only plot the price (the variable I'm interested in) against all the other features, to see how they relate to each other.
# Drop PRICE from x_vars so the price is not plotted against itself
sns.pairplot(df, x_vars=df.columns.drop('PRICE'), y_vars=['PRICE'])
This is pretty cool: we can immediately see how the price decreases with increasing crime rate (first image), and how it increases almost linearly with the number of rooms in the house (RM).
Now we can look more closely at these relationships by plotting two variables against each other. Here we'll see how the price grows with the number of rooms.
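The plotting code for this step is not included in the text, so here is one way it could look. This is only a sketch: it uses plain matplotlib and a small made-up table called toy as a stand-in for df, so the numbers are placeholders rather than real Boston values.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for the tutorial's df (made-up values, NOT the Boston data)
toy = pd.DataFrame({
    "RM": [4.5, 5.5, 6.0, 6.5, 7.0, 8.0],
    "PRICE": [12.0, 17.0, 21.0, 24.0, 31.0, 44.0],
})

# One point per row: number of rooms on the x axis, price on the y axis
fig, ax = plt.subplots()
ax.scatter(toy["RM"], toy["PRICE"])
ax.set_xlabel("RM (average rooms per dwelling)")
ax.set_ylabel("PRICE")
ax.set_title("Price vs. number of rooms")
```

With the real data you would pass df["RM"] and df["PRICE"] instead of the toy columns.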
Now that we've explored the data a bit, let's prepare it for training. We will want to create separate variables for features (the data we'll use to predict prices) and labels (the price of the house).
# By convention, X means features and y stands for labels.
# For features, we want to use all the data except for the price
# axis = 1 is a parameter we need to pass so that pandas knows we want to delete the column.
X = df.drop('PRICE', axis=1)

# For labels, we want to use just the price column
y = df['PRICE']
Then we will want to set aside a third of our data for testing, so we can use it later to test the quality of our model. Scikit-learn already has a function that does this automatically.
from sklearn.model_selection import train_test_split

# This will split data into training and test sets.
# test_size tells us how much data to allocate for testing (33%),
# and random state is just a seed to make results easier to reproduce
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
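To get a feel for what the split does, here is a small sanity check on placeholder arrays. X_demo and y_demo are made-up names and values, and the only thing they share with the real data is the row count (the Boston dataset has 506 rows).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 506 placeholder rows; row i holds [2*i, 2*i + 1] and its label is i
X_demo = np.arange(506 * 2).reshape(506, 2)
y_demo = np.arange(506)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Roughly one third of the rows end up in the test set
print(X_tr.shape, X_te.shape)

# The rows are shuffled, but every feature row stays paired with its own label
print(np.array_equal(X_tr[:, 0], 2 * y_tr))
```

The second check is the important one: shuffling would be useless if the features and the labels were shuffled differently.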
Now we're ready to import and train the model.
# import the model class for LinearRegression
from sklearn.linear_model import LinearRegression

# create an object - a specific model we can actually train
model = LinearRegression()
To train the model we can simply call .fit() and pass it the data. Different models require different parameters, but in scikit-learn all models are trained through this same method.
Training the model will change the model object itself, saving everything the model has learned inside it.
# train the model
model.fit(X_train, y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
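What the model "has learned" is stored in two attributes of the fitted object: coef_ (one weight per feature) and intercept_. Here is a minimal sketch on made-up data where we know the right answer in advance. The names X_demo, y_demo and demo_model are placeholders, not part of the tutorial's code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 3*x1 - 2*x2 + 5 exactly
X_demo = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 5.0], [5.0, 3.0]])
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + 5

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# fit() stores the learned line inside the model object
print(demo_model.coef_)       # approximately [ 3. -2.]
print(demo_model.intercept_)  # approximately 5.0
```

On the housing model the same attributes are available as model.coef_ and model.intercept_, with one coefficient per column of X_train.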
Now that we have a trained model, we can pass it the testing data we have set aside and use it to generate predictions.
# Creating a list of predictions for every datapoint in testing data
predictions = model.predict(X_test)
Now we can look at our first prediction.
# .iloc[0] selects the first row of the test set, which pairs with predictions[0]
print('A house with these parameters: \n', X_test.iloc[0])
print('\n Will cost this much: \n', predictions[0])
A house with these parameters:
CRIM         0.00632
ZN          18.00000
INDUS        2.31000
CHAS         0.00000
NOX          0.53800
RM           6.57500
AGE         65.20000
DIS          4.09000
RAD          1.00000
TAX        296.00000
PTRATIO     15.30000
B          396.90000
LSTAT        4.98000
Name: 0, dtype: float64

Will cost this much:
28.5408021214125
Awesome, we're now using our model to predict new data points!
To check the accuracy of our model, we want to use all of the testing data and find out how close its predictions are to the actual results.
There are various ways to measure the error of a model; one of the most common is the root mean squared error (RMSE). An error is the difference between what the model has predicted and what it should have predicted. We square each error to make it positive (otherwise predictions that are too low would cancel out predictions that are too high), and then average the squared errors. Finally we take the square root of the result, so that the error is measured in the same units as the price (in this dataset, thousands of dollars).
from sklearn import metrics

# These functions will compare the predictions with correct results (y_test)
metrics.mean_absolute_error(y_test, predictions)
metrics.mean_squared_error(y_test, predictions)

# Take a square root to get the value in the same units as the price (thousands of dollars)
np.sqrt(metrics.mean_squared_error(y_test, predictions))
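If you want to see what these metrics do under the hood, the same numbers can be computed by hand with numpy. The four prices below are made up purely to keep the arithmetic easy to follow.

```python
import numpy as np

# Made-up prices (in thousands of dollars), NOT real model output
actual = np.array([20.0, 30.0, 25.0, 35.0])
predicted = np.array([22.0, 27.0, 25.0, 39.0])

errors = predicted - actual        # [ 2. -3.  0.  4.]

mae = np.mean(np.abs(errors))      # (2 + 3 + 0 + 4) / 4 = 2.25
mse = np.mean(errors ** 2)         # (4 + 9 + 0 + 16) / 4 = 7.25
rmse = np.sqrt(mse)                # about 2.69

print(mae, mse, rmse)
```

Notice that RMSE (about 2.69) is larger than MAE (2.25): squaring gives big mistakes more weight than small ones.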
We can also create a scatter plot to compare predictions with the correct results.
If our model were perfect, this plot would look like a straight line (our predictions would perfectly match the correct answers). Because the model is imperfect, the result is a bit noisy, but still good enough.
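The code for this plot is not shown in the text. A minimal sketch of one way to draw it is below, using made-up arrays in place of y_test and predictions. The dashed diagonal is an extra I added to mark where a perfect model's points would fall.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up stand-ins for y_test and predictions
actual = np.array([20.0, 30.0, 25.0, 35.0])
predicted = np.array([22.0, 27.0, 25.0, 39.0])

fig, ax = plt.subplots()
ax.scatter(actual, predicted)

# Diagonal y = x: points on this line are perfect predictions
lims = [min(actual.min(), predicted.min()), max(actual.max(), predicted.max())]
ax.plot(lims, lims, linestyle="--")

ax.set_xlabel("Actual price")
ax.set_ylabel("Predicted price")
```

With the tutorial's variables, the scatter call would become ax.scatter(y_test, predictions).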
Another helpful way of evaluating our results is plotting the distribution of errors:
# Subtract our predictions from the correct results to get the errors
# (distplot is deprecated since seaborn 0.11; sns.histplot(..., kde=True) is the newer equivalent)
sns.distplot((y_test - predictions))
When a linear model fits the data well, the errors are roughly normally distributed and centered around zero, which is what we have here. Basically this means that most of our predictions are close to the correct value (an error near 0), give or take some noise. This suggests that a linear model is a reasonable choice, and we can immediately see how well it predicts our results.
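You can also check the "centered around zero" part numerically instead of eyeballing the plot. The sketch below simulates a noisy but unbiased model on made-up data (the arrays and the noise level are assumptions for illustration, not results from the housing model) and looks at the mean and spread of the errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "true" prices and predictions that are off by random noise only
actual = rng.uniform(10, 50, size=200)
predicted = actual + rng.normal(0, 3, size=200)

residuals = actual - predicted

# For an unbiased model the mean error is close to 0
print("mean error:", residuals.mean())
print("spread (std):", residuals.std())

# A text-only histogram: most errors should fall in the middle bins
counts, edges = np.histogram(residuals, bins=10)
print(counts)
```

If the mean error were far from zero, the model would be consistently over- or under-predicting, which a plain histogram can make easy to miss.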
Congratulations, you have learned how to load the data, explore it, train your first model, and evaluate the results! I hope this tutorial was useful to you. You can download this post in Jupyter Notebook format here.