Preparing your data
Data preparation is an essential part of the machine learning process: it helps ML models find patterns and make predictions as effectively as possible.
When parts of the data are missing, or when features have vastly different scales, ML algorithms will not work well.
In this post you will learn the basics of data preparation:
- Dealing with missing data
- Scaling features
- Encoding categorical variables
Import the libraries
First let's import all the basic libraries we'll need right away:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Load the data
In this post we're going to use a classic dataset that describes housing prices in California. You can read more about it and download it here.
Let's load it:
data = pd.read_csv('../data/housing.csv')
data.head()
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity

0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
Usually data preparation is applied to features, while the target values remain intact (if you're going to predict housing prices, you want them to be in dollars). So we want to split the dataset into features and labels:
# All the columns except the one we want to predict
features = data.drop(['median_house_value'], axis=1)
# Only the column we want to predict
labels = data['median_house_value']
Dealing with missing data
ML algorithms can't work properly when some of the data is missing. Let's see if we have all the data:
features.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(8), object(1)
memory usage: 1.4+ MB
Looks like some values in the total_bedrooms column are missing (there are fewer non-null entries than in the other columns).
We have a few options for dealing with that:
- Get rid of all the examples with missing data
- Get rid of the whole total_bedrooms attribute
- Fill in the missing values (using the median number of bedrooms, for example)
To do that, we can use the DataFrame's .dropna(), .drop(), and .fillna() methods respectively.
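Here is a quick sketch of all three options on a tiny made-up frame (the values are invented for illustration, not taken from the housing data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing features (made-up values)
df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Option 1: drop every row that has a missing total_bedrooms value
option1 = df.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole attribute
option2 = df.drop("total_bedrooms", axis=1)

# Option 3: fill the holes with a statistic computed from the column
option3 = df["total_bedrooms"].fillna(df["total_bedrooms"].median())
```

Which option is best depends on how many values are missing and how important the attribute is; filling in the values keeps the most information.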
Let's fill in the missing values. Note that we'll need to save the median value calculated from the training data, so that later we can use it to replace missing values in new data.
median = features["total_bedrooms"].median()
features_complete = features["total_bedrooms"].fillna(median)
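scikit-learn's SimpleImputer class can handle this bookkeeping for us: it memorizes the statistic at fit time and reuses it on any data we transform later. A minimal sketch with hypothetical values:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training column and "new" data arriving later
train = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0]})
new_data = pd.DataFrame({"total_bedrooms": [np.nan, 250.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                    # learns the median (200.0) from training data
filled = imputer.transform(new_data)  # fills holes in new data with that same median
```

This way the training-time median travels with the imputer object instead of having to be stored by hand.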
Scaling features
Most machine learning algorithms prefer all features to have the same scale. If one feature has values between 0 and 1, and another has values between 1 and 1,000,000, you can see how that would skew all sorts of distance calculations, weights, etc.
We can bring all features to the same scale in two ways:

Normalization: rescaling all the values into the range between 0 and 1.

Standardization: transforming all the values so that the mean is 0 and the standard deviation is 1.
Standardization can't guarantee that all the values will fall into a specific range (which may be a problem for neural networks, for example), but on the upside it is less affected by outliers.
Normalization can be done using MinMaxScaler (or Normalizer, though note that Normalizer rescales each sample rather than each feature), and standardization is done using StandardScaler. All of these can be imported from sklearn.preprocessing.
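To see the difference between the two approaches, here is a small sketch on a toy column (the values, including the outlier, are made up):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [100.0]])  # toy column with an outlier

normalized = MinMaxScaler().fit_transform(x)      # every value lands in [0, 1]
standardized = StandardScaler().fit_transform(x)  # mean 0, std 1, but unbounded
```

With normalization the outlier pins the range, squeezing the first three values near 0; with standardization the values stay unbounded but centered.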
Let's standardize our features:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# StandardScaler expects a 2D array, so we reshape the single
# column into shape (n_samples, 1) before passing it in
scaler.fit(features_complete.values.reshape(-1, 1))
features_scaled = scaler.transform(features_complete.values.reshape(-1, 1))
features_scaled
array([[-0.97247648],
       [ 1.35714343],
       [-0.82702426],
       ...,
       [-0.12360781],
       [-0.30482697],
       [ 0.18875678]])
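In practice you would usually scale all numeric columns at once rather than one column at a time; StandardScaler handles 2D input and scales each column independently. A sketch with hypothetical values:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric columns; each one gets its own mean and std
df = pd.DataFrame({"total_rooms": [880.0, 7099.0, 1467.0],
                   "median_income": [8.3252, 8.3014, 7.2574]})
scaled = StandardScaler().fit_transform(df)  # shape (3, 2), one scale per column
```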
Dealing with Categorical features
We have one categorical feature in our dataset: ocean_proximity. Instead of a continuous value, it has just a few text labels:
print(features['ocean_proximity'].unique())
['NEAR BAY' '<1H OCEAN' 'INLAND' 'NEAR OCEAN' 'ISLAND']
Since ML algorithms prefer to deal with numbers, we want to transform the strings in some way.
First, we can use LabelEncoder to automatically turn all five categories into numbers from 0 to 4:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
ocean_proximity = features["ocean_proximity"]
ocean_proximity_encoded = encoder.fit_transform(ocean_proximity)
ocean_proximity_encoded
array([3, 3, 3, ..., 1, 1, 1])
Now we have just one problem: our algorithms will naturally assume that numbers that are closer to each other (like 1 and 2) are more similar than numbers that are farther apart (like 1 and 4), but this doesn't make sense for categorical variables.
To fix this, we can use something called "one-hot encoding", which represents each number as an array where all the elements are zero, except for the one at the position of our number. For example, it would turn 3 into [0, 0, 0, 1, 0] and 0 into [1, 0, 0, 0, 0].
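The idea is simple enough to sketch by hand with NumPy (the codes below are made up for illustration):

```python
import numpy as np

codes = np.array([3, 3, 1, 0])  # made-up label-encoded categories
n_categories = 5

# Start from all zeros, then set a 1 at each row's category position
one_hot = np.zeros((len(codes), n_categories))
one_hot[np.arange(len(codes)), codes] = 1.0
```

Each row now has exactly one 1, so no two categories are "closer" than any other pair.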
To do that, we can use OneHotEncoder: it will take our array of numbers and return a matrix where each number is represented as a vector:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
ocean_proximity_1hot = encoder.fit_transform(ocean_proximity_encoded.reshape(-1, 1))
ocean_proximity_1hot
<20640x5 sparse matrix of type '<class 'numpy.float64'>'
with 20640 stored elements in Compressed Sparse Row format>
Because most of the values in the resulting matrix are zeros, it is represented as a "sparse matrix" (a format that stores such matrices efficiently).
To view its values, we can use the .toarray() method:
ocean_proximity_1hot.toarray()
array([[0., 0., 0., 1., 0.],
[0., 0., 0., 1., 0.],
[0., 0., 0., 1., 0.],
...,
[0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 1., 0., 0., 0.]])
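As a side note, pandas can do the label-encoding and one-hot steps in one call with get_dummies, which works directly on the text labels. A minimal sketch on made-up values:

```python
import pandas as pd

# A few made-up ocean_proximity labels
ocean = pd.Series(["NEAR BAY", "INLAND", "NEAR BAY"])
dummies = pd.get_dummies(ocean)  # one indicator column per category
```

The result is a regular DataFrame with one column per category, which can be handier than a sparse matrix when the number of categories is small.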
Summary
In this post you have learned how to fill in missing data, scale features, and encode categorical variables. That's a great place to start. To learn more about data preparation, I recommend looking into pipelines (which make this process more streamlined) and feature engineering (which lets you create new features based on existing data). But those are topics for future posts.