Data Preparation | Digital Mind


Data preparation is an essential part of the machine learning process: it helps ML models find patterns and make predictions effectively.

When parts of the data are missing, or when features have vastly different scales, ML algorithms will not work well.

In this post you will learn the basics of data preparation:

  • Dealing with missing data
  • Scaling features
  • Encoding categorical variables

Import the libraries

First let's import all the basic libraries we'll need right away:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Load the data

In this project we're going to use a classic dataset that describes housing prices in California. You can read more about it and download it here.

Let's load it:

data = pd.read_csv('../data/housing.csv')
data.head()

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0        NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0        NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0        NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0        NEAR BAY

Usually data preparation is applied to features, while the target values remain intact (if you're going to predict housing prices, you want them to be in dollars). So we want to split the dataset into features and labels:

# All the columns except the one we want to predict
features = data.drop(['median_house_value'], axis=1)
# Only the column we want to predict
labels = data['median_house_value']

Dealing with missing data

ML algorithms can't work properly when some of the data is missing. Let's check whether anything is missing using .info():

features.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(8), object(1)
memory usage: 1.4+ MB

Looks like some values in the total_bedrooms column are missing (there are fewer non-null entries than in the other columns).

We have a few options for dealing with that:

  • Get rid of all the examples with missing data
  • Get rid of the whole total_bedrooms attribute
  • Fill in the missing values (using the average number of bedrooms for example)

To do that, we can use DataFrame's .dropna(), .drop(), and .fillna() methods respectively.
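As a quick sketch of the three options (on a tiny made-up DataFrame, not the real dataset):

```python
import numpy as np
import pandas as pd

# A tiny stand-in for the real features DataFrame (values made up)
features = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Option 1: drop every row that has a missing total_bedrooms
option1 = features.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole attribute
option2 = features.drop("total_bedrooms", axis=1)

# Option 3: fill missing values with a statistic (here the median)
median = features["total_bedrooms"].median()
option3 = features["total_bedrooms"].fillna(median)
```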

Let's fill in the missing values. Note that we need to save the median value we calculate on the training data, so that later we can use the same value to replace missing values in new data.

median = features["total_bedrooms"].median()
features_complete = features["total_bedrooms"].fillna(median)
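A convenient alternative, not used in this post, is scikit-learn's SimpleImputer, which remembers the statistic it learned during fit so you can apply the exact same value to new data later. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

# Fit on the training data: the imputer stores the learned median
train = np.array([[1.0], [np.nan], [3.0]])
imputer.fit(train)

# Later, apply that same stored median to new, unseen data
new_data = np.array([[np.nan], [10.0]])
filled = imputer.transform(new_data)
```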

Scaling features

Most machine learning algorithms prefer all features to have the same scale. If one feature has values between 0 and 1, and another has values between 1 and 1,000,000, you can see how that would skew distance calculations, learned weights, and so on.

We can bring all features to the same scale in two ways:

  • Normalization - rescaling all the values into the range between 0 and 1.

  • Standardization - transforming all the values so that the mean is 0 and the standard deviation is 1.

Standardization can't guarantee that all the values will fall into a specific range (which may be a problem for neural networks, for example), but on the upside it's less affected by outliers.

Normalization can be done using MinMaxScaler, and standardization is done using StandardScaler; both can be imported from sklearn.preprocessing. (Note that sklearn's Normalizer is a different tool: it rescales each sample to unit norm, rather than each feature to a fixed range.)
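To see the difference on a toy column with one outlier (made-up numbers):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100 is an outlier

# Normalization: everything lands in [0, 1], but the outlier
# crushes the other three values toward 0
x_minmax = MinMaxScaler().fit_transform(x)

# Standardization: mean 0, std 1, but values aren't bounded
x_std = StandardScaler().fit_transform(x)
```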

Let's standardize our features:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# StandardScaler expects a 2D array, so we reshape the single
# column into shape (n_samples, 1) before passing it in.
# We use fit_transform() so the scaler learns the mean and
# standard deviation before applying them.
features_scaled = scaler.fit_transform(features_complete.values.reshape(-1, 1))
array([...,
       [ 1.35714343],
       [ 0.18875678]])
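Just like with the median above, the scaler's statistics should be learned from the training data only and then reused on new data. A minimal sketch of that split (made-up numbers):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Fit on training data only: the scaler stores the mean and std
train = np.array([[1.0], [2.0], [3.0]])
scaler.fit(train)

# New data is transformed with the *training* mean and std,
# not with statistics computed from the new data itself
test = np.array([[2.0], [4.0]])
test_scaled = scaler.transform(test)
```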

Dealing with categorical features

We have one categorical feature in our dataset - ocean_proximity. Instead of a continuous value it has just a few text labels:
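The output that belongs here is missing; in this dataset the column has five categories: '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', and 'NEAR OCEAN'. You can list them with .value_counts() (sketched on a toy sample here, since the real counts come from your copy of the data):

```python
import pandas as pd

# Toy stand-in for features["ocean_proximity"] (real counts will differ)
ocean_proximity = pd.Series(
    ["NEAR BAY", "NEAR BAY", "INLAND", "<1H OCEAN", "ISLAND", "NEAR OCEAN"]
)
counts = ocean_proximity.value_counts()
```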


Since ML algorithms prefer to deal with numbers, we want to transform the strings in some way.

First, we can use LabelEncoder to automatically turn the five categories into numbers from 0 to 4:

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
ocean_proximity = features["ocean_proximity"]
ocean_proximity_encoded = encoder.fit_transform(ocean_proximity)
array([3, 3, 3, ..., 1, 1, 1])

Now we have just one problem: our algorithms will naturally assume that numbers that are closer to each other (like 1 and 2) are more similar than numbers that are farther apart (like 0 and 4), but this doesn't make sense for categorical variables.

To fix this, we can use something called "One-Hot Encoding", which represents each number as an array where all the elements are zero except the one at our number's position. For example, with five categories it turns 0 into [1,0,0,0,0] and 3 into [0,0,0,1,0].

To do that, we can use OneHotEncoder, which takes our array of numbers and returns a matrix where each number is represented as a vector:

from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
ocean_proximity_1hot = encoder.fit_transform(ocean_proximity_encoded.reshape(-1,1))
<20640x5 sparse matrix of type '<class 'numpy.float64'>'
    with 20640 stored elements in Compressed Sparse Row format>

Because most of the values in the resulting matrix are 0's, it is represented as a "sparse matrix" (a format that stores such matrices efficiently).

To view its values we can use the .toarray() method:

array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])
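As a side note, recent versions of scikit-learn make the LabelEncoder detour unnecessary: OneHotEncoder accepts string categories directly. A sketch on a toy array:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# String categories can go straight into OneHotEncoder
ocean_proximity = np.array([["NEAR BAY"], ["INLAND"], ["NEAR BAY"]])

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(ocean_proximity).toarray()
```

pandas users can get a similar result in one line with pd.get_dummies(features["ocean_proximity"]).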


In this post you have learned how to fill in missing data, scale features, and encode categorical variables. That's a great place to start. To learn more about data preparation, I recommend looking into pipelines (which make this process more streamlined) and feature engineering (which lets you create new features from existing data). But those are topics for future posts.

Subscribe to my weekly newsletter to receive updates on new posts: