Data preparation is an essential part of the machine learning process: it helps ML models find patterns and make predictions effectively.
When parts of the data are missing, or when features have vastly different scales, ML algorithms will not work well.
In this post you will learn the basics of data preparation:
- Dealing with missing data
- Scaling features
- Encoding categorical variables
First let's import all the basic libraries we'll need right away:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
In this project we're going to use a classic dataset that describes housing prices in California. You can read more about it and download it here.
Let's load it:
data = pd.read_csv('../data/housing.csv')
data.head()
Usually data preparation is applied to features, while the target values remain intact (if you're going to predict housing prices, you want them to be in dollars). So we want to split the dataset into features and labels:
# All the columns except the one we want to predict
features = data.drop(['median_house_value'], axis=1)

# Only the column we want to predict
labels = data['median_house_value']
ML algorithms can't work properly when some of the data is missing. Let's see if we have all the data:
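The summary below presumably comes from DataFrame's .info() method, which reports the number of non-null values in every column:

features.info()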
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 9 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(8), object(1)
memory usage: 1.4+ MB
Looks like some values in the total_bedrooms column are missing (there are fewer non-null entries in it than in the other columns).
We have a few options for dealing with that:
- Get rid of all the examples with missing data
- Get rid of the whole feature (column)
- Fill in the missing values (using the median number of bedrooms, for example)
To do that, we can use DataFrame's .dropna(), .drop(), and .fillna() methods respectively.
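Here is a minimal sketch of all three options (each call returns a new object rather than modifying features in place):

# Option 1: drop the rows where total_bedrooms is missing
features.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole column
features.drop("total_bedrooms", axis=1)

# Option 3: fill the missing values with the column's median
features["total_bedrooms"].fillna(features["total_bedrooms"].median())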
Let's fill in the missing values. Note that we'll need to save the median value we calculated on the training data, so that later we can use it to replace missing values in new data.
# Compute the median on the training data and keep it around for later
median = features["total_bedrooms"].median()
features_complete = features["total_bedrooms"].fillna(median)
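Later, the saved median can be reused to fill gaps in previously unseen data - for example, with a hypothetical new_data DataFrame that has the same columns:

# new_data is hypothetical - any fresh data with a total_bedrooms column
new_data["total_bedrooms"] = new_data["total_bedrooms"].fillna(median)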
Most machine learning algorithms prefer all features to have the same scale. If one feature has values between 0 and 1, and another has values between 1 and 1,000,000, you can see how that would skew all sorts of distance calculations, weights, etc.
We can bring all features to the same scale in two ways:
Normalization - rescaling all the values into the range between 0 and 1: x_scaled = (x - min) / (max - min).
Standardization - transforming all the values so that the mean becomes 0 and the standard deviation becomes 1: z = (x - mean) / std.
Standardization can't guarantee that all the values will fall into a specific range (which may be a problem for neural networks, for example), but on the upside it's less affected by outliers.
Normalization can be done using MinMaxScaler, and standardization using StandardScaler. Both can be imported from sklearn.preprocessing. (Note that scikit-learn also has a class called Normalizer, but it does something different: it rescales each row to unit norm, not each feature into a range.)
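For comparison, here's a minimal sketch of what normalizing the same column might look like (we'll use standardization below):

from sklearn.preprocessing import MinMaxScaler

minmax_scaler = MinMaxScaler()
# MinMaxScaler also expects a 2D array, hence the reshape
features_normalized = minmax_scaler.fit_transform(features_complete.values.reshape(-1, 1))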
Let's standardize our features:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Because of the way StandardScaler works we need to reshape
# features into a 2D array before passing them to it
scaler.fit(features_complete.values.reshape(-1, 1))
features_scaled = scaler.transform(features_complete.values.reshape(-1, 1))
features_scaled
array([[-0.97247648],
       [ 1.35714343],
       [-0.82702426],
       ...,
       [-0.12360781],
       [-0.30482697],
       [ 0.18875678]])
We have one categorical feature in our dataset - ocean_proximity. Instead of a continuous value it has just a few text labels:
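The list below was presumably obtained with a call like:

features["ocean_proximity"].unique()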
['NEAR BAY' '<1H OCEAN' 'INLAND' 'NEAR OCEAN' 'ISLAND']
Since ML algorithms prefer to deal with numbers, we want to transform the strings in some way.
First, we can use LabelEncoder to automatically turn all five categories into numbers from 0 to 4:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
ocean_proximity = features["ocean_proximity"]
ocean_proximity_encoded = encoder.fit_transform(ocean_proximity)
ocean_proximity_encoded
array([3, 3, 3, ..., 1, 1, 1])
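To check which number was assigned to which category, you can inspect the fitted encoder's classes_ attribute (a label's position in this array is its assigned number):

encoder.classes_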
Now we have just one problem: our algorithms will naturally assume that numbers that are closer to each other (like 1 and 2) are more similar than numbers that are farther apart (like 0 and 4), but this doesn't make sense for categorical variables.
To fix this, we can use something called "One-Hot Encoding", which represents each number as an array where all the elements are zero, except for the one at our number's position. For example, it would turn 2 into [0, 0, 1, 0, 0] and 4 into [0, 0, 0, 0, 1].
To do that, we can use OneHotEncoder: it will take our array of numbers and return a matrix where each number is represented as a vector:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
ocean_proximity_1hot = encoder.fit_transform(ocean_proximity_encoded.reshape(-1, 1))
ocean_proximity_1hot
<20640x5 sparse matrix of type '<class 'numpy.float64'>' with 20640 stored elements in Compressed Sparse Row format>
Because most of the values in the resulting matrix are 0's, it is represented as a "sparse matrix" (a format that can store such matrices efficiently).
To view its values, we can use the .toarray() method:
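# Presumed call: convert the sparse matrix into a regular (dense) NumPy array
ocean_proximity_1hot.toarray()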
array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])
In this post you have learned how to fill in missing data, scale features, and encode categorical variables. That's a great place to start. To learn more about data preparation, I recommend reading about pipelines (which make this process more efficient) and feature engineering (which lets you create new features based on existing data). But those are topics for my future posts.