Analyzing Data with Descriptive Statistics
The first step in any data science or machine learning project is to look at your data, analyze it, and see what insights you can gain from it before applying any algorithms. That will help you understand what you're dealing with, pick the right model, and achieve the best results.
In this post I will describe the basic methods for descriptive data analysis (not including data visualization, which will be a separate post).
Import the libraries
Let's import the libraries we will need right away:
import pandas as pd
Load the data
As always, the first step is to load the data. In this project we will use one of the classic datasets, which can be used to predict the onset of diabetes. You can download the file here.
Before loading the file, I've created a column_names variable that will be used to set the column names.

column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv('../data/pima-indians-diabetes.csv', names=column_names)
Look at your data
Excellent! The first step in data analysis is just looking at your raw data so you can understand what you're dealing with: the purpose of each column, what the values mean, and what they look like (which will give you ideas about how the features can be used and how the data may need to be transformed).
You can view the first 10 rows of your data by using data.head(10).
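As a quick sketch of how .head() behaves, here is a tiny hand-made frame standing in for the real dataset (the values below are made up for illustration):

```python
import pandas as pd

# A small stand-in for the diabetes data, just to illustrate .head()
data = pd.DataFrame({
    'Glucose': [148, 85, 183, 89, 137],
    'BMI': [33.6, 26.6, 23.3, 28.1, 43.1],
    'Outcome': [1, 0, 1, 0, 1],
})

print(data.head(3))  # the first 3 rows; data.head(10) works the same way
```

Calling .head() with no argument returns the first 5 rows by default.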
To find out how many rows and columns your dataset has, you can use .shape, which prints out the dimensions of your data.
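.shape returns a (rows, columns) tuple, so you can also unpack it; a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

print(df.shape)  # (3, 2): 3 rows, 2 columns
n_rows, n_cols = df.shape
```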
To find out the datatype of each column, use .dtypes:
Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object
Alternatively, you can use the pandas .info() method, which displays the number of rows and columns and the data type of each column. It is also useful to look at the number of non-null values: if it doesn't match the number of rows, some data points are missing.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
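To see how non-null counts expose missing data, here is a toy frame with deliberate gaps (not the diabetes data, which has no nulls):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'complete': [1, 2, 3, 4],
    'with_gaps': [1.0, np.nan, 3.0, np.nan],
})

df.info()  # 'with_gaps' shows only 2 non-null values out of 4 entries

# The same check done programmatically, per column:
print(df.isnull().sum())
```

A non-null count below the row count is usually the first hint that a column needs imputation or cleaning.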
As you can see, this dataset is pretty small: 768 examples and 9 columns containing integer and float values.
The .describe() method shows the statistical properties of each column:
# Before running describe, set display precision to 3
# to reduce the number of digits after the decimal point
pd.set_option('display.precision', 3)
data.describe()
- count is the number of data points in the column
- mean is the average value
- std is the standard deviation, which shows how dispersed the values are: the average distance between the data points and the mean
- min and max are self-explanatory
- the 25%, 50%, and 75% rows show the corresponding percentiles
A percentile shows the value below which a given percentage of observations falls. For example, 25% of people have glucose measurements below 99, 50% have glucose below 117, and 75% have glucose below 140.
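You can also compute any percentile directly with .quantile(); a small sketch with made-up glucose values chosen so the quartiles match the numbers above:

```python
import pandas as pd

glucose = pd.Series([80, 90, 99, 110, 117, 125, 140, 155, 170])

print(glucose.quantile(0.25))               # 99.0, the 25th percentile
print(glucose.quantile([0.25, 0.5, 0.75]))  # several percentiles at once
```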
We can count the number of values in each class to see whether they're evenly distributed. Here we have only two classes: Outcome is 1 when a person has diabetes and 0 when they don't.
We can calculate the number of rows with each value using data['Outcome'].value_counts():
0    500
1    268
Name: Outcome, dtype: int64
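.value_counts() also accepts normalize=True, which returns class proportions instead of raw counts; a sketch with a toy Outcome column:

```python
import pandas as pd

outcome = pd.Series([0, 0, 0, 1, 1])

print(outcome.value_counts())                # counts per class
print(outcome.value_counts(normalize=True))  # fractions: 0.6 vs 0.4
```

Proportions make it easier to see at a glance how imbalanced the classes are (here, roughly 65% vs 35% in the real dataset).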
When two attributes depend on each other, they are correlated. You can compute the standard correlation coefficient (also called Pearson's correlation coefficient) for every pair of features using .corr().
A correlation of 0 means the attributes are not correlated at all; -1 is a strong negative correlation, and 1 is a strong positive one.
As you can see, every attribute perfectly correlates with itself, and the rest of the correlations vary.
If you look at the last column, you can see that the level of Glucose is the most strongly correlated with a positive outcome (the person has diabetes), which makes sense.
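To focus on exactly this, you can select just the Outcome column of the correlation matrix and sort it; sketched here on a toy frame with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    'Glucose': [148, 85, 183, 89, 137, 116],
    'Age': [50, 31, 32, 21, 33, 30],
    'Outcome': [1, 0, 1, 0, 1, 0],
})

corr = df.corr()

# Correlation of every attribute with the target, strongest first
print(corr['Outcome'].sort_values(ascending=False))
```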
When a normal (bell-curve) distribution has its peak shifted to the left or to the right, the distribution is skewed. If there is a long tail of values to the right of the peak (the top of the bell shifted to the left), the distribution is skewed right.
To calculate the skewness of each attribute, simply use .skew().
Pregnancies                 0.902
Glucose                     0.174
BloodPressure              -1.844
SkinThickness               0.109
Insulin                     2.272
BMI                        -0.429
DiabetesPedigreeFunction    1.920
Age                         1.130
Outcome                     0.635
dtype: float64
As you can see, Pregnancies is strongly skewed to the right: most people have 0-1 pregnancies, and the count trails off as the number of pregnancies increases.
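One common way to reduce right skew (not covered in the original analysis, and whether it helps depends on your model) is a log transform; np.log1p handles the zeros in a column like Pregnancies. A sketch on a made-up right-skewed sample:

```python
import pandas as pd
import numpy as np

# A right-skewed sample: many small values, a long tail to the right
pregnancies = pd.Series([0, 0, 1, 1, 1, 2, 2, 3, 5, 8, 13])

print(pregnancies.skew())            # positive -> skewed right
print(np.log1p(pregnancies).skew())  # closer to 0 after log1p
```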
This covers the most common methods for analyzing a new dataset. You can use them to gain a deeper understanding of your data and to inform future decisions about data preparation, model choice, and so on.
Try to analyze what these numbers mean, gain insights from this information, and write down the things you have learned.