Analyzing Data with Descriptive Statistics
The first step in any data science or machine learning project is to look at your data, analyze it, and see what insights you can gain from it before applying any algorithms. That will help you understand what you're dealing with, pick the right model, and achieve the best results.
In this post I will describe the basic methods of descriptive data analysis (not including data visualization, which will be a separate post).
Import the libraries
Let's import the libraries we will need right away:
import pandas as pd
Load the data
As always, the first step is to load the data. In this project we will use one of the classic datasets, which can be used to predict the onset of diabetes. You can download the file here.
Before loading the file, I've created a column_names variable that will be used to set column names.
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv('../data/pima-indians-diabetes.csv', names=column_names)
Look at your data
Excellent! The first step in data analysis is simply looking at your raw data so you understand what you're dealing with: the purpose of each column, what its values mean, and what they look like (which will give you ideas about how features can be used and how the data may need to be transformed).
You can view the first 10 rows of your data using the .head(10) method:
data.head(10)
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |
To find out how many rows and columns your dataset has, you can use .shape, which returns the dimensions of your data as a (rows, columns) tuple.
data.shape
(768, 9)
To find out the data type of each column, use .dtypes:
data.dtypes
Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object
Alternatively, you can use the pandas .info() method, which displays the number of rows and columns and the data type of each column. It is also useful for spotting missing data: if a column's non-null count is lower than the number of rows, some data points are missing.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies 768 non-null int64
Glucose 768 non-null int64
BloodPressure 768 non-null int64
SkinThickness 768 non-null int64
Insulin 768 non-null int64
BMI 768 non-null float64
DiabetesPedigreeFunction 768 non-null float64
Age 768 non-null int64
Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
As you can see, this dataset is pretty small: we have 768 examples and 9 columns that contain integer and float values.
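Note that .info() reports no nulls, yet in this dataset zeros in columns like Glucose or BloodPressure are physiologically impossible and often stand in for missing measurements. A minimal sketch of how to count them, using a tiny made-up frame in place of the real data:

```python
import pandas as pd

# A tiny synthetic frame standing in for the diabetes data
# (column names match the dataset; the values are made up).
df = pd.DataFrame({
    'Glucose': [148, 85, 0, 89],
    'BloodPressure': [72, 66, 64, 0],
    'BMI': [33.6, 26.6, 0.0, 28.1],
})

# Compare each cell to zero, then sum the True values per column
zero_counts = (df == 0).sum()
print(zero_counts)
```

On the real data you would run the same expression on `data` (excluding columns like Pregnancies and Outcome, where zero is a legitimate value).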
Descriptive Statistics
The .describe()
method shows you statistical properties of each column:
# Before running describe, set the display precision to 3
# to reduce the number of digits after the decimal point
pd.set_option('display.precision', 3)
data.describe()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000 | 768.000 | 768.000 | 768.000 | 768.000 | 768.000 | 768.000 | 768.000 | 768.000 |
| mean | 3.845 | 120.895 | 69.105 | 20.536 | 79.799 | 31.993 | 0.472 | 33.241 | 0.349 |
| std | 3.370 | 31.973 | 19.356 | 15.952 | 115.244 | 7.884 | 0.331 | 11.760 | 0.477 |
| min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.078 | 21.000 | 0.000 |
| 25% | 1.000 | 99.000 | 62.000 | 0.000 | 0.000 | 27.300 | 0.244 | 24.000 | 0.000 |
| 50% | 3.000 | 117.000 | 72.000 | 23.000 | 30.500 | 32.000 | 0.372 | 29.000 | 0.000 |
| 75% | 6.000 | 140.250 | 80.000 | 32.000 | 127.250 | 36.600 | 0.626 | 41.000 | 1.000 |
| max | 17.000 | 199.000 | 122.000 | 99.000 | 846.000 | 67.100 | 2.420 | 81.000 | 1.000 |
- count is the number of data points in the column
- mean is the average value
- std is the standard deviation, which shows how dispersed the values are (roughly, the typical distance between the data points and the mean)
- min and max are self-explanatory
- the 25%, 50%, and 75% rows show the corresponding percentiles
A percentile is the value below which a given percentage of observations falls. For example, 25% of people have glucose measurements below 99, 50% have glucose below 117, and 75% have glucose below 140.
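If you need percentiles other than the ones .describe() prints, the .quantile() method accepts arbitrary cut points. A small sketch with made-up glucose readings (evenly spaced, so the percentiles are easy to check by hand):

```python
import pandas as pd

# Hypothetical glucose readings, standing in for data['Glucose']
glucose = pd.Series([80, 90, 100, 110, 120, 130, 140, 150, 160])

# The same 25th/50th/75th percentiles that .describe() reports;
# any fraction between 0 and 1 works here, e.g. 0.9 for the 90th
q = glucose.quantile([0.25, 0.5, 0.75])
print(q)
```

The 50% quantile is the median; with these nine evenly spaced values, the three cut points land on 100, 120, and 140.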
Class distribution
We can count the number of values in each class to see if they're evenly distributed. Here we have only two classes: Outcome is 1 when a person has diabetes, and 0 when they don't.
We can calculate the number of rows with each value using .value_counts()
data['Outcome'].value_counts()
0 500
1 268
Name: Outcome, dtype: int64
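With roughly 500 negatives to 268 positives, the classes are noticeably imbalanced. Passing normalize=True to .value_counts() turns raw counts into proportions, which makes the imbalance easier to read at a glance. A sketch with a small made-up Outcome column:

```python
import pandas as pd

# Synthetic Outcome column with a 7-to-3 class imbalance
outcome = pd.Series([0, 0, 0, 1, 1, 0, 0, 1, 0, 0])

# normalize=True divides each count by the total number of rows
proportions = outcome.value_counts(normalize=True)
print(proportions)
```

On the real data, `data['Outcome'].value_counts(normalize=True)` would show about 65% non-diabetic to 35% diabetic.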
Correlations
When two attributes depend on each other, they are correlated. You can compute the standard correlation coefficient (also called Pearson’s correlation coefficient) for every pair of features using the .corr() method:
data.corr()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 1.000 | 0.129 | 0.141 | -0.082 | -0.074 | 0.018 | -0.034 | 0.544 | 0.222 |
| Glucose | 0.129 | 1.000 | 0.153 | 0.057 | 0.331 | 0.221 | 0.137 | 0.264 | 0.467 |
| BloodPressure | 0.141 | 0.153 | 1.000 | 0.207 | 0.089 | 0.282 | 0.041 | 0.240 | 0.065 |
| SkinThickness | -0.082 | 0.057 | 0.207 | 1.000 | 0.437 | 0.393 | 0.184 | -0.114 | 0.075 |
| Insulin | -0.074 | 0.331 | 0.089 | 0.437 | 1.000 | 0.198 | 0.185 | -0.042 | 0.131 |
| BMI | 0.018 | 0.221 | 0.282 | 0.393 | 0.198 | 1.000 | 0.141 | 0.036 | 0.293 |
| DiabetesPedigreeFunction | -0.034 | 0.137 | 0.041 | 0.184 | 0.185 | 0.141 | 1.000 | 0.034 | 0.174 |
| Age | 0.544 | 0.264 | 0.240 | -0.114 | -0.042 | 0.036 | 0.034 | 1.000 | 0.238 |
| Outcome | 0.222 | 0.467 | 0.065 | 0.075 | 0.131 | 0.293 | 0.174 | 0.238 | 1.000 |
A correlation of 0 means the attributes are not correlated at all; -1 is a perfect negative correlation, and 1 is a perfect positive one.
As you can see, every attribute correlates perfectly with itself, and the rest of the correlations vary.
If you look at the last column, you can see that the Glucose level has the strongest correlation with a positive outcome (the person has diabetes), which makes sense.
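A handy trick is to pull out just the target's column of the correlation matrix and sort it, so the features most correlated with the outcome float to the top. A sketch on a small made-up frame (on the real data you'd use `data.corr()['Outcome']`):

```python
import pandas as pd

# Small synthetic frame standing in for the full dataset:
# Glucose is constructed to track the Outcome column
df = pd.DataFrame({
    'Glucose': [148, 85, 183, 89, 137, 116],
    'Age':     [50, 31, 32, 21, 33, 30],
    'Outcome': [1, 0, 1, 0, 1, 0],
})

# Take the Outcome column of the correlation matrix
# and sort it from the strongest positive correlation down
corr_with_outcome = df.corr()['Outcome'].sort_values(ascending=False)
print(corr_with_outcome)
```

Outcome itself always sits at the top with a value of 1.0; the interesting entries are the ones right below it.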
Skewness
When a bell-curve (normal-looking) distribution has its peak shifted to the left or to the right, the distribution is skewed. If there is a long tail of outliers to the right of the peak (the top of the bell pushed to the left), the distribution is skewed right, also called positive skew.
To calculate the skewness of each attribute, simply use the .skew() method:
data.skew()
Pregnancies 0.902
Glucose 0.174
BloodPressure -1.844
SkinThickness 0.109
Insulin 2.272
BMI -0.429
DiabetesPedigreeFunction 1.920
Age 1.130
Outcome 0.635
dtype: float64
As you can see, Pregnancies is strongly skewed to the right: most people in the dataset have 0-1 pregnancies, and the count trails off as the number of pregnancies increases.
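Strong right skew like this is often tamed with a log transform before modeling. A minimal sketch using made-up insulin-like values; np.log1p computes log(1 + x), which handles the zeros in the data (plain log(0) is undefined):

```python
import numpy as np
import pandas as pd

# Strongly right-skewed synthetic values, shaped like the
# Insulin column (many zeros, a few very large readings)
insulin = pd.Series([0, 0, 0, 15, 30, 50, 90, 168, 543, 846], dtype=float)

# log1p compresses the long right tail while leaving zeros at 0
log_insulin = np.log1p(insulin)

print(insulin.skew(), log_insulin.skew())
```

Comparing the two skew values shows the transform pulling the distribution much closer to symmetric.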
Conclusion
This covers the most common methods for analyzing a new dataset. You can use them to gain a deeper understanding of your data and to inform later decisions: how to prepare the data, which model to choose, and so on.
Try to work out what these numbers mean, gain insights from them, and write down the things you have learned.