Analyzing Data with Descriptive Statistics

The first step in any data science or machine learning project is to look at your data, analyze it, and see what insights you can gain from it before applying any algorithms. That will help you understand what you're dealing with, pick the right model, and achieve the best results.

In this post I will describe the basic methods for descriptive data analysis (not including data visualization, which will be a separate post).

Import the libraries

Let's import the libraries we will need right away:

import pandas as pd

Load the data

As always, the first step is to load the data. In this project we will use one of the classic datasets, which can be used to predict the onset of diabetes. You can download the file here.

Before loading the file, I've created a column_names variable that will be used to set column names.

column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv('../data/pima-indians-diabetes.csv', names=column_names)

Look at your data

Excellent! The first step in data analysis is just looking at your raw data so you can understand what you're dealing with. Understand the purpose of each column, what the values mean, and what they look like (which will give you ideas of how features can be used, and how the data may need to be transformed).

You can view the first 10 rows of your data by using the .head(10) method:

data.head(10)
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
5 5 116 74 0 0 25.6 0.201 30 0
6 3 78 50 32 88 31.0 0.248 26 1
7 10 115 0 0 0 35.3 0.134 29 0
8 2 197 70 45 543 30.5 0.158 53 1
9 8 125 96 0 0 0.0 0.232 54 1

To find out how many rows and columns your dataset has, use .shape; it returns the dimensions of your data.

data.shape
(768, 9)

To find out the data type of each column, use .dtypes:

data.dtypes
Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

Alternatively, you can use the pandas .info() method, which displays the number of rows and columns and the data type of each column. It is also useful to look at the number of non-null values: if it doesn't match the number of rows, some data points are missing.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

As you can see, this dataset is pretty small, we have 768 examples and 9 columns that contain integer and float values.
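If the non-null counts don't match the number of rows, you can count the missing values per column directly with .isnull().sum(). A minimal sketch on a toy frame (in the post you would call this on data itself):

```python
import numpy as np
import pandas as pd

# A tiny frame with one deliberately missing Glucose value
df = pd.DataFrame({'Glucose': [148, np.nan, 183], 'Age': [50, 31, 32]})

# Count the missing values in each column
missing = df.isnull().sum()
```

Here missing['Glucose'] is 1 and missing['Age'] is 0.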

Descriptive Statistics

The .describe() method shows you statistical properties of each column:

# Before running describe, set display precision to 3 to reduce the number of digits after the decimal point
pd.set_option('display.precision', 3)
data.describe()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000 768.000 768.000 768.000 768.000 768.000 768.000 768.000 768.000
mean 3.845 120.895 69.105 20.536 79.799 31.993 0.472 33.241 0.349
std 3.370 31.973 19.356 15.952 115.244 7.884 0.331 11.760 0.477
min 0.000 0.000 0.000 0.000 0.000 0.000 0.078 21.000 0.000
25% 1.000 99.000 62.000 0.000 0.000 27.300 0.244 24.000 0.000
50% 3.000 117.000 72.000 23.000 30.500 32.000 0.372 29.000 0.000
75% 6.000 140.250 80.000 32.000 127.250 36.600 0.626 41.000 1.000
max 17.000 199.000 122.000 99.000 846.000 67.100 2.420 81.000 1.000
  • count is the number of data points in the column
  • mean is the average value
  • std is the standard deviation, which shows how dispersed the values are: roughly, the typical distance between the data points and the mean
  • min and max are self-explanatory
  • 25%, 50%, and 75% rows show the corresponding percentiles.
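The std row reported by .describe() is the sample standard deviation (pandas uses ddof=1 by default). A small sketch of the difference from the population version, on made-up numbers:

```python
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Sample standard deviation (ddof=1) -- what .describe() reports
std_sample = values.std()

# Population standard deviation (ddof=0) for comparison
std_population = values.std(ddof=0)
```

The sample version divides by n - 1 instead of n, so it is always slightly larger.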

A percentile shows the value below which a given percentage of observations falls. For example, 25% of people have glucose measurements below 99, 50% have glucose below 117, and 75% have glucose below 140.
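You can also compute percentiles directly with .quantile(). A small sketch on made-up glucose values (in the post, this would be data['Glucose']):

```python
import pandas as pd

glucose = pd.Series([85, 89, 99, 115, 117, 125, 137, 140, 148, 183])

# The values below which 25%, 50%, and 75% of observations fall
percentiles = glucose.quantile([0.25, 0.50, 0.75])
```

The 0.5 quantile is the median; by default pandas linearly interpolates between neighboring values.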

Class distribution

We can count the number of values in each class to see if they're evenly distributed. Here we have only two classes: Outcome is 1 when a person has diabetes and 0 when they don't.

We can calculate the number of rows with each value using .value_counts()

data['Outcome'].value_counts()
0    500
1    268
Name: Outcome, dtype: int64
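To see the class balance as proportions rather than raw counts, you can pass normalize=True. A sketch on a made-up outcome column (in the post, this would be data['Outcome']):

```python
import pandas as pd

outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

# normalize=True returns the fraction of rows in each class
proportions = outcome.value_counts(normalize=True)
```

For the full dataset this would show roughly 65% non-diabetic and 35% diabetic, a moderate class imbalance worth keeping in mind when modeling.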

Correlations

When two attributes depend on each other, they are correlated. You can compute the standard correlation coefficient (also called Pearson's correlation coefficient) for every pair of features using the .corr() method:

data.corr()
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
Pregnancies 1.000 0.129 0.141 -0.082 -0.074 0.018 -0.034 0.544 0.222
Glucose 0.129 1.000 0.153 0.057 0.331 0.221 0.137 0.264 0.467
BloodPressure 0.141 0.153 1.000 0.207 0.089 0.282 0.041 0.240 0.065
SkinThickness -0.082 0.057 0.207 1.000 0.437 0.393 0.184 -0.114 0.075
Insulin -0.074 0.331 0.089 0.437 1.000 0.198 0.185 -0.042 0.131
BMI 0.018 0.221 0.282 0.393 0.198 1.000 0.141 0.036 0.293
DiabetesPedigreeFunction -0.034 0.137 0.041 0.184 0.185 0.141 1.000 0.034 0.174
Age 0.544 0.264 0.240 -0.114 -0.042 0.036 0.034 1.000 0.238
Outcome 0.222 0.467 0.065 0.075 0.131 0.293 0.174 0.238 1.000

A correlation of 0 means the attributes are not correlated at all; -1 is a strong negative correlation, and 1 is a strong positive one.

As you can see, every attribute perfectly correlates with itself, and the rest of the correlations vary.

If you look at the last column, you can see that the level of Glucose is the attribute most strongly correlated with a positive outcome (a person has diabetes), which makes sense.
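A convenient way to read off that last column is to sort the correlations with Outcome. A minimal sketch on a toy frame (in the post you would run the same thing on the full data):

```python
import pandas as pd

df = pd.DataFrame({
    'Glucose': [148, 85, 183, 89, 137],
    'Age':     [50, 31, 32, 21, 33],
    'Outcome': [1, 0, 1, 0, 1],
})

# Correlation of every column with Outcome, strongest first
corr_with_outcome = df.corr()['Outcome'].sort_values(ascending=False)
```

Outcome itself always comes first with a correlation of 1.0.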

Skewness

When a normal (bell-curve) distribution has its center shifted to the left or to the right, it is skewed. If there are a lot of outliers to the right of the bell curve (so the long tail extends right and the top of the bell is shifted to the left), it's skewed right; the opposite is a left skew.

To calculate the skewness of each attribute, simply use the .skew() method:

data.skew()
Pregnancies                 0.902
Glucose                     0.174
BloodPressure              -1.844
SkinThickness               0.109
Insulin                     2.272
BMI                        -0.429
DiabetesPedigreeFunction    1.920
Age                         1.130
Outcome                     0.635
dtype: float64

As you can see, Pregnancies is strongly skewed to the right (most people have 0-1 pregnancies, and the counts trail off as the number of pregnancies increases).
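A quick sanity check for right skew: the long right tail pulls the mean above the median. A sketch on a made-up, right-skewed sample (in the post, this would be data['Pregnancies']):

```python
import pandas as pd

# Many small values plus a long right tail
pregnancies = pd.Series([0, 0, 1, 1, 1, 2, 2, 3, 6, 10])

skew = pregnancies.skew()      # positive for a right-skewed distribution
mean = pregnancies.mean()      # pulled up by the tail
median = pregnancies.median()  # robust to the tail
```

Here skew is positive and the mean (2.6) exceeds the median (1.5).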

Conclusion

This covers the most common methods for analyzing a new dataset. You can use them to gain a deeper understanding of your data and to inform later decisions about data preparation, model selection, and so on.

Try to analyze what these numbers mean, gain insights from this information, and write down the things you have learned.
