Analyzing Data with Descriptive Statistics | Digital Mind

Analyzing Data with Descriptive Statistics

The first step in any data science or machine learning project is to look at your data, analyze it, and see what insights you can gain from it without any algorithms. That will help you understand what you're dealing with, pick the right model, and achieve the best results.

In this post I will describe the basic methods for descriptive data analysis (not including data visualisation, which will be a separate post).

Import the libraries

Let's import the libraries we will need right away:

import pandas as pd

Load the data

As always, the first step is to load the data. In this project we will use one of the classic datasets, which can be used to predict the onset of diabetes. You can download the file here.

Before loading the file, I've created a column_names variable that will be used to set column names.

column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
# Passing names= assumes the CSV file has no header row of its own
data = pd.read_csv('../data/pima-indians-diabetes.csv', names=column_names)
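If you're unsure how setting column names interacts with the file contents, here is a small sketch that uses an in-memory CSV instead of the real file. The three-column sample and the variable names (csv_text, cols, with_names, without_names) are made up for illustration. It only shows that a file without a header row needs names, otherwise pandas treats the first data row as the header.

```python
import io

import pandas as pd

# Two made-up rows with no header line (not the real dataset).
csv_text = "6,148,72\n1,85,66\n"
cols = ["Pregnancies", "Glucose", "BloodPressure"]

# With names=, every line in the file is treated as data.
with_names = pd.read_csv(io.StringIO(csv_text), names=cols)

# Without names=, the first line is consumed as the header.
without_names = pd.read_csv(io.StringIO(csv_text))

print(with_names.shape)     # both rows kept
print(without_names.shape)  # one row lost to the header
```

If the row count after loading is one less than you expect, this is a likely cause.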

Look at your data

Excellent! The first step in data analysis is just looking at your raw data so you can understand what you're dealing with. Understand the purposes of the columns, what they mean, and what the values look like (which will give you ideas of how features can be used, and how the data may need to be transformed).

You can view the first 10 rows of your data by using the .head(10) method:
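As a quick illustration on a tiny, made-up DataFrame (a stand-in for the real dataset, with invented values), .head(n) returns the first n rows, and calling it without an argument returns the first 5:

```python
import pandas as pd

# Toy stand-in for the diabetes data (values are made up for illustration).
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

first_rows = toy.head(3)  # first 3 rows only
default_rows = toy.head()  # no argument: first 5 rows

print(first_rows)
```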


To find out how many rows and columns your dataset has, use .shape, which gives you the dimensions of your data as a (rows, columns) pair.

(768, 9)
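Since .shape is a plain tuple, you can unpack it into separate variables. A minimal sketch on a made-up DataFrame (the toy data and variable names are assumptions for illustration):

```python
import pandas as pd

# Made-up data: 4 rows, 2 columns.
toy = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
})

# .shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```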

To find out the data type of each column, use .dtypes:

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

Alternatively, you can use the pandas .info() method, which displays the number of rows and columns and the data type of each column. It also shows the number of non-null values per column; if a count is lower than the number of rows, some data points are missing in that column.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

As you can see, this dataset is pretty small: we have 768 examples and 9 columns that contain integer and float values.
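The non-null check described above can also be done directly by counting missing values per column. This is a minimal sketch on made-up data with a few gaps deliberately inserted; the toy DataFrame is an assumption, and the real dataset shows no nulls in the output above.

```python
import numpy as np
import pandas as pd

# Made-up data with a few missing values (NaN) added on purpose.
toy = pd.DataFrame({
    "Glucose": [148, 85, np.nan, 89],
    "BMI": [33.6, np.nan, np.nan, 28.1],
    "Age": [50, 31, 32, 21],
})

# isna() marks missing cells as True; summing counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column with a non-zero count here is the same column whose non-null count in .info() would fall short of the number of rows.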

Descriptive Statistics

The .describe() method shows you statistical properties of each column:

# Before running describe, set the display precision to 3 to reduce the number of digits after the decimal point
pd.set_option('display.precision', 3)

  • count is the number of non-null data points in this column
  • mean is the average value
  • std is the standard deviation, which shows how dispersed the values are (roughly, the typical distance between data points and the mean)
  • min and max are self-explanatory
  • the 25%, 50%, and 75% rows show the corresponding percentiles.

A percentile shows the value below which a given percentage of observations falls. For example, in this dataset 25% of people have glucose measurements below 99, 50% have glucose below 117, and 75% have glucose below 140.
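To make the percentile rows concrete, here is a small sketch on five made-up glucose values (not taken from the dataset). It compares what .describe() reports with a direct call to .quantile(); the series and variable names are assumptions for illustration.

```python
import pandas as pd

# Five made-up, evenly spaced values so the percentiles are easy to check by hand.
glucose = pd.Series([80, 90, 100, 110, 120], name="Glucose")

summary = glucose.describe()

# The 25%, 50%, and 75% rows of describe() are these quantiles.
q25, q50, q75 = glucose.quantile([0.25, 0.5, 0.75])

print(summary)
print(q25, q50, q75)  # 90.0 100.0 110.0
```

The 50% row is the median, which is often a better summary than the mean when a column has strong outliers.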

Class distribution

We can count the number of rows in each class to see whether the classes are balanced. Here we have only two classes: Outcome is 1 when a person has diabetes, and 0 when they don't.

We can calculate the number of rows with each value using .value_counts():

0    500
1    268
Name: Outcome, dtype: int64
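Raw counts are easier to compare as proportions (in the output above, 500 of 768 rows, roughly 65%, belong to class 0). A minimal sketch on a made-up Outcome column, using the normalize=True option of .value_counts():

```python
import pandas as pd

# Made-up labels: five zeros and three ones.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1], name="Outcome")

counts = outcome.value_counts()                # rows per class
shares = outcome.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```

If one class has a much larger share than the other, keep that in mind later when you choose evaluation metrics.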

Correlations

When two attributes tend to change together, they are correlated. You can compute the standard correlation coefficient (also called Pearson's correlation coefficient) for every pair of features using the .corr() method:


A correlation of 0 means there is no linear relationship between the attributes, -1 is a perfect negative correlation, and 1 is a perfect positive one.

As you can see, every attribute perfectly correlates with itself, and the rest of the correlations vary.

If you look at the last column, you can see that Glucose has the strongest correlation with a positive Outcome (a person has diabetes), which makes sense.
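Instead of scanning the last column by eye, you can pull it out and sort it. This is a sketch on made-up data (the values are invented so that Glucose tracks Outcome and Age does not); the variable names are assumptions for illustration.

```python
import pandas as pd

# Made-up data: Glucose rises with Outcome, Age is unrelated to it.
toy = pd.DataFrame({
    "Glucose": [90, 100, 120, 150, 170, 180],
    "Age": [50, 22, 45, 30, 60, 25],
    "Outcome": [0, 0, 0, 1, 1, 1],
})

corr = toy.corr()  # Pearson correlation for every pair of columns

# Correlation of each feature with the target, strongest positive first.
target_corr = corr["Outcome"].drop("Outcome").sort_values(ascending=False)
print(target_corr)
```

Keep in mind that Pearson's coefficient only captures linear relationships, so a value near 0 does not rule out a non-linear dependency.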

Skewness

Skewness describes how asymmetric a distribution is compared with a normal (bell-curve) distribution, whose center is shifted neither to the left nor to the right. If there is a long tail of values to the right of the peak (the top of the bell is shifted to the left), the distribution is skewed right and its skewness is positive.

To calculate the skewness of each attribute, simply use the .skew() method:

Pregnancies                 0.902
Glucose                     0.174
BloodPressure              -1.844
SkinThickness               0.109
Insulin                     2.272
BMI                        -0.429
DiabetesPedigreeFunction    1.920
Age                         1.130
Outcome                     0.635
dtype: float64

As you can see, Pregnancies is skewed to the right (most people have 0-1 pregnancies, and the counts trail off as the number of pregnancies increases).
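To get a feel for the sign of the skewness, here is a small sketch on three made-up series: one with a long right tail, one symmetric, and one with a long left tail. The values are invented for illustration.

```python
import pandas as pd

# Long tail to the right: mostly small values plus one large outlier.
right_tailed = pd.Series([0, 0, 1, 1, 1, 2, 2, 3, 10])

# Perfectly symmetric around its mean.
symmetric = pd.Series([1, 2, 3, 4, 5])

# Mirror image of the first series: long tail to the left.
left_tailed = -right_tailed

print(right_tailed.skew())  # positive
print(symmetric.skew())     # zero
print(left_tailed.skew())   # negative
```

Strongly skewed columns are common candidates for a transformation (for example a log transform) before modelling.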

Summary

This covers the most common methods for analyzing a new dataset. You can use them to gain a deeper understanding of your data and to guide later decisions about data preparation, model choice, and so on.

Try to analyze what these numbers mean, try to gain insights from this information and write down the things you have learned.
