Analyzing Data with Descriptive Statistics | Digital Mind

Analyzing Data with Descriptive Statistics

The first step in any data science or machine learning project is to look at your data, analyze it, and see what insights you can gain before applying any algorithms. That will help you understand what you're dealing with, pick the right model, and achieve the best results.

In this post I will describe the basic methods of descriptive data analysis (not including data visualisation, which will be a separate post).

Import the libraries

Let's import the libraries we will need right away:

import pandas as pd

Load the data

As always, the first step is to load the data. In this project we will use one of the classic datasets, which can be used to predict the onset of diabetes. You can download the file here.

Before loading the file, I've created a column_names variable that will be used to set column names.

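# Column labels to assign. header=0 assumes the first line of the file is a header row,
# which these names replace; for a file without a header row, leave header=0 out.
column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
data = pd.read_csv('../data/pima-indians-diabetes.csv', names=column_names, header=0)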
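Whether the names argument needs header=0 depends on whether the file already has a header row. Here is a minimal sketch of both cases using in-memory text instead of the real file. The two data rows are taken from the dataset, while the short header labels (preg, plas, ...) are made up for illustration.

```python
import io
import pandas as pd

column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

# In-memory stand-ins for a CSV file (the header labels are hypothetical).
rows = "6,148,72,35,0,33.6,0.627,50,1\n1,85,66,29,0,26.6,0.351,31,0\n"
with_header = "preg,plas,pres,skin,insu,mass,pedi,age,class\n" + rows

# File without a header row: names= labels the columns and keeps every line as data.
a = pd.read_csv(io.StringIO(rows), names=column_names)

# File with a header row: header=0 tells pandas to discard it in favour of names=.
b = pd.read_csv(io.StringIO(with_header), names=column_names, header=0)

print(a.shape, b.shape)
```

Both calls should give the same two-row frame with the readable column names.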

Look at your data

The first step in data analysis is simply looking at your raw data so that you understand what you're dealing with. Work out the purpose of each column, what it means, and what the values look like (this will give you ideas about how the features can be used and how the data may need to be transformed).

You can view the first 10 rows of your data with the .head(10) method:

data.head(10)
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1
5            5      116             74              0        0  25.6                     0.201   30        0
6            3       78             50             32       88  31.0                     0.248   26        1
7           10      115              0              0        0  35.3                     0.134   29        0
8            2      197             70             45      543  30.5                     0.158   53        1
9            8      125             96              0        0   0.0                     0.232   54        1

To find out how many rows and columns your dataset has, use .shape. It returns the dimensions of your data as a (rows, columns) tuple:

data.shape
(768, 9)

To find out the data type of each column, use .dtypes:

data.dtypes
Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

Alternatively, you can use the pandas .info() method, which displays the number of rows and columns and the data type of each column. It is also useful to look at the non-null counts: if a column's count is lower than the total number of rows, some values in that column are missing.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Pregnancies                 768 non-null int64
Glucose                     768 non-null int64
BloodPressure               768 non-null int64
SkinThickness               768 non-null int64
Insulin                     768 non-null int64
BMI                         768 non-null float64
DiabetesPedigreeFunction    768 non-null float64
Age                         768 non-null int64
Outcome                     768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

As you can see, this dataset is pretty small: we have 768 examples and 9 columns that contain integer and float values.
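The non-null check described for .info() can also be done directly. Below is a minimal sketch on a made-up three-row frame (the values and the missing entry are invented for illustration), counting missing values per column with isnull().sum().

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Glucose value.
toy = pd.DataFrame({
    'Glucose': [148.0, np.nan, 183.0],
    'Age': [50, 31, 32],
})

# Number of missing values in each column.
missing = toy.isnull().sum()
print(missing)
```

On the diabetes data every column has 768 non-null values, so the same call would report zero for each column.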

Descriptive Statistics

The .describe() method shows you statistical properties of each column:

# Before running describe, set the display precision to 3 to reduce the number of digits after the decimal point
pd.set_option('display.precision', 3)
data.describe()
       Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin      BMI  DiabetesPedigreeFunction      Age  Outcome
count      768.000  768.000        768.000        768.000  768.000  768.000                   768.000  768.000  768.000
mean         3.845  120.895         69.105         20.536   79.799   31.993                     0.472   33.241    0.349
std          3.370   31.973         19.356         15.952  115.244    7.884                     0.331   11.760    0.477
min          0.000    0.000          0.000          0.000    0.000    0.000                     0.078   21.000    0.000
25%          1.000   99.000         62.000          0.000    0.000   27.300                     0.244   24.000    0.000
50%          3.000  117.000         72.000         23.000   30.500   32.000                     0.372   29.000    0.000
75%          6.000  140.250         80.000         32.000  127.250   36.600                     0.626   41.000    1.000
max         17.000  199.000        122.000         99.000  846.000   67.100                     2.420   81.000    1.000

  • count is the number of non-null data points in this column
  • mean is the average value
  • std is the standard deviation, which shows how dispersed the values are: roughly, the typical distance between a data point and the mean
  • min and max are the smallest and largest values
  • the 25%, 50%, and 75% rows show the corresponding percentiles

A percentile is the value below which a given percentage of observations falls. For example, 25% of people have glucose measurements below 99, 50% have glucose below 117, and 75% have glucose below about 140.
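To see where these numbers come from, here is a small sketch that computes each statistic separately and compares it with what .describe() reports. The five values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical glucose-like values, chosen only for illustration.
values = pd.Series([90, 100, 110, 120, 130])

summary = values.describe()

# The same statistics computed one by one.
mean = values.mean()
std = values.std()               # sample standard deviation (divides by n - 1)
q25 = values.quantile(0.25)      # 25th percentile
median = values.quantile(0.50)   # 50th percentile
q75 = values.quantile(0.75)      # 75th percentile

print(summary)
print(mean, std, q25, median, q75)
```

Note that pandas uses the sample standard deviation by default, so std is computed with n - 1 in the denominator.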

Class distribution

We can count the number of values in each class to see if they're evenly distributed. Here we have only two classes: Outcome is 1 when a person has diabetes, and 0 when they don't.

We can calculate the number of rows with each value using .value_counts():

data['Outcome'].value_counts()
0    500
1    268
Name: Outcome, dtype: int64
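Counts like these are often easier to compare as proportions. Here is a minimal sketch, assuming a Series rebuilt with the same 500/268 split (on the real data you would use data['Outcome'] instead). It passes normalize=True to value_counts.

```python
import pandas as pd

# Stand-in for data['Outcome'], rebuilt from the class counts shown above.
outcome = pd.Series([0] * 500 + [1] * 268, name='Outcome')

# normalize=True returns the fraction of rows in each class instead of raw counts.
shares = outcome.value_counts(normalize=True)
print(shares.round(3))
```

Roughly 65% of the rows are class 0 and 35% are class 1, so the classes are not evenly distributed.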

Correlations

When two attributes tend to change together, they are correlated. You can compute the standard correlation coefficient (also called Pearson's correlation coefficient) for every pair of features using the .corr() method:

data.corr()
                          Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin    BMI  DiabetesPedigreeFunction     Age  Outcome
Pregnancies                     1.000    0.129          0.141         -0.082   -0.074  0.018                    -0.034   0.544    0.222
Glucose                         0.129    1.000          0.153          0.057    0.331  0.221                     0.137   0.264    0.467
BloodPressure                   0.141    0.153          1.000          0.207    0.089  0.282                     0.041   0.240    0.065
SkinThickness                  -0.082    0.057          0.207          1.000    0.437  0.393                     0.184  -0.114    0.075
Insulin                        -0.074    0.331          0.089          0.437    1.000  0.198                     0.185  -0.042    0.131
BMI                             0.018    0.221          0.282          0.393    0.198  1.000                     0.141   0.036    0.293
DiabetesPedigreeFunction       -0.034    0.137          0.041          0.184    0.185  0.141                     1.000   0.034    0.174
Age                             0.544    0.264          0.240         -0.114   -0.042  0.036                     0.034   1.000    0.238
Outcome                         0.222    0.467          0.065          0.075    0.131  0.293                     0.174   0.238    1.000

A correlation of 0 means the attributes have no linear relationship, -1 is a perfect negative correlation, and 1 is a perfect positive one.

As you can see, every attribute correlates perfectly with itself, and the rest of the correlations vary.

If you look at the last column, you can see that Glucose has the strongest correlation with a positive outcome (a person has diabetes), which makes sense.
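Reading the last column is easier when it is sorted. The sketch below copies the Outcome correlations from the table into a Series and ranks them. On the full dataset the starting point would be data.corr()['Outcome'] rather than hand-typed values.

```python
import pandas as pd

# Correlations with Outcome, copied from the last column of the table above.
# With the real data loaded, this Series would be data.corr()['Outcome'].
outcome_corr = pd.Series({
    'Pregnancies': 0.222,
    'Glucose': 0.467,
    'BloodPressure': 0.065,
    'SkinThickness': 0.075,
    'Insulin': 0.131,
    'BMI': 0.293,
    'DiabetesPedigreeFunction': 0.174,
    'Age': 0.238,
    'Outcome': 1.000,
})

# Drop the trivial self-correlation and sort from strongest to weakest.
ranked = outcome_corr.drop('Outcome').sort_values(ascending=False)
print(ranked)
```

Glucose comes out on top, followed by BMI and Age, while BloodPressure is at the bottom.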

Skewness

Skewness describes how lopsided a distribution is compared to a symmetric normal (bell-curve) distribution. If there is a long tail of large values to the right of the peak (so the top of the bell sits to the left), the distribution is skewed right, and its skewness is positive.

To calculate the skewness of each attribute, simply use the .skew() method:

data.skew()
Pregnancies                 0.902
Glucose                     0.174
BloodPressure              -1.844
SkinThickness               0.109
Insulin                     2.272
BMI                        -0.429
DiabetesPedigreeFunction    1.920
Age                         1.130
Outcome                     0.635
dtype: float64

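As you can see, Pregnancies is skewed to the right (0.902): half of the people in the dataset have had 3 pregnancies or fewer, and the counts trail off up to a maximum of 17. Insulin (2.272) and DiabetesPedigreeFunction (1.920) are skewed to the right even more strongly.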
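The sign of the skewness is easy to check on tiny examples. The three Series below are made up for illustration: one symmetric, one with a tail to the right, and its mirror image with a tail to the left.

```python
import pandas as pd

# Hypothetical samples, chosen only to show the sign of .skew().
symmetric = pd.Series([1, 2, 3, 4, 5])
right_tail = pd.Series([0, 0, 1, 1, 2, 3, 10])   # one large value far to the right
left_tail = -right_tail                          # mirror image: tail to the left

print(symmetric.skew())    # close to zero
print(right_tail.skew())   # positive
print(left_tail.skew())    # negative
```

A value near zero means the distribution is roughly symmetric, a positive value means a tail to the right, and a negative value means a tail to the left.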

Conclusion

This covers the most common methods for analyzing a new dataset. You can use them to gain a deeper understanding of your data and to make decisions about data preparation, model choice, and so on.

Try to work out what these numbers mean, look for insights in them, and write down what you have learned.
