EDA

This section introduces the Exploratory Data Analysis component of DataPrep.

Introduction to Exploratory Data Analysis and dataprep.eda

Exploratory Data Analysis (EDA) is the process of exploring a dataset and getting an understanding of its main characteristics. The dataprep.eda package simplifies this process by allowing the user to explore important characteristics with simple APIs. Each API allows the user to analyze the dataset from a high level to a low level, and from different perspectives. Specifically, dataprep.eda provides the following functionality:

  • Analyze column distributions with plot(). The function plot() explores the column distributions and statistics of the dataset. It will detect the column type, and then output various plots and statistics that are appropriate for the respective type. The user can optionally pass one or two columns of interest as parameters: If one column is passed, its distribution will be plotted in various ways, and column statistics will be computed. If two columns are passed, plots depicting the relationship between the two columns will be generated.

  • Analyze correlations with plot_correlation(). The function plot_correlation() explores the correlation between columns in various ways and using multiple correlation metrics. By default, it plots correlation matrices with various metrics. The user can optionally pass one or two columns of interest as parameters: If one column is passed, the correlation between this column and all other columns will be computed and ranked. If two columns are passed, a scatter plot and regression line will be plotted.

  • Analyze missing values with plot_missing(). The function plot_missing() enables thorough analysis of the missing values and their impact on the dataset. By default, it will generate various plots which display the amount of missing values for each column and any underlying patterns of the missing values in the dataset. To understand the impact of the missing values in one column on the other columns, the user can pass the column name as a parameter. Then, plot_missing() will generate the distribution of each column with and without the missing values from the given column, enabling a thorough understanding of their impact.

  • Analyze column differences with plot_diff(). The function plot_diff() explores the differences of column distributions and statistics across multiple datasets. It will detect the column type, and then output various plots and statistics that are appropriate for the respective type. The user can optionally set the baseline which is used as the target dataset to compare with other datasets.

The following sections give a simple demonstration of plot(), plot_correlation(), plot_missing(), and plot_diff() using an example dataset.

Analyze distributions with plot()

The function plot() explores the distributions and statistics of the dataset. The following describes the functionality of plot() for a given dataframe df.

  1. plot(df): plots the distribution of each column and calculates dataset statistics

  2. plot(df, x): plots the distribution of column x in various ways and calculates column statistics

  3. plot(df, x, y): generates plots depicting the relationship between columns x and y

The following shows an example of plot(df). It plots a histogram for each numerical column and a bar chart for each categorical column, and it computes dataset statistics.

[1]:
from dataprep.eda import plot
from dataprep.datasets import load_dataset

# Load the example "adult" dataset and plot every column's distribution
df = load_dataset("adult")
plot(df)
[1]:
DataPrep.EDA Report
Dataset Statistics
Number of Variables 15
Number of Rows 48842
Missing Cells 0
Missing Cells (%) 0.0%
Duplicate Rows 52
Duplicate Rows (%) 0.1%
Total Size in Memory 30.2 MB
Average Row Size in Memory 649.3 B
Variable Types
  • Numerical: 6
  • Categorical: 9
Dataset Insights
fnlwgt is skewed
education-num is skewed
capital-gain is skewed
capital-loss is skewed
hours-per-week is skewed
capital-gain has 44807 (91.74%) zeros
capital-loss has 46560 (95.33%) zeros

For more information about the function plot() see here.
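
The other two call forms listed above, plot(df, x) and plot(df, x, y), can be sketched as follows. This is a minimal sketch, assuming dataprep is installed; the wrapper explore_columns is a hypothetical helper introduced only for illustration, and the import is placed inside it so that defining the helper does not itself require dataprep.

```python
def explore_columns(df, x, y):
    """Build the single-column and two-column reports for a dataframe.

    Hypothetical helper, not part of dataprep.
    """
    # Lazy import: dataprep is only needed when the helper is called.
    from dataprep.eda import plot

    single = plot(df, x)      # distribution and statistics of column x
    pair = plot(df, x, y)     # relationship between columns x and y
    return single, pair
```

With the adult dataset loaded as df, a call such as explore_columns(df, "hours-per-week", "education-num") uses two column names that appear in the insights above. In a notebook, evaluating either returned object should display its report.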

Analyze correlations with plot_correlation()

The function plot_correlation() explores the correlation between columns in various ways and using multiple correlation metrics. The following describes the functionality of plot_correlation() for a given dataframe df.

  1. plot_correlation(df): plots correlation matrices (correlations between all pairs of columns)

  2. plot_correlation(df, x): plots the most correlated columns to column x

  3. plot_correlation(df, x, y): plots the joint distribution of column x and column y and computes a regression line

The following shows an example of plot_correlation(df). It generates correlation matrices using the Pearson, Spearman, and KendallTau correlation coefficients.

[2]:
from dataprep.eda import plot_correlation
from dataprep.datasets import load_dataset
df = load_dataset("wine-quality-red")
plot_correlation(df)
[2]:
DataPrep.EDA Report
                              Pearson  Spearman  KendallTau
Highest Positive Correlation    0.672      0.79       0.607
Highest Negative Correlation   -0.683    -0.707      -0.528
Lowest Correlation              0.002     0.001         0.0
Mean Correlation                0.019     0.028       0.021
Pearson
  • Most positive correlated: (fixed_acidity, citric_acid)
  • Most negative correlated: (fixed_acidity, pH)
  • Least correlated: (volatile_acidity, residual_sugar)
Spearman
  • Most positive correlated: (free_sulfur_d...ide, total_sulfur_...ide)
  • Most negative correlated: (fixed_acidity, pH)
  • Least correlated: (total_sulfur_...ide, sulphates)
KendallTau
  • Most positive correlated: (free_sulfur_d...ide, total_sulfur_...ide)
  • Most negative correlated: (fixed_acidity, pH)
  • Least correlated: (total_sulfur_...ide, sulphates)

For more information about the function plot_correlation() see here.
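
To make the three correlation metrics named above concrete, the following pure-Python sketch computes each of them on a tiny made-up sample. It is an illustration only, not DataPrep's implementation: the function names are hypothetical, and the Kendall variant shown is the simple tau-a form, which assumes no tied values and may differ from the variant DataPrep uses.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson: covariance scaled by both standard deviations."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up sample: y grows monotonically but not linearly with x.
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]
print(pearson(xs, ys), spearman(xs, ys), kendall_tau(xs, ys))
```

Because the relationship is monotonic but not linear, the two rank-based metrics report a perfect association while Pearson stays slightly below 1. This is one reason it can be useful to look at all three matrices.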

Analyze missing values with plot_missing()

The function plot_missing() enables thorough analysis of the missing values and their impact on the dataset. The following describes the functionality of plot_missing() for a given dataframe df.

  1. plot_missing(df): plots the amount and position of missing values, and how missingness is related across columns

  2. plot_missing(df, x): plots the impact of the missing values in column x on all other columns

  3. plot_missing(df, x, y): plots the impact of the missing values from column x on column y in various ways

[3]:
from dataprep.eda import plot_missing
from dataprep.datasets import load_dataset
df = load_dataset("titanic")
plot_missing(df)
[3]:
DataPrep.EDA Report

Missing Statistics

Missing Cells 866
Missing Cells (%) 8.1%
Missing Columns 3
Missing Rows 708
Avg Missing Cells per Column 72.17
Avg Missing Cells per Row 0.97

For more information about the function plot_missing() see here.
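
The idea behind plot_missing(df, x), comparing each column with and without the rows where x is missing, can be illustrated without any plotting. The sketch below is a simplified stand-in that uses plain Python and made-up rows; the helper name with_and_without_missing and the sample values are hypothetical and are not taken from DataPrep or from the titanic dataset.

```python
from statistics import mean


def with_and_without_missing(rows, x, y):
    """Return the values of column y before and after dropping rows where x is missing."""
    before = [row[y] for row in rows if row[y] is not None]
    after = [row[y] for row in rows if row[x] is not None and row[y] is not None]
    return before, after


# Made-up rows: "age" is missing (None) in two of the five rows.
rows = [
    {"age": 22, "fare": 10.0},
    {"age": None, "fare": 50.0},
    {"age": 35, "fare": 20.0},
    {"age": None, "fare": 80.0},
    {"age": 54, "fare": 40.0},
]

before, after = with_and_without_missing(rows, "age", "fare")
print(mean(before), mean(after))
```

In this toy sample the rows with a missing age have the highest fares, so dropping them shifts the fare distribution downward. A shift of this kind is what the per-column comparison plots are meant to reveal.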

Analyze differences with plot_diff()

The function plot_diff() explores the differences in column distributions and statistics across multiple datasets. The following shows an example of plot_diff() for two dataframes, df1 and df2.

[4]:
from dataprep.eda import plot_diff
from dataprep.datasets import load_dataset
df1 = load_dataset("house_prices_test")
df2 = load_dataset("house_prices_train")
plot_diff([df1, df2])
[4]:
DataPrep.EDA Report
Difference Overview
                             df1        df2
Number of Variables          80         81
Number of Rows               1459       1460
Missing Cells                7000       6965
Missing Cells (%)            6.0%       5.9%
Duplicate Rows               0          0
Duplicate Rows (%)           0.0%       0.0%
Total Size in Memory         912.0 KB   924.0 KB
Average Row Size in Memory   910.6 KB   922.6 KB
Variable Types
  • df1: Numerical: 26, Categorical: 53, GeoGraphy: 1
  • df2: Numerical: 27, Categorical: 53, GeoGraphy: 1

For more information about the function plot_diff() see here.

Create a profile report with create_report()

The function create_report() generates a comprehensive profile report of the dataset. create_report() combines the individual components of the dataprep.eda package and outputs them into a nicely formatted HTML document. The document contains the following information:

  1. Overview: the detected type of each column in the dataframe

  2. Variables: variable type, unique values, distinct count, missing values

  3. Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range

  4. Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness

  5. Text analysis for length, sample and letter

  6. Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices

  7. Missing Values: bar chart, heatmap and spectrum of missing values

An example report can be downloaded here.
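
A report can be generated and written to disk along the following lines. This is a minimal sketch, assuming dataprep is installed and that the object returned by create_report() offers a save() method that accepts a file name; build_report is a hypothetical wrapper introduced only for illustration, with the import placed inside it so that defining the helper does not itself require dataprep.

```python
def build_report(df, path):
    """Create a profile report for df and save it under the given path.

    Hypothetical helper, not part of dataprep.
    """
    # Lazy import: dataprep is only needed when the helper is called.
    from dataprep.eda import create_report

    report = create_report(df)
    # Assumption: save() writes the HTML document to the given location.
    report.save(path)
    return report
```

For example, build_report(df, "titanic_report") could be called on a dataframe loaded with load_dataset("titanic"); the file name used here is arbitrary.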

Get the intermediate data

DataPrep.EDA separates computation from rendering, so you can compute the intermediate data on its own and render it with another plotting library.

For each plot function, there is a corresponding compute function that returns the intermediate data used for rendering. For example, the intermediates behind plot_correlation(df) can be obtained with compute_correlation(df). The result is a dictionary, and it can also be saved to a JSON file.

[5]:
from dataprep.eda import compute_correlation
from dataprep.datasets import load_dataset
df = load_dataset("titanic")
imdt = compute_correlation(df)
imdt.save("imdt.json")
imdt
Intermediate has been saved to imdt.json!
[5]:
{'data': {'Pearson': {'x': {1: 'PassengerId',
    2: 'PassengerId',
    3: 'PassengerId',
    4: 'PassengerId',
    5: 'PassengerId',
    6: 'PassengerId',
    9: 'Survived',
    10: 'Survived',
    11: 'Survived',
    12: 'Survived',
    13: 'Survived',
    17: 'Pclass',
    18: 'Pclass',
    19: 'Pclass',
    20: 'Pclass',
    25: 'Age',
    26: 'Age',
    27: 'Age',
    33: 'SibSp',
    34: 'SibSp',
    41: 'Parch'},
   'y': {1: 'Survived',
    2: 'Pclass',
    3: 'Age',
    4: 'SibSp',
    5: 'Parch',
    6: 'Fare',
    9: 'Pclass',
    10: 'Age',
    11: 'SibSp',
    12: 'Parch',
    13: 'Fare',
    17: 'Age',
    18: 'SibSp',
    19: 'Parch',
    20: 'Fare',
    25: 'SibSp',
    26: 'Parch',
    27: 'Fare',
    33: 'Parch',
    34: 'Fare',
    41: 'Fare'},
   'correlation': {1: -0.005006660767066476,
    2: -0.03514399403037967,
    3: 0.03684719786132784,
    4: -0.057526833784441705,
    5: -0.0016520124027188286,
    6: 0.01265821928749123,
    9: -0.33848103596101586,
    10: -0.07722109457217737,
    11: -0.03532249888573588,
    12: 0.08162940708348222,
    13: 0.2573065223849618,
    17: -0.36922601531551574,
    18: 0.0830813628456866,
    19: 0.01844267131074835,
    20: -0.5494996199439061,
    25: -0.3082467589236574,
    26: -0.18911926263203518,
    27: 0.09606669176903881,
    33: 0.41483769862015263,
    34: 0.15965104324216103,
    41: 0.21622494477076254}},
  'Spearman': {'x': {1: 'PassengerId',
    2: 'PassengerId',
    3: 'PassengerId',
    4: 'PassengerId',
    5: 'PassengerId',
    6: 'PassengerId',
    9: 'Survived',
    10: 'Survived',
    11: 'Survived',
    12: 'Survived',
    13: 'Survived',
    17: 'Pclass',
    18: 'Pclass',
    19: 'Pclass',
    20: 'Pclass',
    25: 'Age',
    26: 'Age',
    27: 'Age',
    33: 'SibSp',
    34: 'SibSp',
    41: 'Parch'},
   'y': {1: 'Survived',
    2: 'Pclass',
    3: 'Age',
    4: 'SibSp',
    5: 'Parch',
    6: 'Fare',
    9: 'Pclass',
    10: 'Age',
    11: 'SibSp',
    12: 'Parch',
    13: 'Fare',
    17: 'Age',
    18: 'SibSp',
    19: 'Parch',
    20: 'Fare',
    25: 'SibSp',
    26: 'Parch',
    27: 'Fare',
    33: 'Parch',
    34: 'Fare',
    41: 'Fare'},
   'correlation': {1: -0.005006660767066498,
    2: -0.03409135008914179,
    3: 0.04100991613236293,
    4: -0.06116076582604884,
    5: 0.0012351780934194748,
    6: -0.013975133780990471,
    9: -0.3396679366500525,
    10: -0.052565300044694487,
    11: 0.08887948468090501,
    12: 0.13826563286545587,
    13: 0.3237361394448083,
    17: -0.36166557503434504,
    18: -0.04301876651204207,
    19: -0.022801341928590464,
    20: -0.6880316726256096,
    25: -0.1820612589179174,
    26: -0.2542121174301802,
    27: 0.13505121773428777,
    33: 0.45001397100861634,
    34: 0.4471129882944581,
    41: 0.4100738082761382}},
  'KendallTau': {'x': {1: 'PassengerId',
    2: 'PassengerId',
    3: 'PassengerId',
    4: 'PassengerId',
    5: 'PassengerId',
    6: 'PassengerId',
    9: 'Survived',
    10: 'Survived',
    11: 'Survived',
    12: 'Survived',
    13: 'Survived',
    17: 'Pclass',
    18: 'Pclass',
    19: 'Pclass',
    20: 'Pclass',
    25: 'Age',
    26: 'Age',
    27: 'Age',
    33: 'SibSp',
    34: 'SibSp',
    41: 'Parch'},
   'y': {1: 'Survived',
    2: 'Pclass',
    3: 'Age',
    4: 'SibSp',
    5: 'Parch',
    6: 'Fare',
    9: 'Pclass',
    10: 'Age',
    11: 'SibSp',
    12: 'Parch',
    13: 'Fare',
    17: 'Age',
    18: 'SibSp',
    19: 'Parch',
    20: 'Fare',
    25: 'SibSp',
    26: 'Parch',
    27: 'Fare',
    33: 'Parch',
    34: 'Fare',
    41: 'Fare'},
   'correlation': {1: -0.004090214762393426,
    2: -0.026824400986911346,
    3: 0.02754181401332336,
    4: -0.04839417859092306,
    5: 0.0007978451239667482,
    6: -0.008920866826633959,
    9: -0.32353318439409545,
    10: -0.043385054517253836,
    11: 0.08591509091074537,
    12: 0.13393261225325737,
    13: 0.2662286416742869,
    17: -0.2860814161328999,
    18: -0.03955236574306877,
    19: -0.021019471733083554,
    20: -0.5735307309748154,
    25: -0.14274551945143282,
    26: -0.20011172214961384,
    27: 0.0932489072038393,
    33: 0.4252407973704515,
    34: 0.35826215386190535,
    41: 0.3303597642072928}}},
 'axis_range': ['PassengerId',
  'Survived',
  'Pclass',
  'Age',
  'SibSp',
  'Parch',
  'Fare'],
 'tabledata': {'Highest Positive Correlation': {'Pearson': 0.415,
   'Spearman': 0.45,
   'KendallTau': 0.425},
  'Highest Negative Correlation': {'Pearson': -0.549,
   'Spearman': -0.688,
   'KendallTau': -0.574},
  'Lowest Correlation': {'Pearson': 0.002,
   'Spearman': 0.001,
   'KendallTau': 0.001},
  'Mean Correlation': {'Pearson': -0.024,
   'Spearman': -0.001,
   'KendallTau': 0.0}},
 'insights': {'Pearson': ['Most positive correlated: (SibSp, Parch)',
   'Most negative correlated: (Pclass, Fare)',
   'Least correlated: (PassengerId, Parch)'],
  'Spearman': ['Most positive correlated: (SibSp, Parch)',
   'Most negative correlated: (Pclass, Fare)',
   'Least correlated: (PassengerId, Parch)'],
  'KendallTau': ['Most positive correlated: (SibSp, Parch)',
   'Most negative correlated: (Pclass, Fare)',
   'Least correlated: (PassengerId, Parch)']}}
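
Because the intermediate is a plain nested dictionary, it can be post-processed without any plotting. The sketch below ranks column pairs by the absolute value of their correlation. The helper strongest_pairs is hypothetical, and pearson_excerpt is a four-entry excerpt copied from the Pearson output above, not the full result.

```python
def strongest_pairs(metric_data, k=3):
    """Return the k column pairs with the largest absolute correlation.

    metric_data is expected to have the shape shown above: three
    dictionaries 'x', 'y', and 'correlation' that share the same keys.
    """
    corr = metric_data["correlation"]
    ranked = sorted(corr, key=lambda i: abs(corr[i]), reverse=True)
    return [(metric_data["x"][i], metric_data["y"][i], corr[i]) for i in ranked[:k]]


# Excerpt of the Pearson block printed above (keys and values copied verbatim).
pearson_excerpt = {
    "x": {1: "PassengerId", 17: "Pclass", 20: "Pclass", 33: "SibSp"},
    "y": {1: "Survived", 17: "Age", 20: "Fare", 33: "Parch"},
    "correlation": {
        1: -0.005006660767066476,
        17: -0.36922601531551574,
        20: -0.5494996199439061,
        33: 0.41483769862015263,
    },
}

top = strongest_pairs(pearson_excerpt, k=2)
for x, y, value in top:
    print(f"{x} vs {y}: {value:.3f}")
```

The two pairs returned for this excerpt, (Pclass, Fare) and (SibSp, Parch), agree with the most negative and most positive pairs listed under 'insights'. With the full object, the same helper could presumably be applied as strongest_pairs(imdt["data"]["Pearson"]), assuming imdt supports dictionary-style access as its printed form suggests.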

Specifying colors

The supported colors of DataPrep.EDA match those of the Bokeh library. Color values can be provided in any of the following ways:

  • any of the 147 named CSS colors, e.g., ‘green’, ‘indigo’

  • an RGB(A) hex value, e.g., ‘#FF0000’, ‘#44444444’

  • a 3-tuple of integers (r,g,b) between 0 and 255

  • a 4-tuple of (r,g,b,a) where r, g, b are integers between 0 and 255 and a is a floating point value between 0 and 1
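
The four accepted forms can be told apart with a small check like the one below. This is a hedged sketch in plain Python, not code from DataPrep or Bokeh: color_form is a hypothetical helper, it assumes hex strings carry exactly 6 or 8 hex digits as in the examples above, and it does not verify named colors against the CSS list.

```python
import re

# Assumption: hex strings carry exactly 6 (RGB) or 8 (RGBA) hex digits.
_HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{8})")


def color_form(color):
    """Return which of the four listed forms a color value uses."""
    if isinstance(color, str):
        if color.startswith("#"):
            if _HEX_COLOR.fullmatch(color) is None:
                raise ValueError(f"malformed hex color: {color!r}")
            return "hex"
        # Named colors are not checked against the CSS list in this sketch.
        return "named"
    if isinstance(color, tuple) and len(color) in (3, 4):
        channels_ok = all(isinstance(c, int) and 0 <= c <= 255 for c in color[:3])
        alpha_ok = len(color) == 3 or 0 <= color[3] <= 1
        if channels_ok and alpha_ok:
            return "rgb" if len(color) == 3 else "rgba"
    raise ValueError(f"unsupported color value: {color!r}")


examples = ["green", "#FF0000", "#44444444", (255, 0, 0), (255, 0, 0, 0.5)]
forms = [color_form(c) for c in examples]
print(forms)
```

A check of this kind can be run before passing user-supplied colors on to a plotting call, so that malformed values fail early with a clear message.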