`plot()`: analyze distributions

Overview

The function plot() explores the distributions and statistics of a dataset. It generates a variety of visualizations and statistics that give the user a comprehensive understanding of the column distributions and their relationships. The following describes the functionality of plot() for a given dataframe df.

plot(df): plots the distribution of each column and computes dataset statistics
plot(df, col1): plots the distribution of column col1 in various ways, and computes its statistics
plot(df, col1, col2): generates plots depicting the relationship between columns col1 and col2

The generated plots differ for numerical, categorical, geography and geopoint columns. The following table summarizes the output for each combination of column types.

`col1`	`col2`	Output
None	None	dataset statistics, histogram or bar chart for each column
Numerical	None	column statistics, histogram, kde plot, qq-normal plot, box plot
Categorical	None	column statistics, bar chart, pie chart, word cloud, word frequencies
Geography	None	column statistics, bar chart, pie chart, word cloud, word frequencies, world map
Numerical	Numerical	scatter plot, hexbin plot, binned box plot
Numerical	Categorical	categorical box plot, multi-line chart
Categorical	Numerical	categorical box plot, multi-line chart
Categorical	Categorical	nested bar chart, stacked bar chart, heat map
Categorical	Geography	nested bar chart, stacked bar chart, heat map
Geography	Categorical	nested bar chart, stacked bar chart, heat map
Geopoint	Categorical	nested bar chart, stacked bar chart, heat map
Categorical	Geopoint	nested bar chart, stacked bar chart, heat map
Numerical	Geography	categorical box plot, multi-line chart, world map
Geography	Numerical	categorical box plot, multi-line chart, world map
Numerical	Geopoint	geo map
Geopoint	Numerical	geo map
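
The table above is effectively a lookup keyed on the detected types of col1 and col2. A minimal sketch of that lookup in plain Python follows. The names OUTPUTS and expected_output are hypothetical and exist only for illustration; they are not part of the dataprep API, and only a few rows of the table are reproduced.

```python
# Illustrative only: a subset of the table above expressed as a lookup.
# OUTPUTS and expected_output are hypothetical names, not dataprep API.
from typing import List, Optional

OUTPUTS = {
    (None, None): ["dataset statistics", "histogram or bar chart for each column"],
    ("Numerical", None): ["column statistics", "histogram", "kde plot",
                          "qq-normal plot", "box plot"],
    ("Numerical", "Numerical"): ["scatter plot", "hexbin plot", "binned box plot"],
    ("Numerical", "Categorical"): ["categorical box plot", "multi-line chart"],
    ("Categorical", "Categorical"): ["nested bar chart", "stacked bar chart", "heat map"],
    ("Numerical", "Geopoint"): ["geo map"],
}


def expected_output(col1_type: Optional[str] = None,
                    col2_type: Optional[str] = None) -> List[str]:
    """Return the outputs listed in the table for a pair of column types."""
    key = (col1_type, col2_type)
    swapped = (col2_type, col1_type)
    # For two-column calls the table lists the same output in either order.
    if key not in OUTPUTS and None not in key and swapped in OUTPUTS:
        key = swapped
    return OUTPUTS[key]


print(expected_output())                            # dataset-level output
print(expected_output("Categorical", "Numerical"))  # same row as Numerical/Categorical
```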

Next, we demonstrate the functionality of plot().

Load the dataset

dataprep.eda supports Pandas and Dask dataframes. Here, we will load the well-known adult dataset into a Pandas dataframe using the load_dataset function.

[1]:

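from dataprep.datasets import load_dataset
import numpy as np

# Load the adult dataset into a Pandas dataframe
df = load_dataset('adult')
# The raw data marks missing values as " ?"; convert them to NaN
df = df.replace(" ?", np.nan)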

Get an overview of the dataset with `plot(df)`

We start by calling plot(df), which computes dataset-level statistics, a histogram for each numerical column, and a bar chart for each categorical column. The number of bins in the histogram can be specified with the parameter bins, and the number of categories in the bar chart can be specified with the parameter ngroups. If a column contains missing values, the percentage of missing values is shown in the plot title, and the missing values are ignored when generating the plots.

[2]:

from dataprep.eda import plot
plot(df)

[2]:

DataPrep.EDA Report

Stats and Insights

Dataset Statistics

Number of Variables	15
Number of Rows	48842
Missing Cells	6465
Missing Cells (%)	0.9%
Duplicate Rows	52
Duplicate Rows (%)	0.1%
Total Size in Memory	30.1 MB
Average Row Size in Memory	645.7 B
Variable Types	Numerical: 6 Categorical: 9

Dataset Insights

workclass has 2799 (5.73%) missing values	Missing
occupation has 2809 (5.75%) missing values	Missing
native-country has 857 (1.75%) missing values	Missing
fnlwgt is skewed	Skewed
education-num is skewed	Skewed
capital-gain is skewed	Skewed
capital-loss is skewed	Skewed
hours-per-week is skewed	Skewed
capital-gain has 44807 (91.74%) zeros	Zeros
capital-loss has 46560 (95.33%) zeros	Zeros
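
The percentages in the report follow directly from the counts. As a quick check, the arithmetic below reproduces several of them from the numbers shown above. The helper name pct is hypothetical, and rounding to one or two decimals is an assumption chosen to match the displayed values.

```python
# Reproduce percentages shown in the report from the reported counts.
# pct is a hypothetical helper; the rounding precision is an assumption.
n_rows, n_cols = 48842, 15


def pct(count: int, total: int, digits: int) -> float:
    """Percentage of count in total, rounded to the given number of decimals."""
    return round(100 * count / total, digits)


# Missing cells are measured against all cells (rows x columns).
print(pct(6465, n_rows * n_cols, 1))   # Missing Cells (%)
# Per-column figures are measured against the number of rows.
print(pct(2799, n_rows, 2))            # workclass missing values
print(pct(44807, n_rows, 2))           # capital-gain zeros
print(pct(52, n_rows, 1))              # Duplicate Rows (%)
```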

Understand a column with `plot(df, col1)`

After getting an overview of the dataset, we can thoroughly investigate a column of interest col1 using plot(df, col1). The output of plot(df, col1) is different for numerical and categorical columns.

When col1 is a numerical column, plot(df, col1) computes column statistics and generates a histogram, KDE plot, normal Q-Q plot and box plot:

[3]:

plot(df, "age")

[3]:

DataPrep.EDA Report

Stats Histogram KDE Plot Normal Q-Q Plot Box Plot Value Table

Overview

Approximate Distinct Count	74
Approximate Unique (%)	0.2%
Missing	0
Missing (%)	0.0%
Infinite	0
Infinite (%)	0.0%
Memory Size	763.2 KB
Mean	38.6436
Minimum	17
Maximum	90
Zeros	0
Zeros (%)	0.0%
Negatives	0
Negatives (%)	0.0%

Quantile Statistics

Minimum	17
5-th Percentile	19
Q1	28
Median	37
Q3	48
95-th Percentile	63
Maximum	90
Range	73
IQR	20

Descriptive Statistics

Mean	38.6436
Standard Deviation	13.7105
Variance	187.9781
Sum	1.8874e+06
Skewness	0.5576
Kurtosis	-0.1844
Coefficient of Variation	0.3548
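
Several of the statistics above are related by simple identities, which makes the report easy to sanity-check. The sketch below verifies those identities using only the values displayed for age. Small discrepancies arise because the displayed values are rounded, so the variance check uses a tolerance. The outlier fences use the conventional 1.5 x IQR box plot rule, which is an assumption about how outliers are flagged and is not confirmed by the report itself.

```python
import math

# Values copied from the report for the age column.
n_rows = 48842
mean, std, variance = 38.6436, 13.7105, 187.9781
minimum, q1, q3, maximum = 17, 28, 48, 90

value_range = maximum - minimum   # Range
iqr = q3 - q1                     # IQR
cv = std / mean                   # Coefficient of Variation
total = mean * n_rows             # Sum

# Assumption: conventional Tukey fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(value_range, iqr)
print(round(cv, 4))
print(f"{total:.4e}")
print(math.isclose(std ** 2, variance, rel_tol=1e-4))
print(lower_fence, upper_fence)
```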

'hist.bins': 50

Number of bins in the histogram

'hist.yscale': 'linear'

Y-axis scale ("linear" or "log")

'hist.color': '#aec7e8'

Color

'height': 400

Height of the plot

'width': 450

Width of the plot

age is skewed right (γ1 = 0.5576)

'kde.bins': 50

Number of bins in the histogram

'kde.yscale': 'linear'

Y-axis scale ("linear" or "log")

'kde.hist_color': '#aec7e8'

Color of the density histogram

'kde.line_color': '#d62728'

Color of the density line

'height': 400

Height of the plot

'width': 450

Width of the plot

'qqnorm.point_color': '#1f77b4'

Color of the points

'qqnorm.line_color': '#d62728'

Color of the line

'height': 400

Height of the plot

'width': 450

Width of the plot

'box.color': '#1f77b4'

Color

'height': 400

Height of the plot

'width': 450

Width of the plot

age has 216 outliers

'value_table.ngroups': 10

The number of distinct values to show

Value	Count	Frequency (%)
36	1348	2.8%
35	1337	2.7%
33	1335	2.7%
23	1329	2.7%
31	1325	2.7%
34	1303	2.7%
28	1280	2.6%
37	1280	2.6%
30	1278	2.6%
38	1264	2.6%
Other values (64)	35763	73.2%
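
The value table lists the most frequent values and folds the remainder into a single "Other values" row. A minimal sketch of that aggregation with collections.Counter is shown below. The function value_table and the toy data are hypothetical, and this is not dataprep's implementation.

```python
from collections import Counter


def value_table(values, ngroups=10):
    """Return (label, count, percent) rows for the ngroups most common values,
    followed by one row that aggregates every remaining value."""
    counts = Counter(values)
    total = sum(counts.values())
    top = counts.most_common(ngroups)
    rows = [(str(value), count, round(100 * count / total, 1)) for value, count in top]
    n_other = len(counts) - len(top)
    if n_other > 0:
        other = total - sum(count for _, count in top)
        rows.append((f"Other values ({n_other})", other, round(100 * other / total, 1)))
    return rows


# Toy data (hypothetical), not the adult dataset.
ages = [36] * 4 + [35] * 3 + [33] * 2 + [90]
for row in value_table(ages, ngroups=2):
    print(row)
```

In the table above, the ten listed counts sum to 13079, so the remaining 64 of the 74 distinct values account for 48842 - 13079 = 35763 rows, or 73.2%.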

When col1 is a categorical column, plot(df, col1) computes column statistics and plots a bar chart, pie chart, word cloud, word frequency chart and word length histogram:

[4]:

plot(df, "education")

[4]:

DataPrep.EDA Report

Stats Bar Chart Pie Chart Word Cloud Word Frequency Word Length Value Table

Overview

Approximate Distinct Count	16
Approximate Unique (%)	0.0%
Missing	0
Missing (%)	0.0%
Memory Size	3.5 MB

Length

Mean	9.4221
Standard Deviation	2.4401
Median	8
Minimum	4
Maximum	13

Sample

1st row	11th
2nd row	HS-grad
3rd row	Assoc-acdm
4th row	Some-college
5th row	Some-college

Letter

Count	366588
Lowercase Letter	308287
Space Separator	48842
Uppercase Letter	58301
Dash Punctuation	32869
Decimal Number	11894

'bar.bars': 10

Maximum number of bars to display

'bar.sort_descending': True

Whether to sort the bars in descending order

'bar.yscale': 'linear'

Y-axis scale ("linear" or "log")

'bar.color': '#1f77b4'

Color

'height': 400

Height of the plot

'width': 450

Width of the plot

'pie.slices': 10

Maximum number of pie slices to display

'pie.sort_descending': True

Whether to sort the slices in descending order of frequency

'pie.colors': ['#1f77b4', '#aec7e8', '#ff7f0e', '#ffbb78', '#2ca02c', '#98df8a', '#d62728', '#ff9896', '#9467bd', '#c5b0d5', '#8c564b']

List of colors

'height': 400

Height of the plot

'width': 450

Width of the plot

The top 2 categories ( HS-grad, Some-college) take over 50.0%

'wordcloud.top_words': 30

Maximum number of most frequent words to display

'wordcloud.stopword': True

Whether to remove stopwords

'wordcloud.lemmatize': False

Whether to lemmatize the words

'wordcloud.stem': False

Whether to apply Porter stemming to the words

'height': 400

Height of the plot

'width': 450

Width of the plot

'wordfreq.top_words': 30

Maximum number of most frequent words to display

'wordfreq.stopword': True

Whether to remove stopwords

'wordfreq.lemmatize': False

Whether to lemmatize the words

'wordfreq.stem': False

Whether to apply Porter stemming to the words

'wordfreq.color': '#1f77b4'

Color

'height': 400

Height of the plot

'width': 450

Width of the plot

'wordlen.bins': 50

Number of bins in the histogram

'wordlen.yscale': 'linear'

Y-axis scale ("linear" or "log")

'wordlen.color': '#aec7e8'

Color

'height': 400

Height of the plot

'width': 450

Width of the plot

'value_table.ngroups': 10

The number of distinct values to show

Value	Count	Frequency (%)
HS-grad	15784	32.3%
Some-college	10878	22.3%
Bachelors	8025	16.4%
Masters	2657	5.4%
Assoc-voc	2061	4.2%
11th	1812	3.7%
Assoc-acdm	1601	3.3%
10th	1389	2.8%
7th-8th	955	2.0%
Prof-school	834	1.7%
Other values (6)	2846	5.8%
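
The pie chart insight for education states that the top two categories account for more than half of the rows. That can be confirmed from the value table with a cumulative sum, as sketched below. The helper categories_to_cover is a hypothetical name, and the counts are copied from the table above.

```python
# Counts copied from the value table for the education column.
counts = {
    "HS-grad": 15784, "Some-college": 10878, "Bachelors": 8025, "Masters": 2657,
    "Assoc-voc": 2061, "11th": 1812, "Assoc-acdm": 1601, "10th": 1389,
    "7th-8th": 955, "Prof-school": 834,
}
total_rows = 48842  # includes the 2846 rows in "Other values"


def categories_to_cover(counts, total, threshold=0.5):
    """Return the smallest set of most frequent categories whose combined
    share exceeds threshold, together with that share (hypothetical helper)."""
    covered = 0
    chosen = []
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        covered += count
        chosen.append(name)
        if covered / total > threshold:
            break
    return chosen, covered / total


chosen, share = categories_to_cover(counts, total_rows)
print(chosen, round(100 * share, 1))
```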

When col1 is a geography column, plot(df, col1) computes column statistics and plots a bar chart, pie chart, word cloud, word frequency chart and world map:

[5]:

df_geo = load_dataset('countries')
plot(df_geo, "Country")

[5]:

DataPrep.EDA Report

Stats Bar Chart Pie Chart Word Cloud Word Frequency World Map Value Table

Overview

Approximate Distinct Count	227
Approximate Unique (%)	100.0%
Missing	0
Missing (%)	0.0%
Memory Size	16.6 KB

Length

Mean	9.9251
Standard Deviation	4.1858
Median	9
Minimum	5
Maximum	33

Sample

1st row	Afghanistan
2nd row	Albania
3rd row	Algeria
4th row	American Samoa
5th row	Andorra

Letter

Count	1920
Lowercase Letter	1623
Space Separator	309
Uppercase Letter	297
Dash Punctuation	1
Decimal Number	0

'bar.bars': 10

Maximum number of bars to display

'bar.sort_descending': True

Whether to sort the bars in descending order

'bar.yscale': 'linear'

Y-axis scale ("linear" or "log")

'bar.color': '#1f77b4'

Color

'height': 400

Height of the plot

'width': 450

Width of the plot

'pie.slices': 10

Maximum number of pie slices to display

'pie.sort_descending': True

Whether to sort the slices in descending order of frequency

'pie.colors': ['#1f77b4', '#aec7e8', '#ff7f0e', '#ffbb78', '#2ca02c', '#98df8a', '#d62728', '#ff9896', '#9467bd', '#c5b0d5', '#8c564b']

List of colors

'height': 400

Height of the plot

'width': 450

Width of the plot

'wordcloud.top_words': 30

Maximum number of most frequent words to display

'wordcloud.stopword': True

Whether to remove stopwords

'wordcloud.lemmatize': False

Whether to lemmatize the words

'wordcloud.stem': False

Whether to apply Porter stemming to the words

'height': 400

Height of the plot

'width': 450

Width of the plot

'wordfreq.top_words': 30

Maximum number of most frequent words to display

'wordfreq.stopword': True

Whether to remove stopwords

'wordfreq.lemmatize': False

Whether to lemmatize the words

'wordfreq.stem': False

Whether to apply Porter stemming to the words

'wordfreq.color': '#1f77b4'

Color

'height': 400

Height of the plot

'width': 450

Width of the plot

The largest value (islands) is over 1.75 times larger than the second largest value (saint)

'value_table.ngroups': 10

The number of distinct values to show

Value	Count	Frequency (%)
Afghanistan	1	0.4%
Albania	1	0.4%
Algeria	1	0.4%
American Samoa	1	0.4%
Andorra	1	0.4%
Angola	1	0.4%
Anguilla	1	0.4%
Antigua & Barbuda	1	0.4%
Argentina	1	0.4%
Armenia	1	0.4%
Other values (217)	217	95.6%
