plot(): analyze distributions

Overview

The function plot() explores the distributions and statistics of a dataset. It generates a variety of visualizations and statistics that help the user understand the column distributions and their relationships. The following describes the functionality of plot() for a given dataframe df.

  1. plot(df): plots the distribution of each column and computes dataset statistics

  2. plot(df, col1): plots the distribution of column col1 in various ways, and computes its statistics

  3. plot(df, col1, col2): generates plots depicting the relationship between columns col1 and col2

The generated plots differ for numerical, categorical, geography and geopoint columns. The following table summarizes the output for each combination of column types.

col1          col2          Output
None          None          dataset statistics, histogram or bar chart for each column
Numerical     None          column statistics, histogram, kde plot, qq-normal plot, box plot
Categorical   None          column statistics, bar chart, pie chart, word cloud, word frequencies
Geography     None          column statistics, bar chart, pie chart, word cloud, word frequencies, world map
Numerical     Numerical     scatter plot, hexbin plot, binned box plot
Numerical     Categorical   categorical box plot, multi-line chart
Categorical   Numerical     categorical box plot, multi-line chart
Categorical   Categorical   nested bar chart, stacked bar chart, heat map
Categorical   Geography     nested bar chart, stacked bar chart, heat map
Geography     Categorical   nested bar chart, stacked bar chart, heat map
Geopoint      Categorical   nested bar chart, stacked bar chart, heat map
Categorical   Geopoint      nested bar chart, stacked bar chart, heat map
Numerical     Geography     categorical box plot, multi-line chart, world map
Geography     Numerical     categorical box plot, multi-line chart, world map
Numerical     Geopoint      geo map
Geopoint      Numerical     geo map
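The type-to-output mapping in the table can be read as a lookup keyed on the pair of column types, where the order of the two columns does not matter. The following is a minimal sketch of that reading; expected_output() and the OUTPUTS dictionary are hypothetical names used for illustration and are not part of dataprep.

```python
# Hypothetical lookup that mirrors the table above; not dataprep's implementation.
BAR_FAMILY = "nested bar chart, stacked bar chart, heat map"
BOX_FAMILY = "categorical box plot, multi-line chart"

OUTPUTS = {
    (None, None): "dataset statistics, histogram or bar chart for each column",
    ("numerical", None): "column statistics, histogram, kde plot, qq-normal plot, box plot",
    ("categorical", None): "column statistics, bar chart, pie chart, word cloud, word frequencies",
    ("geography", None): "column statistics, bar chart, pie chart, word cloud, "
                         "word frequencies, world map",
    ("numerical", "numerical"): "scatter plot, hexbin plot, binned box plot",
    ("numerical", "categorical"): BOX_FAMILY,
    ("categorical", "categorical"): BAR_FAMILY,
    ("categorical", "geography"): BAR_FAMILY,
    ("categorical", "geopoint"): BAR_FAMILY,
    ("numerical", "geography"): BOX_FAMILY + ", world map",
    ("numerical", "geopoint"): "geo map",
}


def expected_output(col1_type=None, col2_type=None):
    """Return the table's Output entry for a pair of column types."""
    key = (col1_type, col2_type)
    if key in OUTPUTS:
        return OUTPUTS[key]
    # The table lists both orders with the same output, so try the swapped pair.
    # Combinations that the table does not list raise KeyError.
    return OUTPUTS[(col2_type, col1_type)]


print(expected_output("categorical", "numerical"))
print(expected_output("geopoint", "numerical"))
```

Storing each unordered pair once keeps the sketch short while still reproducing every row of the table.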

Next, we demonstrate the functionality of plot().

Load the dataset

dataprep.eda supports Pandas and Dask dataframes. Here, we will load the well-known adult dataset into a Pandas dataframe using the load_dataset function.

[1]:
from dataprep.datasets import load_dataset
import numpy as np
df = load_dataset('adult')
df = df.replace(" ?", np.nan)  # mark " ?" entries as missing (np.NaN was removed in NumPy 2.0)

Get an overview of the dataset with plot(df)

We start by calling plot(df), which computes dataset-level statistics, a histogram for each numerical column, and a bar chart for each categorical column. The number of bins in the histogram can be specified with the parameter bins, and the number of categories in the bar chart can be specified with the parameter ngroups. If a column contains missing values, the percentage of missing values is shown in the plot title, and the missing values are ignored when generating the plots.
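To make the roles of the bin count and the missing-value handling concrete, here is a minimal pure-Python sketch of the idea. It is not dataprep's implementation; the helper name histogram() and the sample values are made up for illustration.

```python
import math


def histogram(values, bins=10):
    """Return (missing_percent, bin_edges, counts), ignoring None and NaN."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    missing_pct = 100.0 * (len(values) - len(present)) / len(values)
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in present:
        # Clamp so that the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return missing_pct, edges, counts


ages = [17, 25, None, 38, 38, 52, float("nan"), 90]
missing_pct, edges, counts = histogram(ages, bins=4)
print(f"{missing_pct:.1f}% missing, counts per bin: {counts}")
```

The missing percentage is computed against all rows, while the bin counts only cover the values that are present, which matches the behavior described above.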

[2]:
from dataprep.eda import plot
plot(df)
[2]:
DataPrep.EDA Report
Dataset Statistics
Number of Variables 15
Number of Rows 48842
Missing Cells 6465
Missing Cells (%) 0.9%
Duplicate Rows 52
Duplicate Rows (%) 0.1%
Total Size in Memory 30.1 MB
Average Row Size in Memory 645.7 B
Variable Types
  • Numerical: 6
  • Categorical: 9
Dataset Insights
  • workclass has 2799 (5.73%) missing values
  • occupation has 2809 (5.75%) missing values
  • native-country has 857 (1.75%) missing values
  • fnlwgt is skewed
  • education-num is skewed
  • capital-gain is skewed
  • capital-loss is skewed
  • hours-per-week is skewed
  • capital-gain has 44807 (91.74%) zeros
  • capital-loss has 46560 (95.33%) zeros
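The percentages in this report follow directly from the reported counts. As a quick check, the sketch below recomputes them from the numbers shown above; pct() is a hypothetical helper, and the inputs are copied from the report rather than computed from the dataset.

```python
n_rows, n_cols = 48842, 15  # from "Number of Rows" and "Number of Variables"


def pct(part, whole, digits=2):
    """Percentage of part in whole, rounded to the given number of digits."""
    return round(100.0 * part / whole, digits)


missing_by_column = {"workclass": 2799, "occupation": 2809, "native-country": 857}
missing_cells = sum(missing_by_column.values())

print(missing_cells)                              # total missing cells
print(pct(missing_cells, n_rows * n_cols, 1))     # missing cells as % of all cells
print(pct(52, n_rows, 1))                         # duplicate rows (%)
for name, count in missing_by_column.items():
    print(name, pct(count, n_rows))               # per-column missing (%)
print(pct(44807, n_rows), pct(46560, n_rows))     # zeros in capital-gain / capital-loss
```

Note that the dataset-level missing percentage is relative to all cells (rows times columns), whereas the per-column percentages are relative to the number of rows.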

Understand a column with plot(df, col1)

After getting an overview of the dataset, we can thoroughly investigate a column of interest col1 using plot(df, col1). The output of plot(df, col1) is different for numerical, categorical and geography columns.

When col1 is a numerical column, plot() computes column statistics and generates a histogram, kde plot, qq-normal plot and box plot:

[3]:
plot(df, "age")
[3]:
DataPrep.EDA Report

Overview

Approximate Distinct Count: 74
Approximate Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Memory Size: 763.2 KB
Mean: 38.6436
Minimum: 17
Maximum: 90
Zeros: 0
Zeros (%): 0.0%
Negatives: 0
Negatives (%): 0.0%

Quantile Statistics

Minimum: 17
5-th Percentile: 19
Q1: 28
Median: 37
Q3: 48
95-th Percentile: 63
Maximum: 90
Range: 73
IQR: 20

Descriptive Statistics

Mean: 38.6436
Standard Deviation: 13.7105
Variance: 187.9781
Sum: 1.8874e+06
Skewness: 0.5576
Kurtosis: -0.1844
Coefficient of Variation: 0.3548

Histogram settings
  'hist.bins': 50 (number of bins in the histogram)
  'hist.yscale': 'linear' (y-axis scale, "linear" or "log")
  'hist.color': '#aec7e8' (color)
  'height': 400, 'width': 450 (height and width of the plot)
  • age is skewed right (γ1 = 0.5576)

KDE plot settings
  'kde.bins': 50 (number of bins in the histogram)
  'kde.yscale': 'linear' (y-axis scale, "linear" or "log")
  'kde.hist_color': '#aec7e8' (color of the density histogram)
  'kde.line_color': '#d62728' (color of the density line)
  'height': 400, 'width': 450 (height and width of the plot)

QQ-normal plot settings
  'qqnorm.point_color': '#1f77b4' (color of the points)
  'qqnorm.line_color': '#d62728' (color of the line)
  'height': 400, 'width': 450 (height and width of the plot)

Box plot settings
  'box.color': '#1f77b4' (color)
  'height': 400, 'width': 450 (height and width of the plot)
  • age has 216 outliers

Value table settings
  'value_table.ngroups': 10 (number of distinct values to show)

Value               Count   Frequency (%)
36                  1348    2.8%
35                  1337    2.7%
33                  1335    2.7%
23                  1329    2.7%
31                  1325    2.7%
34                  1303    2.7%
28                  1280    2.6%
37                  1280    2.6%
30                  1278    2.6%
38                  1264    2.6%
Other values (64)   35763   73.2%
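Several of the statistics reported for age are derived from the others. The sketch below recomputes them from the reported values as a consistency check. The 1.5 × IQR fences at the end are the conventional Tukey rule for box-plot outliers; whether dataprep uses exactly this rule to count the 216 outliers is an assumption here, not something the report states.

```python
# Values copied from the report for the "age" column.
minimum, q1, median, q3, maximum = 17, 28, 37, 48, 90
mean, std, n_rows = 38.6436, 13.7105, 48842

value_range = maximum - minimum   # reported as Range
iqr = q3 - q1                     # reported as IQR
variance = std ** 2               # reported as Variance (up to rounding of std)
cv = std / mean                   # reported as Coefficient of Variation
total = mean * n_rows             # reported as Sum (up to rounding of mean)

# Conventional Tukey fences (assumed, see the note above).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(value_range, iqr, round(variance, 2), round(cv, 4), f"{total:.4e}")
print(lower_fence, upper_fence)
```

Under that assumption, ages above the upper fence would be the ones flagged as outliers, since no age can fall below the lower fence.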

When col1 is a categorical column, plot() computes column statistics and generates a bar chart, pie chart, word cloud, word frequency plot and word length plot:

[4]:
plot(df, "education")
[4]:
DataPrep.EDA Report

Overview

Approximate Distinct Count: 16
Approximate Unique (%): 0.0%
Missing: 0
Missing (%): 0.0%
Memory Size: 3.5 MB

Length

Mean: 9.4221
Standard Deviation: 2.4401
Median: 8
Minimum: 4
Maximum: 13

Sample

1st row: 11th
2nd row: HS-grad
3rd row: Assoc-acdm
4th row: Some-college
5th row: Some-college

Letter

Count: 366588
Lowercase Letter: 308287
Space Separator: 48842
Uppercase Letter: 58301
Dash Punctuation: 32869
Decimal Number: 11894

Bar chart settings
  'bar.bars': 10 (maximum number of bars to display)
  'bar.sort_descending': True (whether to sort the bars in descending order)
  'bar.yscale': 'linear' (y-axis scale, "linear" or "log")
  'bar.color': '#1f77b4' (color)
  'height': 400, 'width': 450 (height and width of the plot)

Pie chart settings
  'pie.slices': 10 (maximum number of pie slices to display)
  'pie.sort_descending': True (whether to sort the slices in descending order of frequency)
  'pie.colors': ['#1f77b4', '#aec7e8', '#ff7f0e', '#ffbb78', '#2ca02c', '#98df8a', '#d62728', '#ff9896', '#9467bd', '#c5b0d5', '#8c564b'] (list of colors)
  'height': 400, 'width': 450 (height and width of the plot)
  • The top 2 categories (HS-grad, Some-college) take over 50.0%

Word cloud settings
  'wordcloud.top_words': 30 (maximum number of most frequent words to display)
  'wordcloud.stopword': True (whether to remove stopwords)
  'wordcloud.lemmatize': False (whether to lemmatize the words)
  'wordcloud.stem': False (whether to apply Porter stemming to the words)
  'height': 400, 'width': 450 (height and width of the plot)

Word frequency settings
  'wordfreq.top_words': 30 (maximum number of most frequent words to display)
  'wordfreq.stopword': True (whether to remove stopwords)
  'wordfreq.lemmatize': False (whether to lemmatize the words)
  'wordfreq.stem': False (whether to apply Porter stemming to the words)
  'wordfreq.color': '#1f77b4' (color)
  'height': 400, 'width': 450 (height and width of the plot)

Word length settings
  'wordlen.bins': 50 (number of bins in the histogram)
  'wordlen.yscale': 'linear' (y-axis scale, "linear" or "log")
  'wordlen.color': '#aec7e8' (color)
  'height': 400, 'width': 450 (height and width of the plot)

Value table settings
  'value_table.ngroups': 10 (number of distinct values to show)

Value              Count   Frequency (%)
HS-grad            15784   32.3%
Some-college       10878   22.3%
Bachelors          8025    16.4%
Masters            2657    5.4%
Assoc-voc          2061    4.2%
11th               1812    3.7%
Assoc-acdm         1601    3.3%
10th               1389    2.8%
7th-8th            955     2.0%
Prof-school        834     1.7%
Other values (6)   2846    5.8%
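The value table and the "top 2 categories" insight can both be reproduced from the category counts. The following is a minimal sketch, assuming the table keeps the ngroups most frequent values and lumps the remainder into an "Other values" row; value_table() is a hypothetical helper, and the counts are copied from the output above.

```python
def value_table(counts, total, ngroups=10):
    """Return (value, count, percent) rows for the most frequent values,
    followed by an "Other values" row for whatever is left."""
    top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:ngroups]
    rows = [(value, count, round(100.0 * count / total, 1)) for value, count in top]
    other = total - sum(count for _, count in top)
    if other > 0:
        rows.append(("Other values", other, round(100.0 * other / total, 1)))
    return rows


# Counts of the ten most frequent education values, copied from the report.
education_counts = {
    "HS-grad": 15784, "Some-college": 10878, "Bachelors": 8025, "Masters": 2657,
    "Assoc-voc": 2061, "11th": 1812, "Assoc-acdm": 1601, "10th": 1389,
    "7th-8th": 955, "Prof-school": 834,
}
total_rows = 48842

rows = value_table(education_counts, total_rows, ngroups=10)
for row in rows:
    print(row)

# Share of the two most frequent categories, as in the pie chart insight.
top2_share = 100.0 * (rows[0][1] + rows[1][1]) / total_rows
print(round(top2_share, 1))
```

The last line confirms that HS-grad and Some-college together account for more than half of the rows.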

When col1 is a geography column, plot() computes column statistics and generates a bar chart, pie chart, word cloud, word frequency plot, word length plot and world map:

[5]:
df_geo = load_dataset('countries')
plot(df_geo, "Country")
[5]:
DataPrep.EDA Report

Overview

Approximate Distinct Count: 227
Approximate Unique (%): 100.0%
Missing: 0
Missing (%): 0.0%
Memory Size: 16.6 KB

Length

Mean: 9.9251
Standard Deviation: 4.1858
Median: 9
Minimum: 5
Maximum: 33

Sample

1st row: Afghanistan
2nd row: Albania
3rd row: Algeria
4th row: American Samoa
5th row: Andorra

Letter

Count: 1920
Lowercase Letter: 1623
Space Separator: 309
Uppercase Letter: 297
Dash Punctuation: 1
Decimal Number: 0

Bar chart settings
  'bar.bars': 10 (maximum number of bars to display)
  'bar.sort_descending': True (whether to sort the bars in descending order)
  'bar.yscale': 'linear' (y-axis scale, "linear" or "log")
  'bar.color': '#1f77b4' (color)
  'height': 400, 'width': 450 (height and width of the plot)

Pie chart settings
  'pie.slices': 10 (maximum number of pie slices to display)
  'pie.sort_descending': True (whether to sort the slices in descending order of frequency)
  'pie.colors': ['#1f77b4', '#aec7e8', '#ff7f0e', '#ffbb78', '#2ca02c', '#98df8a', '#d62728', '#ff9896', '#9467bd', '#c5b0d5', '#8c564b'] (list of colors)
  'height': 400, 'width': 450 (height and width of the plot)

Word cloud settings
  'wordcloud.top_words': 30 (maximum number of most frequent words to display)
  'wordcloud.stopword': True (whether to remove stopwords)
  'wordcloud.lemmatize': False (whether to lemmatize the words)
  'wordcloud.stem': False (whether to apply Porter stemming to the words)
  'height': 400, 'width': 450 (height and width of the plot)

Word frequency settings
  'wordfreq.top_words': 30 (maximum number of most frequent words to display)
  'wordfreq.stopword': True (whether to remove stopwords)
  'wordfreq.lemmatize': False (whether to lemmatize the words)
  'wordfreq.stem': False (whether to apply Porter stemming to the words)
  'wordfreq.color': '#1f77b4' (color)
  'height': 400, 'width': 450 (height and width of the plot)
  • The largest value (islands) is over 1.75 times larger than the second largest value (saint)

Value table settings
  'value_table.ngroups': 10 (number of distinct values to show)

Value                Count   Frequency (%)
Afghanistan          1       0.4%
Albania              1       0.4%
Algeria              1       0.4%
American Samoa       1       0.4%
Andorra              1       0.4%
Angola               1       0.4%
Anguilla             1       0.4%
Antigua & Barbuda    1       0.4%
Argentina            1       0.4%
Armenia              1       0.4%
Other values (217)   217     95.6%

Understand the relationship between two columns with plot(df, col1, col2)

Next, we can explore the relationship between columns col1 and col2 using plot(df, col1, col2). The output depends on the types of the columns.

When col1 and col2 are both numerical columns, plot() generates a scatter plot, hexbin plot and binned box plot:

[6]:
plot(df, "age", "hours-per-week")
[6]:
DataPrep.EDA Report

Scatter plot settings
  'scatter.sample_size': 1000 (number of points to randomly sample per partition)
  'height': 400, 'width': 450 (height and width of the plot)

Hexbin plot settings
  'hexbin.tile_size': 2.92 (tile size, measured from the middle of the hexagon to the left or right corner)
  'height': 400, 'width': 450 (height and width of the plot)

Binned box plot settings
  'box.bins': 50 (number of bins)
  'height': 400, 'width': 450 (height and width of the plot)
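A binned box plot groups the first numerical column into equal-width bins and draws one box of the second column per bin. The sketch below illustrates that idea in pure Python. It is an assumption about the general technique, not dataprep's code; binned_box_stats() and the toy data are hypothetical.

```python
from statistics import quantiles


def binned_box_stats(xs, ys, bins=5):
    """Bin xs into equal-width bins and return (left, right, q1, median, q3)
    of the ys that fall into each bin."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all xs are equal
    groups = [[] for _ in range(bins)]
    for x, y in zip(xs, ys):
        idx = min(int((x - lo) / width), bins - 1)  # maximum goes to the last bin
        groups[idx].append(y)
    stats = []
    for i, group in enumerate(groups):
        if len(group) < 2:
            continue  # statistics.quantiles needs at least two data points
        q1, med, q3 = quantiles(group, n=4, method="inclusive")
        stats.append((lo + i * width, lo + (i + 1) * width, q1, med, q3))
    return stats


# Toy data standing in for "age" and "hours-per-week".
age = [20, 22, 24, 40, 42, 44, 60, 62, 64]
hours = [20, 30, 40, 40, 45, 50, 30, 35, 40]

stats = binned_box_stats(age, hours, bins=3)
for left, right, q1, med, q3 in stats:
    print(f"[{left:.1f}, {right:.1f}]: Q1={q1}, median={med}, Q3={q3}")
```

Each tuple corresponds to one box in the plot, so the number of boxes is bounded by the bin count (50 by default in the settings above).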

When col1 and col2 are both categorical columns, plot() generates a nested bar chart, stacked bar chart and heat map:

[7]:
plot(df, "education", "marital-status")
[7]:
DataPrep.EDA Report

Nested bar chart settings
  'nested.ngroups': 10 (maximum number of most frequent values in column education to display)
  'nested.nsubgroups': 5 (maximum number of most frequent values in column marital-status to display, computed on the filtered data consisting of the most frequent values in column education)
  'height': 300, 'width': 972 (height and width of the plot)

Stacked bar chart settings
  'stacked.ngroups': 10 (maximum number of most frequent values in column education to display)
  'stacked.nsubgroups': 5 (maximum number of most frequent values in column marital-status to display, computed on the filtered data consisting of the most frequent values in column education)
  'height': 300, 'width': 972 (height and width of the plot)

Heat map settings
  'heatmap.ngroups': 10 (maximum number of most frequent values in column education to display)
  'heatmap.nsubgroups': 5 (maximum number of most frequent values in column marital-status to display, computed on the filtered data consisting of the most frequent values in column education)
  'height': 300, 'width': 972 (height and width of the plot)
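The descriptions of ngroups and nsubgroups imply a two-step filter: first keep the most frequent values of the first column, then pick the most frequent values of the second column within that filtered data. The following sketch shows one way to read this, using toy data; top_pairs() is a hypothetical helper and not dataprep's implementation.

```python
from collections import Counter


def top_pairs(pairs, ngroups=10, nsubgroups=5):
    """Return (top values of col1, top values of col2, cell counts)."""
    pairs = list(pairs)
    # Step 1: the ngroups most frequent values of the first column.
    top_a = [a for a, _ in Counter(a for a, _ in pairs).most_common(ngroups)]
    keep_a = set(top_a)
    filtered = [(a, b) for a, b in pairs if a in keep_a]
    # Step 2: the nsubgroups most frequent values of the second column,
    # counted only on the rows that survived step 1.
    top_b = [b for b, _ in Counter(b for _, b in filtered).most_common(nsubgroups)]
    keep_b = set(top_b)
    cells = Counter((a, b) for a, b in filtered if b in keep_b)
    return top_a, top_b, cells


# Toy (col1, col2) pairs; the labels are illustrative only.
toy = (
    [("HS-grad", "married")] * 3
    + [("HS-grad", "divorced")] * 1
    + [("Bachelors", "married")] * 2
    + [("Masters", "single")] * 1
)

top_a, top_b, cells = top_pairs(toy, ngroups=2, nsubgroups=1)
print(top_a, top_b, dict(cells))
```

The resulting cell counts are what a nested bar chart, stacked bar chart or heat map would display, each with a different visual encoding.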

When one of col1 and col2 is numerical and the other is categorical, plot() generates a box plot per category and a multi-line chart:

[8]:
plot(df, "age", "education")
# or plot(df, "education", "age")
[8]:
DataPrep.EDA Report

Box plot settings
  'box.ngroups': 15 (maximum number of groups to display)
  'box.sort_descending': True (whether to sort the boxes in descending order of frequency)
  'height': 400, 'width': 450 (height and width of the plot)

Multi-line chart settings
  'line.ngroups': 10 (maximum number of groups to display)
  'line.sort_descending': True (whether to sort the groups in descending order of frequency)
  'height': 400, 'width': 450 (height and width of the plot)
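Both plots rest on the same grouping step: split the numerical values by category, keep the most frequent categories, and summarize each group. The sketch below shows that step on toy data. It is a simplified illustration under these assumptions; per_category_summary() is a hypothetical helper, and a real box plot would also need quartiles and outliers.

```python
from collections import defaultdict
from statistics import median


def per_category_summary(categories, values, ngroups=15):
    """Group values by category, keep the ngroups most frequent categories
    (in descending order of frequency), and summarize each group."""
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)
    ordered = sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
    return [
        (category, len(vals), min(vals), median(vals), max(vals))
        for category, vals in ordered[:ngroups]
    ]


# Toy data standing in for "education" and "age".
education = ["HS-grad", "HS-grad", "HS-grad", "Bachelors", "Bachelors", "Masters"]
age = [25, 40, 55, 30, 50, 45]

summary = per_category_summary(education, age, ngroups=2)
for row in summary:
    print(row)
```

With ngroups=2 the least frequent category is dropped, which mirrors how the ngroups settings above limit the number of boxes or lines.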

When one of col1 and col2 is of type geopoint or geography and the other is categorical, plot() generates a nested bar chart, stacked bar chart and heat map:

[9]:
from dataprep.eda.dtypes_v2 import LatLong
covid = load_dataset('covid19')
# create a geopoint type by passing the latitude and longitude column names to LatLong
latlong = LatLong("Lat", "Long")
plot(covid, latlong, "Country/Region")
# or plot(covid, "Country/Region", latlong)

plot(df_geo, "Country", "Region")
# or plot(df_geo, "Region", "Country")
[9]:
DataPrep.EDA Report

Nested bar chart settings
  'nested.ngroups': 10 (maximum number of most frequent values in column Country to display)
  'nested.nsubgroups': 5 (maximum number of most frequent values in column Region to display, computed on the filtered data consisting of the most frequent values in column Country)
  'height': 300, 'width': 972 (height and width of the plot)

Stacked bar chart settings
  'stacked.ngroups': 10 (maximum number of most frequent values in column Country to display)
  'stacked.nsubgroups': 5 (maximum number of most frequent values in column Region to display, computed on the filtered data consisting of the most frequent values in column Country)
  'height': 300, 'width': 972 (height and width of the plot)

Heat map settings
  'heatmap.ngroups': 10 (maximum number of most frequent values in column Country to display)
  'heatmap.nsubgroups': 5 (maximum number of most frequent values in column Region to display, computed on the filtered data consisting of the most frequent values in column Country)
  'height': 300, 'width': 972 (height and width of the plot)

When one of col1 and col2 is of type geography and the other is numerical, plot() generates a box plot per category, a multi-line chart and a world map:

[10]:
plot(df_geo, "Country", "Population")
# or plot(df_geo, "Population", "Country")
[10]:
DataPrep.EDA Report

Box plot settings
  'box.ngroups': 15 (maximum number of groups to display)
  'box.sort_descending': True (whether to sort the boxes in descending order of frequency)
  'height': 400, 'width': 450 (height and width of the plot)

Multi-line chart settings
  'line.ngroups': 10 (maximum number of groups to display)
  'line.sort_descending': True (whether to sort the groups in descending order of frequency)
  'height': 400, 'width': 450 (height and width of the plot)

When one of col1 and col2 is of type geopoint and the other is numerical, plot() generates a geo map:

[11]:
plot(covid, latlong, "2/16/2020")
# or plot(covid, "2/16/2020", latlong)
[11]:
DataPrep.EDA Report