Insights

This section introduces the insights supported by DataPrep.

[1]:
%reload_ext autoreload
%autoreload 2
from dataprep.datasets import load_dataset
from dataprep.eda import plot, plot_correlation, plot_missing
[2]:
df = load_dataset("titanic")
[3]:
df
[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns

Where to find the insights

For example:

1. Click the “Show Stats and Insights” button.
2. The insights provided by DataPrep are then displayed in the report.

If we use the plot(df, col) function, the insights are likewise shown only after clicking a button in the report.

The insights provided by plot(df)

The following lists the insights that plot(df) can provide, followed by an example.

| Insight | Applied plots | Type | Threshold | Description |
|---|---|---|---|---|
| Duplicates | Overview | int | 1 | Warn if the percent of duplicated values is above this threshold. |
| Negatives | Overview | int | 1 | Warn if the percent of negatives is above this threshold. |
| Similar_distribution | Overview | float | 0.05 | The significance level for the Kolmogorov–Smirnov test. |
| Uniform | Histogram | float | 0.999 | The p-value threshold for the chi-square test. |
| Missing | Histogram | int | 1 | Warn if the percent of missing values is above this threshold. |
| Skewed | Histogram | float | 1e-5 | The p-value threshold for scipy.stats.skewtest, which tests whether the skewness differs from that of a normal distribution. |
| Infinity | Histogram | int | 1 | Warn if the percent of infinite values is above this threshold. |
| Zeros | Histogram | int | 5 | Warn if the percent of zeros is above this threshold. |
| Normal | Histogram | float | 0.99 | The p-value threshold for the normality test, which is based on D’Agostino and Pearson’s test that combines skew and kurtosis to produce an omnibus test of normality. |
| High Cardinality | Bar Chart | int | 50 | The threshold for the unique value count; a count larger than the threshold indicates high cardinality. |
| Constant | Bar Chart | int | 1 | The threshold for the unique value count; a count equal to the threshold indicates a constant value. |

[4]:
plot(df)
[4]:
DataPrep.EDA Report
Dataset Statistics
Number of Variables 12
Number of Rows 891
Missing Cells 866
Missing Cells (%) 8.1%
Duplicate Rows 0
Duplicate Rows (%) 0.0%
Total Size in Memory 315.0 KB
Average Row Size in Memory 362.1 B
Variable Types
  • Numerical: 3
  • Categorical: 9
Dataset Insights
PassengerId is uniformly distributed Uniform
Age has 177 (19.87%) missing values Missing
Cabin has 687 (77.1%) missing values Missing
Fare is skewed Skewed
Name has a high cardinality: 891 distinct values High Cardinality
Ticket has a high cardinality: 681 distinct values High Cardinality
Cabin has a high cardinality: 147 distinct values High Cardinality
Survived has constant length 1 Constant Length
Pclass has constant length 1 Constant Length
SibSp has constant length 1 Constant Length
Parch has constant length 1 Constant Length
Embarked has constant length 1 Constant Length
Name has all distinct values Unique
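The Missing and High Cardinality insights in the report above can be checked by hand. The following is a minimal sketch, not DataPrep's implementation: the counts are copied from the report, the thresholds are the defaults listed in the table, and the helper name `missing_percent` and the strict `>` comparison are assumptions made for illustration.

```python
# Counts copied from the report above (891 rows in total).
N_ROWS = 891
missing_counts = {"Age": 177, "Cabin": 687}
distinct_counts = {"Name": 891, "Ticket": 681, "Cabin": 147}

# Default thresholds from the insights table.
MISSING_THRESHOLD_PCT = 1        # percent of missing values
HIGH_CARDINALITY_THRESHOLD = 50  # number of distinct values


def missing_percent(n_missing: int, n_rows: int) -> float:
    """Hypothetical helper: share of missing values, in percent."""
    return 100 * n_missing / n_rows


# Assumption: an insight fires when the value is strictly above the threshold.
missing_flags = {
    col: round(missing_percent(n, N_ROWS), 2)
    for col, n in missing_counts.items()
    if missing_percent(n, N_ROWS) > MISSING_THRESHOLD_PCT
}
high_cardinality_flags = [
    col for col, n in distinct_counts.items() if n > HIGH_CARDINALITY_THRESHOLD
]

print(missing_flags)           # percentages of missing values per flagged column
print(high_cardinality_flags)  # columns with more than 50 distinct values
```

The computed percentages agree with the 19.87% and 77.1% shown in the report.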

The insights provided by plot(df, col) when col is a continuous column

The following lists the insights that plot(df, col) can yield when col is a continuous column, followed by an example.

| Insight | Applied plots | Type | Threshold | Description |
|---|---|---|---|---|
| Infinity | Stats | int | 1 | Warn if the percent of infinite values is above this threshold. |
| Missing | Stats | int | 1 | Warn if the percent of missing values is above this threshold. |
| Negatives | Stats | int | 1 | Warn if the percent of negatives is above this threshold. |
| Zeros | Stats | int | 5 | Warn if the percent of zeros is above this threshold. |
| Normal | Histogram, Normal Q-Q Plot | float | 0.99 | The p-value threshold for the normality test, which is based on D’Agostino and Pearson’s test that combines skew and kurtosis to produce an omnibus test of normality. |
| Uniform | Histogram | float | 0.999 | The p-value threshold for the chi-square test. |
| Skewed | Histogram | float | 1e-5 | The p-value threshold for scipy.stats.skewtest, which tests whether the skewness differs from that of a normal distribution. |
| Outliers | Box Plot | int | 0 | It shows how many outliers a column has. |
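The Skewed insight is based on a statistical test of skewness. As a rough illustration of the quantity being tested, the sketch below computes the (biased) sample skewness, often written γ1, in plain Python. It is not DataPrep's code, and `sample_skewness` is a hypothetical helper; whether DataPrep reports exactly this estimator is an assumption.

```python
def sample_skewness(values):
    """Biased sample skewness: third central moment over variance ** 1.5.

    Hypothetical helper for illustration; not part of DataPrep.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]      # symmetric around the mean
right_tailed = [1, 1, 1, 2, 10]  # one large value stretches the right tail

print(sample_skewness(symmetric))
print(sample_skewness(right_tailed))
```

A positive value indicates a longer right tail ("skewed right"), a negative value a longer left tail, and a value near zero a roughly symmetric distribution.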

[5]:
plot(df, "Age")
[5]:
DataPrep.EDA Report

Overview

Approximate Distinct Count: 88
Approximate Unique (%): 12.3%
Missing: 177
Missing (%): 19.9%
Infinite: 0
Infinite (%): 0.0%
Memory Size: 11.2 KB
Mean: 29.6991
Minimum: 0.42
Maximum: 80
Zeros: 0
Zeros (%): 0.0%
Negatives: 0
Negatives (%): 0.0%

Quantile Statistics

Minimum: 0.42
5-th Percentile: 4
Q1: 20.125
Median: 28
Q3: 38
95-th Percentile: 56
Maximum: 80
Range: 79.58
IQR: 17.875

Descriptive Statistics

Mean: 29.6991
Standard Deviation: 14.5265
Variance: 211.0191
Sum: 21205.17
Skewness: 0.3883
Kurtosis: 0.1686
Coefficient of Variation: 0.4891

'hist.bins': 50 (Number of bins in the histogram)
'hist.yscale': 'linear' (Y-axis scale ("linear" or "log"))
'hist.color': '#aec7e8' (Color)
'height': 400 (Height of the plot)
'width': 450 (Width of the plot)
  • Age is skewed right (γ1 = 0.3883)

'kde.bins': 50 (Number of bins in the histogram)
'kde.yscale': 'linear' (Y-axis scale ("linear" or "log"))
'kde.hist_color': '#aec7e8' (Color of the density histogram)
'kde.line_color': '#d62728' (Color of the density line)
'height': 400 (Height of the plot)
'width': 450 (Width of the plot)

'qqnorm.point_color': #1f77b4 (Color of the points)
'qqnorm.line_color': #d62728 (Color of the line)
'height': 400 (Height of the plot)
'width': 450 (Width of the plot)

'box.color': #1f77b4 (Color)
'height': 400 (Height of the plot)
'width': 450 (Width of the plot)
  • Age has 11 outliers

'value_table.ngroups': 10 (The number of distinct values to show)

Value              Count  Frequency (%)
24.0               30     3.4%
22.0               27     3.0%
18.0               26     2.9%
19.0               25     2.8%
28.0               25     2.8%
30.0               25     2.8%
21.0               24     2.7%
25.0               23     2.6%
36.0               22     2.5%
29.0               20     2.2%
Other values (78)  467    52.4%
(Missing)          177    19.9%
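To make the Outliers insight more concrete, the sketch below derives outlier bounds from the quartiles reported above. It assumes the common box-plot convention of whiskers at 1.5 × IQR beyond the quartiles; the source does not state which rule DataPrep applies, so treat this as an illustration only. `count_outliers` and `sample_ages` are hypothetical.

```python
# Quartiles copied from the "Quantile Statistics" output above.
Q1, Q3 = 20.125, 38.0
IQR = Q3 - Q1  # 17.875, as shown in the report

# Assumption: the usual box-plot convention of 1.5 * IQR whiskers.
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR


def count_outliers(values, lower, upper):
    """Count non-missing values that fall outside [lower, upper]."""
    return sum(1 for v in values if v is not None and (v < lower or v > upper))


# Hypothetical sample of ages, NOT the real column; None marks a missing value.
sample_ages = [0.42, 22.0, 38.0, None, 65.0, 80.0]

print(lower_fence, upper_fence)
print(count_outliers(sample_ages, lower_fence, upper_fence))
```

Under this assumption only ages above roughly 64.8 would count as outliers, since the lower bound is negative and no age can fall below it.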

The insights provided by plot(df, col) when col is a nominal column

The following lists the insights that plot(df, col) can present when col is a nominal column, followed by an example.

| Insight | Applied plots | Type | Threshold | Description |
|---|---|---|---|---|
| Constant | Stats | int | 1 | The threshold for the unique value count; a count equal to the threshold indicates a constant value. |
| High_cardinality | Stats | int | 50 | The threshold for the unique value count; a count larger than the threshold indicates high cardinality. |
| Missing | Stats | int | 1 | Warn if the percent of missing values is above this threshold. |
| Uniform | Bar Chart | float | 0.999 | The p-value threshold for the chi-square test. |
| Outstanding_no1 | Bar Chart | float | 1.5 | It measures the ratio of the largest category count to the second-largest category count. |
| Attribution | Pie Chart | float | 0.5 | It measures the percentage of the top 2 categories. |
| High_word_cardinality | Word Cloud | int | 1000 | The threshold for the high word cardinality insight, which measures the number of words in that category. |
| Outstanding_no1_word | Word Cloud | int | 0 | The threshold for the outstanding no. 1 word insight, which measures the ratio of the most frequent word count to the second most frequent word count. |

[6]:
plot(df, "Sex")
[6]:
DataPrep.EDA Report

Overview

Approximate Distinct Count: 2
Approximate Unique (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory Size: 60.7 KB

Length

Mean: 4.7048
Standard Deviation: 0.956
Median: 4
Minimum: 4
Maximum: 6

Sample

1st row: male
2nd row: female
3rd row: female
4th row: female
5th row: male

Letter

Count: 4192
Lowercase Letter: 4192
Space Separator: 0
Uppercase Letter: 0
Dash Punctuation: 0
Decimal Number: 0

'bar.bars': 10 (Maximum number of bars to display)
'bar.sort_descending': True (Whether to sort the bars in descending order)
'bar.yscale': 'linear' (Y-axis scale ("linear" or "log"))
'bar.color': '#1f77b4' (Color)
'height': 400 (Height of the plot)
'width': 450 (Width of the plot)
  • The largest value (male) is over 1.84 times larger than the second largest value (female)

'pie.slices': 10 (Maximum number of pie slices to display)
'pie.sort_descending': True (Whether to sort the slices in descending order of frequency)
'pie.colors': ['#1f77b4', '#aec7e8'] (List of colors)
'height': 400 (Height of the plot)
'width': 450 (Width of the plot)
  • The top 2 categories (male, female) take over 50.0%

'wordcloud.top_words': 30 (Maximum number of most frequent words to display)
'wordcloud.stopword': True (Whether to remove stopwords)
'wordcloud.lemmatize': False (Whether to lemmatize the words)
'wordcloud.stem': False (Whether to apply Potter Stem on the words)
'height': 400 (Height of the plot)
'width': 450 (Width of the plot)

'wordfreq.top_words': 30 (Maximum number of most frequent words to display)
'wordfreq.stopword': True (Whether to remove stopwords)
'wordfreq.lemmatize': False (Whether to lemmatize the words)
'wordfreq.stem': False (Whether to apply Potter Stem on the words)
'wordfreq.color': #1f77b4 (Color)
'height': 400 (Height of the plot)
'width': 450 (Width of the plot)
  • The largest value (male) is over 1.84 times larger than the second largest value (female)

'wordlen.bins': 50 (Number of bins in the histogram)
'wordlen.yscale': 'linear' (Y-axis scale ("linear" or "log"))
'wordlen.color': '#aec7e8' (Color)
'height': 400 (Height of the plot)
'width': 450 (Width of the plot)

'value_table.ngroups': 10 (The number of distinct values to show)

Value   Count  Frequency (%)
male    577    64.8%
female  314    35.2%
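The Outstanding_no1 and Attribution insights reported for Sex can be reproduced from the value table above. This is a minimal sketch rather than DataPrep's implementation: the counts come from the table, the thresholds are the defaults listed earlier, and the strict `>` comparison is an assumption.

```python
from collections import Counter

# Category counts copied from the value table above.
counts = Counter({"male": 577, "female": 314})
total = sum(counts.values())

# The two most frequent categories, in descending order of count.
(top_value, top_count), (second_value, second_count) = counts.most_common(2)

# Outstanding_no1: ratio of the largest count to the second-largest count.
ratio = top_count / second_count
# Attribution: share of all rows covered by the top 2 categories.
top2_share = (top_count + second_count) / total

OUTSTANDING_NO1_THRESHOLD = 1.5  # default from the table
ATTRIBUTION_THRESHOLD = 0.5      # default from the table

# Assumption: an insight fires when the value is strictly above the threshold.
outstanding_no1 = ratio > OUTSTANDING_NO1_THRESHOLD
attribution = top2_share > ATTRIBUTION_THRESHOLD

# Mean string length, weighted by how often each value occurs.
mean_length = sum(len(value) * n for value, n in counts.items()) / total

print(top_value, second_value, round(ratio, 2), outstanding_no1)
print(top2_share, attribution)
print(round(mean_length, 4))
```

The ratio of about 1.84 and the mean length of 4.7048 agree with the values shown in the report.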