Insights

This section introduces the insights supported by DataPrep.

[1]:

%reload_ext autoreload
%autoreload 2
from dataprep.datasets import load_dataset
from dataprep.eda import plot, plot_correlation, plot_missing

[2]:

df = load_dataset("titanic")

[3]:

df

[3]:

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
...	...	...	...	...	...	...	...	...	...	...	...	...
886	887	0	2	Montvila, Rev. Juozas	male	27.0	0	0	211536	13.0000	NaN	S
887	888	1	1	Graham, Miss. Margaret Edith	female	19.0	0	0	112053	30.0000	B42	S
888	889	0	3	Johnston, Miss. Catherine Helen "Carrie"	female	NaN	1	2	W./C. 6607	23.4500	NaN	S
889	890	1	1	Behr, Mr. Karl Howell	male	26.0	0	0	111369	30.0000	C148	C
890	891	0	3	Dooley, Mr. Patrick	male	32.0	0	0	370376	7.7500	NaN	Q

891 rows × 12 columns

Where to find the insights

The following example shows where to find them:

1. Click the “Show Stats and Insights” button.
2. The insights provided by DataPrep are then displayed.

[Screenshot]

The insights provided by DataPrep appear in this section of the report.

[Screenshot]

When using the plot(df, col) function, click the following button instead:

[Screenshot]

The following insights are then displayed:

[Screenshot]

The insights provided by plot(df)

The table below lists the insights that plot(df) can provide, followed by an example.

insights	applied plots	type	threshold	description
Duplicates	Overview	int	1	Warn if the percent of duplicated values is above this threshold.
Negatives	Overview	int	1	Warn if the percent of negatives is above this threshold.
Similar_distribution	Overview	float	0.05	The significance level for the Kolmogorov–Smirnov test.
Uniform	Histogram	float	0.999	The p-value threshold for the chi-square test.
Missing	Histogram	int	1	Warn if the percent of missing values is above this threshold.
Skewed	Histogram	float	1e-5	The p-value threshold for scipy.stats.skewtest, which tests whether the skewness differs from that of a normal distribution.
Infinity	Histogram	int	1	Warn if the percent of infinite values is above this threshold.
Zeros	Histogram	int	5	Warn if the percent of zeros is above this threshold.
Normal	Histogram	float	0.99	The p-value threshold for the normality test, which is based on D’Agostino and Pearson’s test that combines skew and kurtosis to produce an omnibus test of normality.
High Cardinality	Bar Chart	int	50	The threshold for the unique value count; a count larger than the threshold yields high cardinality.
Constant	Bar Chart	int	1	The threshold for the unique value count; a count equal to the threshold yields a constant value.
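
To make the percent-based rows of the table concrete, the sketch below re-implements the Missing, Infinity, Negatives and Zeros checks in plain Python. It is an illustration only, not DataPrep's code: the helper names `is_missing`, `percent_of` and `percent_insights` are hypothetical, the percentages are assumed to be taken over all rows, and the sample list is synthetic (177 missing entries out of 891 rows, mirroring the counts reported for the Age column).

```python
import math

# Thresholds (in percent) copied from the table above.
THRESHOLDS = {"Missing": 1, "Infinity": 1, "Negatives": 1, "Zeros": 5}


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def percent_of(values, predicate):
    """Percentage of entries in `values` for which `predicate` holds."""
    if not values:
        return 0.0
    return 100.0 * sum(1 for v in values if predicate(v)) / len(values)


def percent_insights(values, thresholds=THRESHOLDS):
    """Return {insight name: percent} for every check above its threshold."""
    checks = {
        "Missing": is_missing,
        "Infinity": lambda v: not is_missing(v) and math.isinf(v),
        "Negatives": lambda v: not is_missing(v) and v < 0,
        "Zeros": lambda v: not is_missing(v) and v == 0,
    }
    measured = {name: percent_of(values, check) for name, check in checks.items()}
    return {name: pct for name, pct in measured.items() if pct > thresholds[name]}


# Synthetic stand-in for a column with 177 missing values out of 891 rows.
age_like = [None] * 177 + [30.0] * 714
for name, pct in percent_insights(age_like).items():
    print(f"{name}: {pct:.2f}% of rows")  # Missing: 19.87% of rows
```

Only the Missing check fires here, because the synthetic column contains no zeros, negatives or infinite values.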

[4]:

plot(df)

[4]:

DataPrep.EDA Report

Stats and Insights

Dataset Statistics

Number of Variables	12
Number of Rows	891
Missing Cells	866
Missing Cells (%)	8.1%
Duplicate Rows	0
Duplicate Rows (%)	0.0%
Total Size in Memory	315.0 KB
Average Row Size in Memory	362.1 B
Variable Types	Numerical: 3 Categorical: 9

Dataset Insights

PassengerId is uniformly distributed	Uniform
Age has 177 (19.87%) missing values	Missing
Cabin has 687 (77.1%) missing values	Missing
Fare is skewed	Skewed
Name has a high cardinality: 891 distinct values	High Cardinality
Ticket has a high cardinality: 681 distinct values	High Cardinality
Cabin has a high cardinality: 147 distinct values	High Cardinality
Survived has constant length 1	Constant Length
Pclass has constant length 1	Constant Length
SibSp has constant length 1	Constant Length

Parch has constant length 1	Constant Length
Embarked has constant length 1	Constant Length
Name has all distinct values	Unique

The insights provided by plot(df, col) when col is a continuous column

The table below lists the insights that plot(df, col) can provide when col is a continuous column, followed by an example.

insights	applied plots	type	threshold	description
Infinity	Stats	int	1	Warn if the percent of infinite values is above this threshold.
Missing	Stats	int	1	Warn if the percent of missing values is above this threshold.
Negatives	Stats	int	1	Warn if the percent of negatives is above this threshold.
Zeros	Stats	int	5	Warn if the percent of zeros is above this threshold.
Normal	Histogram, Normal Q-Q Plot	float	0.99	The p-value threshold for the normality test, which is based on D’Agostino and Pearson’s test that combines skew and kurtosis to produce an omnibus test of normality.
Uniform	Histogram	float	0.999	The p-value threshold for the chi-square test.
Skewed	Histogram	float	1e-5	The p-value threshold for scipy.stats.skewtest, which tests whether the skewness differs from that of a normal distribution.
Outliers	Box Plot	int	0	It shows how many outliers a column has.
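
The Skewed insight is reported together with the sample skewness γ1 (for example, "Age is skewed right (γ1 = 0.3883)" in the report below). As a rough illustration of what that number measures, the sketch below computes γ1 as the third central moment divided by the 1.5 power of the second central moment. The function name `sample_skewness` and the toy data are hypothetical, and the use of population (biased) moments is an assumption about how the reported value is defined. The p-value test named in the table is not reproduced here.

```python
def sample_skewness(values):
    """Skewness g1 = m3 / m2**1.5, using population (biased) central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]   # mirror image around 3
right_tailed = [1, 1, 1, 10]  # one large value stretches the right tail

print(sample_skewness(symmetric))         # 0.0
print(sample_skewness(right_tailed) > 0)  # True: positive means skewed right
```

A positive γ1 indicates a longer right tail and a negative γ1 a longer left tail, which is why the Age report describes its small positive value as "skewed right".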

[5]:

plot(df, "Age")

[5]:

DataPrep.EDA Report

Stats Histogram KDE Plot Normal Q-Q Plot Box Plot Value Table

Overview

Approximate Distinct Count	88
Approximate Unique (%)	12.3%
Missing	177
Missing (%)	19.9%
Infinite	0
Infinite (%)	0.0%
Memory Size	11.2 KB
Mean	29.6991
Minimum	0.42
Maximum	80
Zeros	0
Zeros (%)	0.0%
Negatives	0
Negatives (%)	0.0%

Quantile Statistics

Minimum	0.42
5-th Percentile	4
Q1	20.125
Median	28
Q3	38
95-th Percentile	56
Maximum	80
Range	79.58
IQR	17.875

Descriptive Statistics

Mean	29.6991
Standard Deviation	14.5265
Variance	211.0191
Sum	21205.17
Skewness	0.3883
Kurtosis	0.1686
Coefficient of Variation	0.4891

'hist.bins': 50

Number of bins in the histogram

'hist.yscale': 'linear'

Y-axis scale ("linear" or "log")

'hist.color': '#aec7e8'

Color

'height': 400

Height of the plot

'width': 450

Width of the plot

Age is skewed right (γ1 = 0.3883)

'kde.bins': 50

Number of bins in the histogram

'kde.yscale': 'linear'

Y-axis scale ("linear" or "log")

'kde.hist_color': '#aec7e8'

Color of the density histogram

'kde.line_color': '#d62728'

Color of the density line

'height': 400

Height of the plot

'width': 450

Width of the plot

'qqnorm.point_color': '#1f77b4'

Color of the points

'qqnorm.line_color': '#d62728'

Color of the line

'height': 400

Height of the plot

'width': 450

Width of the plot

'box.color': '#1f77b4'

Color

'height': 400

Height of the plot

'width': 450

Width of the plot

Age has 11 outliers

'value_table.ngroups': 10

The number of distinct values to show

Value	Count	Frequency (%)
24.0	30	3.4%
22.0	27	3.0%
18.0	26	2.9%
19.0	25	2.8%
28.0	25	2.8%
30.0	25	2.8%
21.0	24	2.7%
25.0	23	2.6%
36.0	22	2.5%
29.0	20	2.2%
Other values (78)	467	52.4%
(Missing)	177	19.9%
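
The Outliers insight belongs to the box plot. A common box-plot convention, assumed here rather than confirmed for DataPrep, flags values that lie more than 1.5 × IQR outside the quartiles. The sketch below applies that rule to the quartiles reported for Age above (Q1 = 20.125, Q3 = 38). The function names `iqr_fences` and `count_outliers`, the factor `k` and the short list of ages are hypothetical.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences; values outside them count as outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def count_outliers(values, q1, q3, k=1.5):
    """Count the values that fall strictly outside the fences."""
    lower, upper = iqr_fences(q1, q3, k)
    return sum(1 for v in values if v < lower or v > upper)


lower, upper = iqr_fences(20.125, 38)
print(lower, upper)  # -6.6875 64.8125

# Made-up ages, only to exercise the rule.
print(count_outliers([0.42, 28, 64, 65, 80], 20.125, 38))  # 2
```

Under this assumption, any age above 64.8125 counts as an outlier, while the lower fence is negative and so cannot be crossed by a valid age.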

The insights provided by plot(df, col) when col is a nominal column

The table below lists the insights that plot(df, col) can provide when col is a nominal column, followed by an example.

insights	applied plots	type	threshold	description
Constant	Stats	int	1	The threshold for the unique value count; a count equal to the threshold yields a constant value.
High_cardinality	Stats	int	50	The threshold for the unique value count; a count larger than the threshold yields high cardinality.
Missing	Stats	int	1	Warn if the percent of missing values is above this threshold.
Uniform	Bar Chart	float	0.999	The p-value threshold for the chi-square test.
Outstanding_no1	Bar Chart	float	1.5	The threshold for the ratio of the largest category count to the second-largest category count.
Attribution	Pie Chart	float	0.5	The threshold for the share of values taken by the top 2 categories.
High_word_cardinality	Word Cloud	int	1000	The threshold for the high word cardinality insight, which measures the number of words in that category.
Outstanding_no1_word	Word Cloud	int	0	The threshold for the outstanding no. 1 word insight, which measures the ratio of the most frequent word count to the second most frequent word count.
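
The count-based and ratio-based rows can be checked by hand from category counts. The sketch below does so for the Sex column, using the counts shown in its Value Table (577 male, 314 female). It is a sketch under assumptions: `category_insights` is a hypothetical helper, not a DataPrep function, and whether DataPrep uses strict comparisons against the thresholds is a guess.

```python
from collections import Counter


def category_insights(counts, no1_ratio=1.5, top2_share=0.5,
                      high_cardinality=50):
    """Evaluate a few nominal-column checks on a Counter of category counts."""
    if not counts:
        raise ValueError("need at least one category")
    total = sum(counts.values())
    ranked = counts.most_common()  # categories sorted by descending count
    top1 = ranked[0][1]
    top2 = ranked[1][1] if len(ranked) > 1 else 0
    ratio = top1 / top2 if top2 else float("inf")
    share = (top1 + top2) / total
    return {
        "distinct": len(counts),
        "constant": len(counts) == 1,
        "high_cardinality": len(counts) > high_cardinality,
        "no1_ratio": ratio,
        "outstanding_no1": ratio > no1_ratio,
        "top2_share": share,
        "attribution": share > top2_share,
    }


sex = Counter({"male": 577, "female": 314})
result = category_insights(sex)
print(round(result["no1_ratio"], 2))                     # 1.84
print(result["outstanding_no1"], result["attribution"])  # True True
```

The ratio 577 / 314 ≈ 1.84 exceeds the Outstanding_no1 threshold of 1.5, and the two categories together cover all rows, which is above the Attribution threshold of 0.5.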

[6]:

plot(df, "Sex")

[6]:

DataPrep.EDA Report

Stats Bar Chart Pie Chart Word Cloud Word Frequency Word Length Value Table

Overview

Approximate Distinct Count	2
Approximate Unique (%)	0.2%
Missing	0
Missing (%)	0.0%
Memory Size	60.7 KB

Length

Mean	4.7048
Standard Deviation	0.956
Median	4
Minimum	4
Maximum	6

Sample

1st row	male
2nd row	female
3rd row	female
4th row	female
5th row	male

Letter

Count	4192
Lowercase Letter	4192
Space Separator	0
Uppercase Letter	0
Dash Punctuation	0
Decimal Number	0

'bar.bars': 10

Maximum number of bars to display

'bar.sort_descending': True

Whether to sort the bars in descending order

'bar.yscale': 'linear'

Y-axis scale ("linear" or "log")

'bar.color': '#1f77b4'

Color

'height': 400

Height of the plot

'width': 450

Width of the plot

The largest value (male) is over 1.84 times larger than the second largest value (female)

'pie.slices': 10

Maximum number of pie slices to display

'pie.sort_descending': True

Whether to sort the slices in descending order of frequency

'pie.colors': ['#1f77b4', '#aec7e8']

List of colors

'height': 400

Height of the plot

'width': 450

Width of the plot

The top 2 categories (male, female) take over 50.0%

'wordcloud.top_words': 30

Maximum number of most frequent words to display

'wordcloud.stopword': True

Whether to remove stopwords

'wordcloud.lemmatize': False

Whether to lemmatize the words

'wordcloud.stem': False

Whether to apply Porter stemming to the words

'height': 400

Height of the plot

'width': 450

Width of the plot

'wordfreq.top_words': 30

Maximum number of most frequent words to display

'wordfreq.stopword': True

Whether to remove stopwords

'wordfreq.lemmatize': False

Whether to lemmatize the words

'wordfreq.stem': False

Whether to apply Porter stemming to the words

'wordfreq.color': '#1f77b4'

Color

'height': 400

Height of the plot

'width': 450

Width of the plot

The largest value (male) is over 1.84 times larger than the second largest value (female)

'wordlen.bins': 50

Number of bins in the histogram

'wordlen.yscale': 'linear'

Y-axis scale ("linear" or "log")

'wordlen.color': '#aec7e8'

Color

'height': 400

Height of the plot

'width': 450

Width of the plot

'value_table.ngroups': 10

The number of distinct values to show

Value	Count	Frequency (%)
male	577	64.8%
female	314	35.2%