dataprep.eda auxiliary modules

Data types

In this module lives the type tree.

class dataprep.eda.dtypes.Categorical[source]

Bases: dataprep.eda.dtypes.DType

Type Categorical

class dataprep.eda.dtypes.Continuous[source]

Bases: dataprep.eda.dtypes.Numerical

Type Continuous, Subtype of Numerical

class dataprep.eda.dtypes.DType[source]

Bases: object

Root of Type Tree

class dataprep.eda.dtypes.DateTime[source]

Bases: dataprep.eda.dtypes.Numerical

Type DateTime, Subtype of Numerical

class dataprep.eda.dtypes.Discrete[source]

Bases: dataprep.eda.dtypes.Numerical

Type Discrete, Subtype of Numerical

class dataprep.eda.dtypes.GeoGraphy[source]

Bases: dataprep.eda.dtypes.Categorical

Type GeoGraphy, Subtype of Categorical

class dataprep.eda.dtypes.GeoPoint[source]

Bases: dataprep.eda.dtypes.DType

Type GeoPoint

class dataprep.eda.dtypes.LatLong(lat_col, long_col)[source]

Bases: dataprep.eda.dtypes.GeoPoint

Type LatLong, a tuple of latitude and longitude column names

class dataprep.eda.dtypes.Nominal[source]

Bases: dataprep.eda.dtypes.Categorical

Type Nominal, Subtype of Categorical

class dataprep.eda.dtypes.Numerical[source]

Bases: dataprep.eda.dtypes.DType

Type Numerical

class dataprep.eda.dtypes.Ordinal[source]

Bases: dataprep.eda.dtypes.Categorical

Type Ordinal, Subtype of Categorical

class dataprep.eda.dtypes.Text[source]

Bases: dataprep.eda.dtypes.Nominal

Type Text, Subtype of Nominal
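The hierarchy above maps naturally onto Python inheritance. The following is a minimal, illustrative sketch (not the library's actual implementation) showing how a check in the spirit of is_dtype ("is dtype2 a dtype1?") can fall out of isinstance:

```python
# Illustrative sketch of the dtype tree (not dataprep's actual code):
# each type is a class, and "subtype of" is ordinary inheritance.

class DType:                 # root of the type tree
    pass

class Categorical(DType):
    pass

class Nominal(Categorical):
    pass

class Text(Nominal):
    pass

class Numerical(DType):
    pass

class Continuous(Numerical):
    pass

def is_dtype(dtype1, dtype2):
    """In the spirit of dtypes.is_dtype: is dtype2 a dtype1?"""
    return isinstance(dtype2, type(dtype1))

print(is_dtype(Categorical(), Text()))     # True: Text -> Nominal -> Categorical
print(is_dtype(Numerical(), Nominal()))    # False: Nominal is not Numerical
```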

dataprep.eda.dtypes.detect_dtype(col, known_dtype=None, detect_small_distinct=True)[source]

Given a column, detect its type, or map it to the type specified by the user

Parameters
  • col (dask.dataframe.Series) – A dataframe column

  • known_dtype (Optional[Union[Dict[str, Union[DType, str]], DType]], default None) – A dictionary or a single DType given by the user to specify the types for designated columns or for all columns. E.g. known_dtype = {“a”: Continuous, “b”: “Nominal”}, known_dtype = {“a”: Continuous(), “b”: “nominal”}, known_dtype = Continuous(), or known_dtype = “Continuous”

  • detect_small_distinct (bool, default True) – Whether to detect numerical columns with a small number of distinct values as categorical columns.

Return type

DType

dataprep.eda.dtypes.detect_without_known(col, detect_small_distinct)[source]

Detect the dtype of a column when the user did not specify one.

Return type

DType

dataprep.eda.dtypes.drop_null(var)[source]

Drop the null values (specified in NULL_VALUES) from a series or DataFrame

Return type

Union[Series, DataFrame] (either pandas or Dask)

dataprep.eda.dtypes.get_dtype_cnts_and_num_cols(df, dtype)[source]

Get the count of each dtype in a dataframe and the list of numerical column names

Return type

Tuple[Dict[str, int], List[str]]

dataprep.eda.dtypes.is_continuous(dtype)[source]

Given a type, return if that type is a continuous type

Return type

bool

dataprep.eda.dtypes.is_datetime(dtype)[source]

Given a type, return if that type is a datetime type

Return type

bool

dataprep.eda.dtypes.is_dtype(dtype1, dtype2)[source]

Check whether dtype2 is dtype1 (or a subtype of it).

Return type

bool

dataprep.eda.dtypes.is_geography(col)[source]

Given a column, return if its type is a geography type

Return type

bool

dataprep.eda.dtypes.is_geopoint(col)[source]

Given a column, return if its type is a geopoint type

Return type

bool

dataprep.eda.dtypes.is_nominal(dtype)[source]

Given a type, return if that type is a nominal type

Return type

bool

dataprep.eda.dtypes.is_pandas_categorical(dtype)[source]

Detect if a dtype is categorical and from pandas.

Return type

bool

dataprep.eda.dtypes.map_dtype(dtype)[source]

Currently the type system is kept flat: Categorical() is mapped to Nominal() and Numerical() to Continuous()

Return type

DType

dataprep.eda.dtypes.normalize_dtype(dtype_repr)[source]

Normalize a dtype repr (a string name, a DType class, or a DType instance) into a DType instance.

Return type

DType
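A normalizer like this accepts several representations of the same type, as the detect_dtype examples above suggest. The following is a hedged sketch of that behavior (not the library's code):

```python
# Hedged sketch of normalizing a dtype repr (not the library's code):
# accept a string name, a DType subclass, or a DType instance,
# and always return an instance.

class DType: ...
class Continuous(DType): ...
class Nominal(DType): ...

_BY_NAME = {"continuous": Continuous, "nominal": Nominal}

def normalize_dtype(dtype_repr):
    if isinstance(dtype_repr, str):
        return _BY_NAME[dtype_repr.lower()]()   # "Continuous" -> Continuous()
    if isinstance(dtype_repr, type) and issubclass(dtype_repr, DType):
        return dtype_repr()                     # Continuous -> Continuous()
    if isinstance(dtype_repr, DType):
        return dtype_repr                       # Continuous() passes through
    raise ValueError(f"cannot normalize {dtype_repr!r}")

print(type(normalize_dtype("Continuous")).__name__)   # Continuous
```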

Intermediate

Intermediate class

class dataprep.eda.intermediate.ColumnMetadata(meta)[source]

Bases: object

Container for storing a single column’s metadata. This is immutable.

metadata: pandas.core.series.Series
class dataprep.eda.intermediate.ColumnsMetadata[source]

Bases: object

Container for storing each column’s metadata.

metadata: pandas.core.frame.DataFrame
class dataprep.eda.intermediate.Intermediate(*args, **kwargs)[source]

Bases: Dict[str, Any]

This class contains intermediate results.

save(path=None)[source]

Save the intermediate to disk.

Parameters
  • path (Optional[str], default None) – The path to save the intermediate to; if None, it is saved as “intermediate” in the current working directory.

Return type

None

visual_type: str

Palette

This file defines palettes used for EDA.

Container

This module implements the Container class.

class dataprep.eda.container.Container(to_render, visual_type, cfg)[source]

Bases: object

This class creates a customized Container object for the plot* functions.

save(filename)[source]

Save the rendered report to the given filename.

Return type

None

show()[source]

Render the report. This is useful when calling plot in a for loop.

Return type

None

show_browser()[source]

Open the plot in the browser. This is useful when plotting from a terminal or when the figure is too large to display in a notebook.

Return type

None

class dataprep.eda.container.Context(**param)[source]

Bases: object

Define the context class that stores all the parameters needed by the template engine. The instance is read-only.

Since we use the same template to render different components without strict evaluation, when the engine tries to read an attribute from a Context object it gets None if the attribute doesn’t exist, so rendering keeps going instead of being interrupted.

Here we override __getitem__() and __getattr__() to achieve this; it also lets the object be accessed as key-value pairs, which works the same as accessing its attributes.

Utils

Miscellaneous functions

dataprep.eda.utils.cut_long_name(name, max_len=18)[source]

If the name is longer than max_len, cut it to max_len characters and append “…”.

Return type

str

dataprep.eda.utils.fuse_missing_perc(name, perc)[source]

Append (x.y%) to the name if perc is not 0.

Return type

str
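Both helpers are small enough to sketch directly. The following is a behavioral illustration inferred from the descriptions above, not the library's exact code (the truncation marker and percentage format are assumptions):

```python
def cut_long_name(name: str, max_len: int = 18) -> str:
    """Truncate a long name and append an ellipsis (behavior inferred)."""
    return name if len(name) <= max_len else name[:max_len] + "..."

def fuse_missing_perc(name: str, perc: float) -> str:
    """Append (x.y%) to the name if perc is not 0 (format assumed)."""
    return name if perc == 0 else f"{name} ({perc:.1f}%)"

print(cut_long_name("a_very_long_column_name_indeed"))
print(fuse_missing_perc("price", 12.34))   # price (12.3%)
```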

dataprep.eda.utils.preprocess_dataframe(org_df, used_columns=None, excluded_columns=None, detect_small_distinct=True)[source]

Make a dask dataframe with only used_columns. This function will do the following:

  1. Keep only used_columns.

  2. Transform column names to strings (to avoid object column names) and rename duplicate column names in the form {col}_{id}.

  3. Reset the index.

  4. Transform object columns to string columns (note that an object column can contain cells of different types).

  5. Transform to a dask dataframe if the input is a pandas dataframe.

Parameters
  • org_df (dataframe) – the original dataframe

  • used_columns (optional list[str], default None) – used columns in org_df

  • excluded_columns (optional list[str], default None) – excluded columns from used_columns, mainly used for geo point data processing.

  • detect_small_distinct (bool, default True) – whether to detect numerical columns with a small number of distinct values as categorical columns.

Return type

DataFrame
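Step 2 above (stringify names and rename duplicates in the form {col}_{id}) can be sketched in plain Python; the exact numbering scheme below is an assumption based on the description, not taken from the library:

```python
def rename_duplicates(columns):
    """Stringify column names and suffix repeats as {col}_{id} (scheme assumed)."""
    seen = {}
    out = []
    for col in map(str, columns):   # step 2 also stringifies names
        if col in seen:
            seen[col] += 1
            col = f"{col}_{seen[col]}"
        else:
            seen[col] = 0
        out.append(col)
    return out

print(rename_duplicates(["a", "b", "a", "a", 1]))
# ['a', 'b', 'a_1', 'a_2', '1']
```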

dataprep.eda.utils.relocate_legend(fig, loc)[source]

Relocate legend(s) from center to loc.

Return type

Figure

dataprep.eda.utils.sample_n(arr, n)[source]

Sample n values evenly spaced over the range of arr, not from the distribution of arr’s elements.

Return type

ndarray
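The distinction (evenly spaced over the range, rather than drawn from the data's distribution) shows up clearly in a minimal pure-Python sketch; the real function returns an ndarray, and this illustration only assumes the documented behavior:

```python
def sample_n(arr, n):
    """Pick n evenly spaced values between min(arr) and max(arr) (sketch)."""
    lo, hi = min(arr), max(arr)
    if n == 1:
        return [lo]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

# The values come from the range [1, 9], ignoring how the elements
# themselves are distributed.
print(sample_n([3, 9, 1, 7], 5))   # [1.0, 3.0, 5.0, 7.0, 9.0]
```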

dataprep.eda.utils.to_dask(df)[source]

Convert a dataframe to a dask dataframe.

Return type

DataFrame

dataprep.eda.utils.tweak_figure(fig, ptype=None, show_yticks=False, max_lbl_len=15)[source]

Set some common attributes for a figure

Return type

None

Config

Parameter configurations

This file contains configurations for stats, auto-insights and plots. There are two main settings, “display” and “config”. Display is a list of tab names that controls which tabs to show. Config is a dictionary of customizable parameters and their values.

There are two types of parameters, global and local. Local parameters are plot-specific, and their names are separated by “.”: the portion before the first “.” is the plot name and the portion after it is the parameter name, e.g. “hist.bins”. The “.” is also used when the parameter name itself contains more than one word, e.g. “insight.duplicates.threshold”. In the codebase, however, the “.” is replaced with “__” for such long parameter names, e.g. “insight.duplicates__threshold”. A global parameter is a single word and applies to all plots that have that parameter, e.g. “bins”: 50 applies to “hist.bins”, “line.bins”, “kde.bins”, “wordlen.bins” and “box.bins”. When a user specifies both a global and a local parameter in config, the global parameter is overridden by the local one for that plot.
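The naming and precedence rules above can be illustrated with a small sketch; the helper below and its two-pass structure are hypothetical, not dataprep's actual parser:

```python
# Hypothetical sketch of resolving global vs. local config parameters;
# names and structure are assumptions, not dataprep's actual code.

PLOTS_WITH_BINS = ["hist", "line", "kde", "wordlen", "box"]

def resolve(config):
    resolved = {}
    # Pass 1: global parameters (single word, no ".") fan out to every
    # plot that has that parameter.
    for key, value in config.items():
        if "." not in key:
            for plot in PLOTS_WITH_BINS:
                resolved[f"{plot}.{key}"] = value
    # Pass 2: local "plot.param" entries override the globals. Long names
    # like "insight.duplicates.threshold" use "__" internally.
    for key, value in config.items():
        if "." in key:
            plot, param = key.split(".", 1)
            resolved[f"{plot}.{param.replace('.', '__')}"] = value
    return resolved

cfg = resolve({"bins": 50, "hist.bins": 20, "insight.duplicates.threshold": 2})
print(cfg["hist.bins"], cfg["line.bins"], cfg["insight.duplicates__threshold"])
# 20 50 2
```

Here the global “bins”: 50 applies to every bin-taking plot, but “hist.bins”: 20 wins for the histogram, matching the override rule described above.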

class dataprep.eda.configs.Bar(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

bars: int, default 10

Maximum number of bars to display

sort_descending: bool, default True

Whether to sort the bars in descending order

yscale: str, default “linear”

Y-axis scale (“linear” or “log”)

color: str, default “#1f77b4”

Color of the bar chart

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

bars: int
color: str
enable: bool
grid_how_to_guide()[source]

how-to guide for plot(df)

Return type

List[Tuple[str, str]]

height: Optional[int]
how_to_guide(height, width)[source]

how-to guide for plot(df, x)

Return type

List[Tuple[str, str]]

missing_how_to_guide(height, width)[source]

how-to guide for plot_missing(df, x, [y])

Return type

List[Tuple[str, str]]

sort_descending: bool
width: Optional[int]
yscale: str
class dataprep.eda.configs.Box(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

ngroups: int, default 15

Maximum number of groups to display

bins: int, default 50

Number of bins

unit: str, default “auto”

Defines the time unit to group values over for a datetime column. It can be “year”, “quarter”, “month”, “week”, “day”, “hour”, “minute”, “second”. With default value “auto”, it will use the time unit such that the resulting number of groups is closest to 15

sort_descending: bool, default True

Whether to sort the boxes in descending order of frequency

color: str, default “#d62728”

Color of the box_plot

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

bins: int
color: str
enable: bool
height: Optional[int]
ngroups: int
nom_cont_how_to_guide(height, width)[source]

how-to guide for plot(df, nominal, continuous)

Return type

List[Tuple[str, str]]

sort_descending: bool
two_cont_how_to_guide(height, width)[source]

how-to guide for plot(df, continuous, continuous)

Return type

List[Tuple[str, str]]

unit: str
univar_how_to_guide(height, width)[source]

how-to guide for plot(df, x)

Return type

List[Tuple[str, str]]

width: Optional[int]
class dataprep.eda.configs.CDF(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

sample_size: int

Number of evenly spaced samples between the minimum and maximum values to compute the cdf at

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

enable: bool
height: Optional[int]
how_to_guide(height, width)[source]

how-to guide

Return type

List[Tuple[str, str]]

sample_size: int
width: Optional[int]
class dataprep.eda.configs.Config(**data)[source]

Bases: pydantic.main.BaseModel

Configuration class

bar: dataprep.eda.configs.Bar
box: dataprep.eda.configs.Box
cdf: dataprep.eda.configs.CDF
correlations: dataprep.eda.configs.Correlations
dendro: dataprep.eda.configs.Dendrogram
diff: dataprep.eda.configs.Diff
classmethod from_dict(display=None, config=None)[source]

Convert a dictionary into a Config instance

Return type

Config

heatmap: dataprep.eda.configs.Heatmap
hexbin: dataprep.eda.configs.Hexbin
hist: dataprep.eda.configs.Hist
insight: dataprep.eda.configs.Insight
interactions: dataprep.eda.configs.Interactions
kde: dataprep.eda.configs.KDE
kendall: dataprep.eda.configs.KendallTau
line: dataprep.eda.configs.Line
missingvalues: dataprep.eda.configs.MissingValues
nested: dataprep.eda.configs.Nested
overview: dataprep.eda.configs.Overview
pdf: dataprep.eda.configs.PDF
pearson: dataprep.eda.configs.Pearson
pie: dataprep.eda.configs.Pie
plot: dataprep.eda.configs.Plot
qqnorm: dataprep.eda.configs.QQNorm
scatter: dataprep.eda.configs.Scatter
spearman: dataprep.eda.configs.Spearman
spectrum: dataprep.eda.configs.Spectrum
stacked: dataprep.eda.configs.Stacked
stats: dataprep.eda.configs.Stats
value_table: dataprep.eda.configs.ValueTable
variables: dataprep.eda.configs.Variables
wordcloud: dataprep.eda.configs.WordCloud
wordfreq: dataprep.eda.configs.WordFrequency
wordlen: dataprep.eda.configs.WordLength
class dataprep.eda.configs.Correlations(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

value_range

If a correlation value is out of this range, don’t show it.

k

Choose the top-k elements

enable: bool
how_to_guide()[source]

how-to guide for plot_correlation(df, x)

Return type

List[Tuple[str, str]]

k: Optional[int]
value_range: Optional[Tuple[float, float]]
class dataprep.eda.configs.Dendrogram(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

enable: bool
height: Optional[int]
how_to_guide(height, width)[source]

how-to guide

Return type

List[Tuple[str, str]]

width: Optional[int]
class dataprep.eda.configs.Diff(**data)[source]

Bases: pydantic.main.BaseModel

Define the parameters in the plot_diff

baseline: int
density: bool
label: Optional[List[str]]
class dataprep.eda.configs.Heatmap(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

ngroups: int, default 10

Maximum number of most frequent values from the first column to display

nsubgroups: int, default 5

Maximum number of most frequent values from the second column to display (computed on the filtered data consisting of the most frequent values from the first column)

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

enable: bool
height: Optional[int]
how_to_guide(x, y, height, width)[source]

how-to guide for plot(df, nominal, nominal)

Return type

List[Tuple[str, str]]

missing_how_to_guide(height, width)[source]

how-to guide for plot_missing(df)

Return type

List[Tuple[str, str]]

ngroups: int
nsubgroups: int
width: Optional[int]
class dataprep.eda.configs.Hexbin(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

tile_size: float, default “auto”

The size of the tile in the hexbin plot. Measured from the middle of a hexagon to its left or right corner.

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

enable: bool
height: Optional[int]
how_to_guide(tile_size, height, width)[source]

how-to guide

Return type

List[Tuple[str, str]]

tile_size: str
width: Optional[int]
class dataprep.eda.configs.Hist(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

bins: int, default 50

Number of bins in the histogram

yscale: str, default “linear”

Y-axis scale (“linear” or “log”)

color: str, default “#aec7e8”

Color of the histogram

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

bins: int
color: str
enable: bool
grid_how_to_guide()[source]

how-to guide for plot(df)

Return type

List[Tuple[str, str]]

height: Optional[int]
how_to_guide(height, width)[source]

how-to guide for plot(df, x)

Return type

List[Tuple[str, str]]

width: Optional[int]
yscale: str
class dataprep.eda.configs.Insight(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

duplicates__threshold: int, default 1

Warn if the percent of duplicated values is above this threshold

similar_distribution__threshold: float, default 0.05

The significance level for Kolmogorov–Smirnov test

uniform__threshold: float, default 0.999

The p-value threshold for chi-square test

missing__threshold: int, default 1

Warn if the percent of missing values is above this threshold

skewed__threshold: float, default 1e-5

The p-value threshold for scipy’s skewtest, which tests whether the skewness differs from that of a normal distribution

infinity__threshold: int, default 1

Warn if the percent of infinites is above this threshold

zeros__threshold: int, default 5

Warn if the percent of zeros is above this threshold

negatives__threshold: int, default 1

Warn if the percent of negatives is above this threshold

normal__threshold: float, default 0.99

The p-value threshold for the normality test; it is based on D’Agostino and Pearson’s test, which combines skew and kurtosis to produce an omnibus test of normality

high_cardinality__threshold: int, default 50

The threshold for the unique-value count; a count larger than the threshold yields a high-cardinality insight

constant__threshold: int, default 1

The threshold for the unique-value count; a count equal to the threshold yields a constant-value insight

outstanding_no1__threshold: float, default 1.5

The threshold for outstanding no1 insight, measures the ratio of the largest category count to the second-largest category count

attribution__threshold: float, default 0.5

The threshold for the attribution insight, measures the percentage of the top 2 categories

high_word_cardinality__threshold: int, default 1000

The threshold for the high word cardinality insight, which measures the number of words in that category

outstanding_no1_word__threshold: int, default 0

The threshold for the outstanding no1 word insight, which measures the ratio of the most frequent word count to the second most frequent word count

outlier__threshold: int, default 0

The threshold for the outlier count in the box plot

attribution__threshold: float
constant__threshold: int
duplicates__threshold: int
enable: bool
high_cardinality__threshold: int
high_word_cardinality__threshold: int
infinity__threshold: int
missing__threshold: int
negatives__threshold: int
normal__threshold: float
outlier__threshold: int
outstanding_no1__threshold: float
outstanding_no1_word__threshold: float
similar_distribution__threshold: float
skewed__threshold: float
uniform__threshold: float
zeros__threshold: int
class dataprep.eda.configs.Interactions(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

enable: bool
class dataprep.eda.configs.KDE(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

bins: int, default 50

Number of bins in the histogram

yscale: str, default “linear”

Y-axis scale (“linear” or “log”)

hist_color: str, default “#aec7e8”

Color of the density histogram

line_color: str, default “#d62728”

Color of the density line

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

bins: int
enable: bool
height: Optional[int]
hist_color: str
how_to_guide(height, width)[source]

how-to guide for plot(df, x)

Return type

List[Tuple[str, str]]

line_color: str
width: Optional[int]
yscale: str
class dataprep.eda.configs.KendallTau(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

enable: bool
height: Optional[int]
how_to_guide(height, width)[source]

how-to guide

Return type

List[Tuple[str, str]]

width: Optional[int]
class dataprep.eda.configs.Line(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

bins: int, default 50

Number of bins

ngroups: int, default 10

Maximum number of groups to display

sort_descending: bool, default True

Whether to sort the groups in descending order of frequency

yscale: str, default “linear”

The scale to show on the y axis. Can be “linear” or “log”.

unit: str, default “auto”

Defines the time unit to group values over for a datetime column. It can be “year”, “quarter”, “month”, “week”, “day”, “hour”, “minute”, “second”. With default value “auto”, it will use the time unit such that the resulting number of groups is closest to 15

agg: str, default “mean”

Specify the aggregate to use when aggregating over a numeric column

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

agg: str
bins: int
enable: bool
height: Optional[int]
ngroups: int
nom_cont_how_to_guide(height, width)[source]

how-to guide for plot(df, nominal, continuous)

Return type

List[Tuple[str, str]]

sort_descending: bool
unit: str
width: Optional[int]
yscale: str
class dataprep.eda.configs.MissingValues(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

enable: bool
class dataprep.eda.configs.Nested(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

ngroups: int, default 10

Maximum number of most frequent values from the first column to display

nsubgroups: int, default 5

Maximum number of most frequent values from the second column to display (computed on the filtered data consisting of the most frequent values from the first column)

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

enable: bool
height: Optional[int]
how_to_guide(x, y, height, width)[source]

how-to guide

Return type

List[Tuple[str, str]]

ngroups: int
nsubgroups: int
width: Optional[int]
class dataprep.eda.configs.Overview(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

enable: bool
class dataprep.eda.configs.PDF(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

sample_size: int, default 100

Number of evenly spaced samples between the minimum and maximum values to compute the pdf at

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

enable: bool
height: Optional[int]
how_to_guide(height, width)[source]

how-to guide

Return type

List[Tuple[str, str]]

sample_size: int
width: Optional[int]
class dataprep.eda.configs.Pearson(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

enable: bool
height: Optional[int]
how_to_guide(height, width)[source]

how-to guide

Return type

List[Tuple[str, str]]

width: Optional[int]
class dataprep.eda.configs.Pie(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

slices: int, default 10

Maximum number of pie slices to display

sort_descending: bool, default True

Whether to sort the slices in descending order of frequency

colors: Optional[List[str]], default None

List of colors

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

colors: Optional[List[str]]
enable: bool
height: Optional[int]
how_to_guide(color_list, height, width)[source]

how-to guide for plot(df, x)

Return type

List[Tuple[str, str]]

slices: int
sort_descending: bool
width: Optional[int]
class dataprep.eda.configs.Plot(**data)[source]

Bases: pydantic.main.BaseModel

Class containing global parameters for the plots

bins: Optional[int]
height: Optional[int]
ngroups: Optional[int]
report: bool
width: Optional[int]
class dataprep.eda.configs.QQNorm(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

point_color: str, default “#1f77b4”

Color of the points

line_color: str, default “#d62728”

Color of the line

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

enable: bool
height: Optional[int]
how_to_guide(height, width)[source]

how-to guide for plot(df, x)

Return type

List[Tuple[str, str]]

line_color: str
point_color: str
width: Optional[int]
class dataprep.eda.configs.Scatter[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

sample_size: int, optional, default=1000

Number of points to randomly sample per partition. Cannot be used with sample_rate.

sample_rate: float, optional, default None

Sample rate per partition. Cannot be used with sample_size. Set it to 1.0 for no sampling.

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

enable: bool
height: Optional[int]
how_to_guide(height, width)[source]

how-to guide

Return type

List[Tuple[str, str]]

sample_rate: Optional[float]
sample_size: Optional[int]
width: Optional[int]
class dataprep.eda.configs.Spearman(**data)[source]

Bases: pydantic.main.BaseModel

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

enable: bool
height: Optional[int]
how_to_guide(height, width)[source]

how-to guide

Return type

List[Tuple[str, str]]

width: Optional[int]
class dataprep.eda.configs.Spectrum(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

bins: int, default 20

Number of bins

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

bins: int
enable: bool
height: Optional[int]
how_to_guide(height, width)[source]

how-to guide

Return type

List[Tuple[str, str]]

width: Optional[int]
class dataprep.eda.configs.Stacked(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

ngroups: int, default 10

Maximum number of most frequent values from the first column to display

nsubgroups: int, default 5

Maximum number of most frequent values from the second column to display (computed on the filtered data consisting of the most frequent values from the first column)

unit: str, default “auto”

Defines the time unit to group values over for a datetime column. It can be “year”, “quarter”, “month”, “week”, “day”, “hour”, “minute”, “second”. With default value “auto”, it will use the time unit such that the resulting number of groups is closest to 15

sort_descending: bool, default True

Whether to sort the groups in descending order of frequency

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

enable: bool
height: Optional[int]
how_to_guide(x, y, height, width)[source]

how-to guide

Return type

List[Tuple[str, str]]

ngroups: int
nsubgroups: int
sort_descending: bool
unit: str
width: Optional[int]
class dataprep.eda.configs.Stats(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to display the stats section

enable: bool
class dataprep.eda.configs.ValueTable(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

ngroups: int, default 10

Number of values to show in the table

enable: bool
how_to_guide()[source]

how-to guide for plot(df, x)

Return type

List[Tuple[str, str]]

ngroups: int
class dataprep.eda.configs.Variables(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

enable: bool
class dataprep.eda.configs.WordCloud(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

top_words: int, default 30

Maximum number of most frequent words to display

stopword: bool, default True

Whether to remove stopwords

lemmatize: bool, default False

Whether to lemmatize the words

stem: bool, default False

Whether to apply Porter stemming to the words

enable: bool
height: Optional[int]
how_to_guide(height, width)[source]

how-to guide for plot(df, x)

Return type

List[Tuple[str, str]]

lemmatize: bool
stem: bool
stopword: bool
top_words: int
width: Optional[int]
class dataprep.eda.configs.WordFrequency(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

top_words: int, default 30

Maximum number of most frequent words to display

stopword: bool, default True

Whether to remove stopwords

lemmatize: bool, default False

Whether to lemmatize the words

stem: bool, default False

Whether to apply Porter stemming to the words

color: str, default “#1f77b4”

Color of the bar chart

color: str
enable: bool
height: Optional[int]
how_to_guide(height, width)[source]

how-to guide for plot(df, x)

Return type

List[Tuple[str, str]]

lemmatize: bool
stem: bool
stopword: bool
top_words: int
width: Optional[int]
class dataprep.eda.configs.WordLength(**data)[source]

Bases: pydantic.main.BaseModel

enable: bool, default True

Whether to create this element

bins: int, default 50

Number of bins in the histogram

yscale: str, default “linear”

Y-axis scale (“linear” or “log”)

color: str, default “#aec7e8”

Color of the histogram

height: int, default “auto”

Height of the plot

width: int, default “auto”

Width of the plot

bins: int
color: str
enable: bool
height: Optional[int]
how_to_guide(height, width)[source]

how-to guide for plot(df, x)

Return type

List[Tuple[str, str]]

width: Optional[int]
yscale: str