dataprep.eda.diff

plot_diff

dataprep.eda.diff.plot_diff(df, x=None, config=None, display=None, dtype=None, progress=False)

Compute and visualize the differences between two or more (up to five) datasets.

Parameters
  • df (Union[List[Union[DataFrame, DataFrame]], DataFrame, DataFrame]) – The DataFrame(s) to be compared.

  • x (Optional[str]) – The column to be emphasized in the comparison.

  • config (Optional[Dict[str, Any]]) – A dictionary for configuring the visualizations. E.g. config={"hist.bins": 20}

  • display (Optional[List[str]]) – A list containing the names of the visualizations to display. E.g. display=["Histogram"]

  • dtype (str or DType or dict of str or dict of DType, default None) – Specify data types for a designated column or for all columns. E.g. dtype = {"a": Continuous, "b": "Nominal"} or dtype = {"a": Continuous(), "b": "nominal"} or dtype = Continuous() or dtype = "Continuous".

  • progress (bool) – Whether to show the progress bar.

Examples

>>> from dataprep.datasets import load_dataset
>>> from dataprep.eda import plot_diff
>>> df_train = load_dataset('house_prices_train')
>>> df_test = load_dataset('house_prices_test')
>>> plot_diff([df_train, df_test])

Return type

Container
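
As a minimal sketch of how the optional arguments might be combined, the helper below checks the documented limit of two to five datasets before delegating to plot_diff. The name checked_plot_diff is hypothetical and not part of dataprep, and whether plot_diff enforces the same limit itself is not stated here.

```python
from typing import Any, Dict, List, Optional, Sequence


def checked_plot_diff(
    frames: Sequence[Any],
    x: Optional[str] = None,
    config: Optional[Dict[str, Any]] = None,
    display: Optional[List[str]] = None,
):
    """Hypothetical helper: validate the dataset count, then call plot_diff."""
    # The description above states that two to five datasets can be compared.
    if not 2 <= len(frames) <= 5:
        raise ValueError(f"expected 2 to 5 datasets, got {len(frames)}")

    # Imported lazily so the count check works even without dataprep installed.
    from dataprep.eda import plot_diff

    return plot_diff(list(frames), x=x, config=config, display=display)
```

With the DataFrames from the example above, a call could look like checked_plot_diff([df_train, df_test], config={"hist.bins": 20}, display=["Histogram"]).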

compute_diff

Computations for plot_diff([df…]).

dataprep.eda.diff.compute.compute_diff(df, x=None, *, cfg=None, display=None, dtype=None)

All-in-one compute function.

Parameters
  • df (Union[List[Union[DataFrame, DataFrame]], DataFrame, DataFrame]) – The DataFrame(s) from which visualizations are generated.

  • cfg (Union[Config, Dict[str, Any], None], default None) – When a user calls plot(), the created Config object is passed to compute(). When a user calls compute() directly and wants to customize the output, cfg is a configuration dictionary. Otherwise, cfg is None and default parameter values are used.

  • display (Optional[List[str]], default None) – A list containing the names of the visualizations to display. Applies only when a user calls compute() directly and wants to customize the output.

  • x (Optional[str], default None) – A valid column name from the dataframe

  • dtype (str or DType or dict of str or dict of DType, default None) – Specify data types for a designated column or for all columns. E.g. dtype = {"a": Continuous, "b": "Nominal"} or dtype = {"a": Continuous(), "b": "nominal"} or dtype = Continuous() or dtype = "Continuous".

Return type

Intermediate
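
The signature above makes cfg, display and dtype keyword-only, while x may be passed positionally. The sketch below illustrates one way to assemble those keyword arguments and pass them on. Both helper names (diff_options, intermediate_for) are hypothetical, and passing display together with a dictionary cfg is an assumption based on the parameter descriptions, not a confirmed behavior.

```python
from typing import Any, Dict, List, Optional, Sequence


def diff_options(
    bins: Optional[int] = None,
    display: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: collect keyword-only arguments for compute_diff."""
    # Leave cfg as None when nothing is customized, so default values are used.
    cfg = {"hist.bins": bins} if bins is not None else None
    return {"cfg": cfg, "display": display}


def intermediate_for(frames: Sequence[Any], x: Optional[str] = None, **options: Any):
    """Hypothetical helper: call compute_diff and return its Intermediate."""
    # Imported lazily so diff_options can be used without dataprep installed.
    from dataprep.eda.diff.compute import compute_diff

    # x may be positional; cfg, display and dtype must be passed by keyword.
    return compute_diff(list(frames), x, **options)


options = diff_options(bins=20, display=["Histogram"])
print(options)  # {'cfg': {'hist.bins': 20}, 'display': ['Histogram']}
```

A call such as intermediate_for([df_train, df_test], **options) would then forward the options to compute_diff.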

render_diff

This module implements the visualization for the plot_diff function.

dataprep.eda.diff.render.render_diff(itmdt, cfg)

Render a basic plot.

Parameters
  • itmdt (Intermediate) – The Intermediate containing results from the compute function.

  • cfg (Config) – Config instance

Return type

Dict[str, Any]
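
Because render_diff expects a Config instance rather than a dictionary, the compute and render steps can be chained roughly as sketched below. This is an illustration under assumptions: the helper name render_diff_layout is hypothetical, and the Config import path and its from_dict constructor are not documented on this page and should be checked against the installed dataprep version.

```python
from typing import Any, Dict, List, Optional, Sequence


def render_diff_layout(
    frames: Sequence[Any],
    config: Optional[Dict[str, Any]] = None,
    display: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: run compute_diff, then render_diff."""
    # Lazy imports: dataprep is only needed when the function is called.
    # Assumption: Config lives in dataprep.eda.configs and offers from_dict.
    from dataprep.eda.configs import Config
    from dataprep.eda.diff.compute import compute_diff
    from dataprep.eda.diff.render import render_diff

    cfg = Config.from_dict(display, config)  # assumed constructor
    # Reuse the same Config for both steps so they agree on the settings.
    itmdt = compute_diff(list(frames), cfg=cfg)
    return render_diff(itmdt, cfg)
```

The returned dictionary is the render_diff output described above. Wrapping it for display, as plot_diff does with its Container return value, is left out of this sketch.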