dataprep.data_connector

Configuration Manager

Functions for downloading and maintaining config files.

dataprep.connector.config_manager.config_directory()[source]

Returns the config directory path

Return type

Path

dataprep.connector.config_manager.download_config(impdb, branch)[source]

Download the config from Github into the temp directory.

Return type

None

dataprep.connector.config_manager.ensure_config(impdb, branch, update)[source]

Ensure the config for impdb is downloaded

Return type

bool

dataprep.connector.config_manager.get_git_branch_hash(branch)[source]

Get the current hash of the config files repo.

Return type

str

dataprep.connector.config_manager.initialize_path(config_path, update)[source]

Determines if the given config_path is local or in GitHub. Fetches the full path.

Return type

Path

dataprep.connector.config_manager.is_obsolete(impdb, branch)[source]

Test if the implicit db config files are obsolete and need to be re-downloaded.

Return type

bool
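
The obsolescence check described above can be sketched as a comparison between a hash stored next to the cached config and the hash reported for the branch. This is a minimal illustration, not the library's implementation: the function name is_obsolete_sketch, the directory layout, the "_hash" marker file and the injected remote_hash argument are all assumptions made for the example.

```python
import tempfile
from pathlib import Path


def is_obsolete_sketch(config_dir: Path, impdb: str, branch: str, remote_hash: str) -> bool:
    """Return True when the cached config should be downloaded again."""
    # Assumed layout: <config_dir>/<branch>/<impdb>/_hash (hypothetical).
    hash_file = config_dir / branch / impdb / "_hash"
    if not hash_file.exists():
        # Nothing cached yet, or the cache is incomplete.
        return True
    # Obsolete when the cached hash differs from the repository's current hash.
    return hash_file.read_text().strip() != remote_hash


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # No cache exists yet, so the config counts as obsolete.
    first_check = is_obsolete_sketch(root, "yelp", "master", "abc123")
    # Simulate a download that records the hash it was fetched at.
    (root / "master" / "yelp").mkdir(parents=True)
    (root / "master" / "yelp" / "_hash").write_text("abc123")
    second_check = is_obsolete_sketch(root, "yelp", "master", "abc123")
    # The remote branch has moved on, so the cache is stale again.
    third_check = is_obsolete_sketch(root, "yelp", "master", "def456")
```

In the real module the remote hash would presumably come from get_git_branch_hash(branch); it is passed in here so that the sketch runs without network access.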

dataprep.connector.config_manager.separate_branch(config_path)[source]

Separate the config path into db name and branch

Return type

Tuple[str, str]
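
As an illustration of splitting a config path into a db name and a branch, the sketch below assumes the branch is appended after an "@" (for example "yelp@develop") and that a default branch is used when none is given. Both the separator and the default of "master" are assumptions for this example; separate_branch_sketch is a hypothetical name, not the library function.

```python
from typing import Tuple


def separate_branch_sketch(config_path: str, default_branch: str = "master") -> Tuple[str, str]:
    """Split 'name@branch' into (name, branch); fall back to default_branch."""
    segments = config_path.split("@")
    if len(segments) == 1:
        # No branch given: use the assumed default.
        return segments[0], default_branch
    if len(segments) == 2:
        return segments[0], segments[1]
    # More than one '@' is ambiguous, so reject it.
    raise ValueError(f"Multiple branches in the config path {config_path}")


plain = separate_branch_sketch("yelp")
pinned = separate_branch_sketch("yelp@develop")
```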

Connector

This module contains the Connector class. Every data fetching action should begin with instantiating this Connector class.

class dataprep.connector.connector.Connector(config_path, *, update=False, _auth=None, _concurrency=1, **kwargs)[source]

Bases: object

This is the main class of the connector component. Initialize the Connector class as shown in the example code.

Parameters
  • config_path (str) – The path to the config. It can be hosted, e.g. “yelp”, or from local filesystem, e.g. “./yelp”

  • _auth (Optional[Dict[str, Any]] = None) – The parameters for authentication, e.g. OAuth2

  • _concurrency (int = 1) – The concurrency setting. The default is 1 request per second.

  • update (bool = False) – Force update the config file even if the local version exists.

  • **kwargs – Parameters that are shared by different queries.

Example

>>> from dataprep.connector import Connector
>>> dc = Connector("yelp", _auth={"access_token": access_token})

info()[source]

Show the basic information and provide guidance for users to issue queries.

Return type

None

async query(table, *, _q=None, _auth=None, _count=None, **where)[source]

Query the API to get a table.

Parameters
  • table (str) – The table name.

  • _q (Optional[str] = None) – Search string to be matched in the response.

  • _auth (Optional[Dict[str, Any]] = None) – The parameters for authentication. Usually the authentication parameters should be defined when instantiating the Connector. In case some tables have different authentication options, a different authentication parameter can be defined here. This parameter will override the one from Connector if passed.

  • _count (Optional[int] = None) – Count of returned records.

  • **where – The additional parameters required for the query.

Return type

Union[Awaitable[DataFrame], DataFrame]
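
Because query is declared async and its return type allows either an awaitable or a plain result, calling code may want to handle both cases. The sketch below shows one way to do that with inspect.isawaitable. StubConnector is a hypothetical stand-in used so the example runs offline; it returns a list of dicts rather than a DataFrame, and the table name "businesses" and the location parameter are illustrative only.

```python
import asyncio
import inspect
from typing import Any, Dict, List, Optional


class StubConnector:
    """Hypothetical stand-in for Connector; returns rows instead of a DataFrame."""

    async def query(self, table: str, *, _count: Optional[int] = None, **where: Any) -> List[Dict[str, Any]]:
        # Fabricate _count rows (one row when _count is not given).
        return [{"table": table, "row": i, **where} for i in range(_count or 1)]


async def fetch(connector: Any, table: str, **kwargs: Any) -> Any:
    """Call query and await the result only if it is awaitable."""
    result = connector.query(table, **kwargs)
    if inspect.isawaitable(result):
        result = await result
    return result


rows = asyncio.run(fetch(StubConnector(), "businesses", _count=3, location="Vancouver"))
```

In a script, asyncio.run drives the coroutine to completion; inside an environment that already runs an event loop, such as a notebook, the call would typically be awaited directly instead.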

dataprep.connector.connector.populate_field(fields, jenv, params)[source]

Populate a dict based on the fields definition and provided vars.

Return type

Dict[str, str]

dataprep.connector.connector.validate_fields(fields, data)[source]

Check required fields are provided.

Return type

None
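
A required-field check of this kind can be sketched as follows. The shape of the fields definition (a mapping from field name to a dict with a "required" flag) and the choice of ValueError are assumptions for illustration; the library's actual data structures and exception type may differ.

```python
from typing import Any, Dict, Optional


def validate_fields_sketch(fields: Dict[str, Dict[str, Any]], data: Dict[str, Any]) -> None:
    """Raise ValueError if any field marked required is absent from data."""
    missing = sorted(
        name
        for name, spec in fields.items()
        if spec.get("required", False) and name not in data
    )
    if missing:
        raise ValueError(f"Missing required fields: {missing}")


# Hypothetical field definitions: 'term' is required, 'limit' is optional.
fields = {"term": {"required": True}, "limit": {"required": False}}

# Passes silently because the required field is present.
validate_fields_sketch(fields, {"term": "coffee"})

error_message: Optional[str] = None
try:
    validate_fields_sketch(fields, {"limit": 10})
except ValueError as exc:
    error_message = str(exc)
```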

Info

This module contains back-end functions that help developers use the data connector.

dataprep.connector.info.get_schema(schema)[source]

This method returns the schema of the table that will be returned, so that the user knows what information to expect.

Parameters

schema (Dict[str, Any]) – The schema for the table from the config file.

Returns

The returned data’s schema.

Return type

pandas.DataFrame

Note

The schema is defined in the configuration file. The user can either use the default one or change it by editing the configuration file.

dataprep.connector.info.info(config_path, update=False)[source]

Show the basic information and provide guidance for users to issue queries.

Parameters
  • config_path (str) – The path to the config. It can be hosted, e.g. “yelp”, or from local filesystem, e.g. “./yelp”

  • update (bool) – Force update the config file even if the local version exists.

Return type

None

dataprep.connector.info.websites()[source]

Displays names of websites supported by data connector.

Return type

None

Info UI

This module handles displaying information on how to connect and query.

dataprep.connector.info_ui.info_ui(dbname, tbs)[source]

Fills out the info.txt template file and renders the template to an HTML file.

Parameters
  • dbname (str) – Name of the website

  • tbs (Dict[str, Any]) – Table containing info to be displayed.

Return type

None

Schema

Module contains the loaded config schema.

Implicit database

Module defines ImplicitDatabase and ImplicitTable, where ImplicitDatabase is a conceptual model that describes a website and ImplicitTable describes an API endpoint.

class dataprep.connector.implicit_database.ImplicitDatabase(config_path)[source]

Bases: object

A website that provides data can be treated as a database, represented as ImplicitDatabase in DataConnector.

name: str
tables: Dict[str, dataprep.connector.implicit_database.ImplicitTable]

class dataprep.connector.implicit_database.ImplicitTable(name, config)[source]

Bases: object

The ImplicitTable class abstracts the request and the response to a RESTful API, so that the remote API can be treated as a database table.

config: dataprep.connector.schema.defs.ConfigDef

from_json(data)[source]

Create rows from a JSON string.

Return type

Dict[str, List[Any]]

from_response(payload)[source]

Create a dataframe from an HTTP body payload.

Return type

DataFrame

name: str

Errors

Module defines errors used in this library.

exception dataprep.connector.errors.InvalidAuthParams(params)[source]

Bases: ValueError

The parameters used for Authorization are invalid.

params: Set[str]

exception dataprep.connector.errors.InvalidParameterError(param)[source]

Bases: Exception

The parameter used in the query is invalid.

param: str

exception dataprep.connector.errors.MissingRequiredAuthParams(params)[source]

Bases: ValueError

Some parameters for Authorization are missing.

params: Set[str]

exception dataprep.connector.errors.RequestError(status_code, message)[source]

Bases: dataprep.errors.DataprepError

An error indicating that the status code of the API response is not 200.

message: str
status_code: int

exception dataprep.connector.errors.UniversalParameterOverridden(param, uparam)[source]

Bases: Exception

The parameter is overridden by the universal parameter.

param: str
uparam: str
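
The following sketch shows how calling code might catch a request failure and read the documented status_code and message attributes. RequestErrorStandIn is a local stand-in that mirrors the documented RequestError(status_code, message) signature so the example runs without the library; in real code the exception would be imported from dataprep.connector.errors. The helper describe_failure and the failing_fetch callable are hypothetical.

```python
from typing import Callable


class RequestErrorStandIn(Exception):
    """Stand-in mirroring the documented RequestError(status_code, message)."""

    def __init__(self, status_code: int, message: str) -> None:
        super().__init__(message)
        self.status_code = status_code
        self.message = message


def describe_failure(fetch: Callable[[], object]) -> str:
    """Run fetch and turn a request error into a readable summary."""
    try:
        fetch()
    except RequestErrorStandIn as err:
        return f"request failed with status {err.status_code}: {err.message}"
    return "ok"


def failing_fetch() -> None:
    # Simulate an API response whose status code is not 200.
    raise RequestErrorStandIn(401, "Unauthorized")


summary = describe_failure(failing_fetch)
```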

read_sql

dataprep.connector.read_sql(conn, query, *, return_type='pandas', protocol='binary', partition_on=None, partition_range=None, partition_num=None)[source]

Run the SQL query and download the data from the database into a dataframe. See https://github.com/sfu-db/connector-x for more details.

Parameters
  • conn (str) – the connection string.

  • query (Union[List[str], str]) – a SQL query or a list of SQL queries.

  • return_type (str) – the return type of this function. It can be “arrow”, “pandas”, “modin”, “dask” or “polars”.

  • protocol (str) – the protocol used to fetch data from source. Valid protocols are database dependent (https://github.com/sfu-db/connector-x/blob/main/Types.md).

  • partition_on (Optional[str]) – the column on which to partition the result.

  • partition_range (Optional[Tuple[int, int]]) – the value range of the partition column.

  • partition_num (Optional[int]) – how many partitions to generate.

Example

>>> db_url = "postgresql://username:password@server:port/database"
>>> query = "SELECT * FROM lineitem"
>>> read_sql(db_url, query, partition_on="partition_col", partition_num=10)

Return type

Any
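
To make the relationship between partition_on, partition_range and partition_num concrete, the sketch below splits a value range into contiguous chunks and builds one query per chunk. It is a conceptual illustration only: the helper names split_range and partition_queries are hypothetical, the range is assumed to be inclusive on both ends, and connector-x may split ranges and rewrite queries differently.

```python
from typing import List, Tuple


def split_range(partition_range: Tuple[int, int], partition_num: int) -> List[Tuple[int, int]]:
    """Split an inclusive (lo, hi) range into up to partition_num contiguous chunks."""
    lo, hi = partition_range
    if partition_num < 1 or hi < lo:
        raise ValueError("partition_num must be >= 1 and the range must be non-empty")
    base, extra = divmod(hi - lo + 1, partition_num)
    chunks: List[Tuple[int, int]] = []
    start = lo
    for i in range(partition_num):
        # The first `extra` chunks take one additional value each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # more partitions requested than values available
        chunks.append((start, start + size - 1))
        start += size
    return chunks


def partition_queries(query: str, partition_on: str,
                      partition_range: Tuple[int, int], partition_num: int) -> List[str]:
    """Wrap the query once per chunk, filtering on the partition column."""
    return [
        f"SELECT * FROM ({query}) AS t WHERE t.{partition_on} BETWEEN {a} AND {b}"
        for a, b in split_range(partition_range, partition_num)
    ]


queries = partition_queries("SELECT * FROM lineitem", "partition_col", (1, 10), 3)
```

Each generated query covers a disjoint slice of the partition column, which is what allows the slices to be fetched independently and combined afterwards.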