`read_sql()` function¶

The read_sql function issues a SQL query to a specific database and return the result in a Python DataFrame. In this section, we show how to use this feature.

Install `connectorx`¶

Connector wrap up the function on connectorx, in order to enable read_sql, you need to first install it by running:

pip install connectorx

API¶

read_sql(conn: str, query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)

Run the SQL query, download the data from database into a Pandas dataframe.

Parameters¶

conn: str: Connection string URI.
- General supported URI scheme: (postgres|postgressql|mysql|mssql)://username:password@addr:port/dbname.
- For now sqlite only support absolute path, example: sqlite:///home/user/path/test.db.
query: Union[str, List[str]]: SQL query or list of SQL queries for fetching data.
return_type: str = "pandas": The return type of this function. It can be arrow, pandas, modin, dask or polars.
protocol: str = "binary": The protocol used to fetch data from source, default is binary. Check out here to see more details.
partition_on: Optional[str]: The column to partition the result.
partition_range: Optional[Tuple[int, int]]: The value range of the partition column.
partition_num: Optioinal[int]: The number of partitions to generate.

Examples¶

Read a DataFrame from a SQL using a single thread

from dataprep.connector import read_sql

postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"

read_sql(postgres_url, query)

Read a DataFrame parallelly using 10 threads by automatically partitioning the provided SQL on the partition column (partition_range will be automatically queried if not given)

from dataprep.connector import read_sql

postgres_url = "postgresql://username:password@server:port/database"
query = "SELECT * FROM lineitem"

read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)

Read a DataFrame parallelly using 2 threads by manually providing two partition SQLs (the schemas of all the query results should be same)

from dataprep.connector import read_sql

postgres_url = "postgresql://username:password@server:port/database"
queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]

read_sql(postgres_url, queries)

Read a DataFrame parallelly using 4 threads from a more complex query

from dataprep.connector import read_sql

postgres_url = "postgresql://username:password@server:port/database"
query = f"""
SELECT l_orderkey,
       SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
       o_orderdate,
       o_shippriority
FROM   customer,
       orders,
       lineitem
WHERE  c_mktsegment = 'BUILDING'
       AND c_custkey = o_custkey
       AND l_orderkey = o_orderkey
       AND o_orderdate < DATE '1995-03-15'
       AND l_shipdate > DATE '1995-03-15'
GROUP  BY l_orderkey,
          o_orderdate,
          o_shippriority
"""

read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)

read_sql() function¶

Install connectorx¶

API¶

Parameters¶

Examples¶

`read_sql()` function¶

Install `connectorx`¶