read_sql() function

The read_sql function issues a SQL query to a specific database and return the result in a Python DataFrame. In this section, we show how to use this feature.

Install connectorx

Connector wrap up the function on connectorx, in order to enable read_sql, you need to first install it by running:

pip install connectorx

API

read_sql(conn: str, query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)

Run the SQL query, download the data from database into a Pandas dataframe.

Parameters

  • conn: str: Connection string URI.

    • General supported URI scheme: (postgres|postgressql|mysql|mssql)://username:password@addr:port/dbname.

    • For now sqlite only support absolute path, example: sqlite:///home/user/path/test.db.

  • query: Union[str, List[str]]: SQL query or list of SQL queries for fetching data.

  • return_type: str = "pandas": The return type of this function. It can be arrow, pandas, modin, dask or polars.

  • protocol: str = "binary": The protocol used to fetch data from source, default is binary. Check out here to see more details.

  • partition_on: Optional[str]: The column to partition the result.

  • partition_range: Optional[Tuple[int, int]]: The value range of the partition column.

  • partition_num: Optioinal[int]: The number of partitions to generate.

Examples

  • Read a DataFrame from a SQL using a single thread

    from dataprep.connector import read_sql
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    read_sql(postgres_url, query)
    
  • Read a DataFrame parallelly using 10 threads by automatically partitioning the provided SQL on the partition column (partition_range will be automatically queried if not given)

    from dataprep.connector import read_sql
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
    
  • Read a DataFrame parallelly using 2 threads by manually providing two partition SQLs (the schemas of all the query results should be same)

    from dataprep.connector import read_sql
    
    postgres_url = "postgresql://username:password@server:port/database"
    queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]
    
    read_sql(postgres_url, queries)
    
  • Read a DataFrame parallelly using 4 threads from a more complex query

    from dataprep.connector import read_sql
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = f"""
    SELECT l_orderkey,
           SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
           o_orderdate,
           o_shippriority
    FROM   customer,
           orders,
           lineitem
    WHERE  c_mktsegment = 'BUILDING'
           AND c_custkey = o_custkey
           AND l_orderkey = o_orderkey
           AND o_orderdate < DATE '1995-03-15'
           AND l_shipdate > DATE '1995-03-15'
    GROUP  BY l_orderkey,
              o_orderdate,
              o_shippriority
    """
    
    read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)