Basic usage#

ConnectorX enables you to run a SQL query and load the data from databases into a Pandas DataFrame in the fastest and most memory-efficient way.

API#

connectorx.read_sql(conn: Union[str, Dict[str, str]], query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None, index_col: Optional[str] = None)

Parameters#

  • conn: Union[str, Dict[str, str]]: Connection string URI for querying a single database, or a dict mapping database names (keys) to connection string URIs (values) for querying multiple databases.

    • Please check out here for connection string examples for each database.

  • query: Union[str, List[str]]: SQL query or list of partitioned SQL queries for fetching data.

  • return_type: str = "pandas": The return type of this function. It can be arrow (arrow2), pandas, modin, dask, or polars (see the sketch after this list).

  • protocol: str = "binary": The protocol used to fetch data from the source; the default is binary. Check out here for more details.

  • partition_on: Optional[str]: The column on which to partition the result.

  • partition_range: Optional[Tuple[int, int]]: The value range of the partition column.

  • partition_num: Optional[int]: The number of partitions to generate.

  • index_col: Optional[str]: The index column to set for the result dataframe. Only applicable when return_type is pandas, modin or dask.
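
For instance, a minimal sketch that combines return_type and index_col (the connection string and table name are the same placeholders used in the examples below):

    import connectorx as cx

    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"

    # Return a Polars DataFrame instead of the default pandas one.
    pl_df = cx.read_sql(postgres_url, query, return_type="polars")

    # Return a pandas DataFrame indexed by l_orderkey
    # (index_col only applies when return_type is pandas, modin or dask).
    pd_df = cx.read_sql(postgres_url, query, index_col="l_orderkey")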

Examples#

  • Read a DataFrame from a SQL query using a single thread

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query)
    
  • Read a DataFrame in parallel using 10 threads by automatically partitioning the provided SQL query on the partition column (partition_range will be queried automatically if not given; the next example shows how to supply it explicitly)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
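
  • Read a DataFrame in parallel while supplying partition_range explicitly, so ConnectorX does not need to query the range itself (a minimal sketch for when the minimum and maximum of the partition column are already known; the bounds below are placeholder values)

    import connectorx as cx

    postgres_url = "postgresql://username:password@server:port/database"
    query = "SELECT * FROM lineitem"

    # Placeholder bounds for l_orderkey; replace with the column's actual range.
    cx.read_sql(
        postgres_url,
        query,
        partition_on="l_orderkey",
        partition_range=(0, 60000000),
        partition_num=10,
    )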
    
  • Read a DataFrame in parallel using 2 threads by manually providing two partitioned SQL queries (the schemas of all query results should be the same)

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"]
    
    cx.read_sql(postgres_url, queries)
    
    
  • Read a DataFrame in parallel using 4 threads from a more complex query

    import connectorx as cx
    
    postgres_url = "postgresql://username:password@server:port/database"
    query = """
    SELECT l_orderkey,
           SUM(l_extendedprice * ( 1 - l_discount )) AS revenue,
           o_orderdate,
           o_shippriority
    FROM   customer,
           orders,
           lineitem
    WHERE  c_mktsegment = 'BUILDING'
           AND c_custkey = o_custkey
           AND l_orderkey = o_orderkey
           AND o_orderdate < DATE '1995-03-15'
           AND l_shipdate > DATE '1995-03-15'
    GROUP  BY l_orderkey,
              o_orderdate,
              o_shippriority 
    """
    
    cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)
    
    
  • Read a DataFrame from a SQL query that joins tables from multiple databases (experimental, only PostgreSQL is supported for now)

    import connectorx as cx
    
    db1 = "postgresql://username1:password1@server1:port1/database1"
    db2 = "postgresql://username2:password2@server2:port2/database2"
    query = "SELECT * FROM db1.nation n, db2.region r where n.n_regionkey = r.r_regionkey"
    
    cx.read_sql({"db1": db1, "db2": db2}, query)