Basic usage#
ConnectorX enables you to run the SQL query, load data from databases into a Pandas Dataframe in the fastest and most memory efficient way.
API#
connectorx.read_sql(conn: Union[str, Dict[str, str]], query: Union[List[str], str], *, return_type: str = "pandas", protocol: str = "binary", partition_on: Optional[str] = None, partition_range: Optional[Tuple[int, int]] = None, partition_num: Optional[int] = None)
Parameters#
conn: Union[str, Dict[str, str]]
: Connection string URI for querying single database or dict of database names (key) and connection string URIs (value) for querying multiple databases.Please check out here for connection string examples of each database
query: Union[str, List[str]]
: SQL query or list of partitioned SQL queries for fetching data.return_type: str = "pandas"
: The return type of this function. It can bearrow
(arrow2
),pandas
,modin
,dask
orpolars
.protocol: str = "binary"
: The protocol used to fetch data from source, default isbinary
. Check out here to see more details.partition_on: Optional[str]
: The column to partition the result.partition_range: Optional[Tuple[int, int]]
: The value range of the partition column.partition_num: Optional[int]
: The number of partitions to generate.index_col: Optional[str]
: The index column to set for the result dataframe. Only applicable whenreturn_type
ispandas
,modin
ordask
.
Examples#
Read a DataFrame from a SQL using a single thread
import connectorx as cx postgres_url = "postgresql://username:password@server:port/database" query = "SELECT * FROM lineitem" cx.read_sql(postgres_url, query)
Read a DataFrame parallelly using 10 threads by automatically partitioning the provided SQL on the partition column (
partition_range
will be automatically queried if not given)import connectorx as cx postgres_url = "postgresql://username:password@server:port/database" query = "SELECT * FROM lineitem" cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=10)
Read a DataFrame parallelly using 2 threads by manually providing two partition SQLs (the schemas of all the query results should be same)
import connectorx as cx postgres_url = "postgresql://username:password@server:port/database" queries = ["SELECT * FROM lineitem WHERE l_orderkey <= 30000000", "SELECT * FROM lineitem WHERE l_orderkey > 30000000"] cx.read_sql(postgres_url, queries)
Read a DataFrame parallelly using 4 threads from a more complex query
import connectorx as cx postgres_url = "postgresql://username:password@server:port/database" query = f""" SELECT l_orderkey, SUM(l_extendedprice * ( 1 - l_discount )) AS revenue, o_orderdate, o_shippriority FROM customer, orders, lineitem WHERE c_mktsegment = 'BUILDING' AND c_custkey = o_custkey AND l_orderkey = o_orderkey AND o_orderdate < DATE '1995-03-15' AND l_shipdate > DATE '1995-03-15' GROUP BY l_orderkey, o_orderdate, o_shippriority """ cx.read_sql(postgres_url, query, partition_on="l_orderkey", partition_num=4)
Read a DataFrame from a SQL joined from multiple databases (experimental, only support PostgreSQL for now)
import connectorx as cx db1 = "postgresql://username1:password1@server1:port1/database1" db2 = "postgresql://username2:password2@server2:port2/database2" query = "SELECT * FROM db1.nation n, db2.region r where n.n_regionkey = r.r_regionkey" cx.read_sql({"db1": db1, "db2": db2}, query)