Updated on 2025-08-25 GMT+08:00

Examples of DataFrames Without UDFs

Below is an example using query1 from tpch to demonstrate the usage of DataFrame.

The query SQL is as follows:

SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity) AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
    avg(l_quantity) AS avg_qty,
    avg(l_extendedprice) AS avg_price,
    avg(l_discount) AS avg_disc,
    count(*) AS count_order
FROM
    lineitem
WHERE
    l_shipdate <= CAST('1998-09-02' AS date)
GROUP BY
    l_returnflag,
    l_linestatus
ORDER BY
    l_returnflag,    l_linestatus;

The corresponding DataFrame logic is as follows:

import ibis   # Import Ibis dependencies.
con = ibis.fabric.connect(    # Call the DataArts Fabric backend connection to establish a connection.
endpoint=FABRIC_ENDPOINT,         # Specify the service region. Refer to Regions and Endpoints.
endpoint_id=FABRIC_ENDPOINT_ID,   # Query endpoint_id. See "Appendix" in the API Reference.
domain=FABRIC_DOMAIN,      # Tenant name
user=FABRIC_USER,                  # IAM username
password=FABRIC_PASS,          # IAM password
project_id=FABRIC_PROJECT_ID,    # How Do I Obtain a Project ID?
catelog_name=IBIS_TEST_FABRIC_CATELOG,      # Connect to a specified catalog.
workspace_id=FABRIC_WORKSPACE_ID,   # Get workspace_id. See "Appendix" in the API Reference.
lakeformation_instance_id=IBIS_TEST_FABRIC_LAKEFORMATION_INSTANCE_ID,          # LakeFormation instance ID
obs_directory_base=OBS_DIRECTORY_BASE,   # Storage path for UDFs in OBS
obs_bucket_name=OBS_BUCKET_NAME,      # OBS bucket name
obs_server=OBS_SERVER,    # OBS access address. Refer to Endpoints and Domain Names.
)
t = con.table("lineitem", database="tpch")    # Retrieve table information by connecting to the backend and create a table object.
q = t.filter(t.l_shipdate <= add_date("1998-12-01", dd=-90))
discount_price = t.l_extendedprice * (1 - t.l_discount)
charge = discount_price * (1 + t.l_tax)
q = q.group_by(["l_returnflag", "l_linestatus"])
q = q.aggregate(
    sum_qty=t.l_quantity.sum(),
    sum_base_price=t.l_extendedprice.sum(),
    sum_disc_price=discount_price.sum(),
    sum_charge=charge.sum(),
    avg_qty=t.l_quantity.mean(),
    avg_price=t.l_extendedprice.mean(),
    avg_disc=t.l_discount.mean(),
    count_order=lambda t: t.count(),
)
q = q.order_by(["l_returnflag", "l_linestatus"])
sql = q.compile()            # Compile DataFrame into a SQL string.
df = q.execute()                  # Execute the expression and return the result set.