Updated on 2022-02-22 GMT+08:00

Enhanced SQL Query of Multi-sourced Data

Scenario

Enterprises usually store mass data, such as various databases and warehouses, for management and information collection. However, diversified data sources, hybrid dataset structures, and scattered data storage lower query efficiency.

The open-sourced Spark only supports pushdown of simple FILTER during querying of multi-sourced data. The SQL engine performance is deteriorated due of a large amount of unnecessary data transmission. The pushdown function is enhanced, so that AGGREGATE, COMPLEX PROJECTION, and COMPLEX PREDICATE can be pushed to the data source, reducing unnecessary data transmission and improving query performance.

Only the JDBC data source supports pushdown of query operations, such AS AGGREGATE, PROJECTION, PREDICATE, AGGREGATE OVER INNER JOIN, and AGGREGATE OVER UNION ALL. All pushdown operation can be enabled based on the requirements of scenarios.

Table 1 Enhanced query of multi-sourced data

Operation

Before Enhancement

After Enhancement

aggregate

The pushdown of AGGREGATE is not supported

  • Aggregation functions including SUM, AVG, MAX, MIN, and COUNT are supported.

    For example: select count(*) from table

  • Internal expressions of aggregation functions are supported.

    For example: select sum(a+b) from table

  • Calculation of aggregation functions is supported. For example: select avg(a) + max(b) from table
  • Pushdown of HAVING is supported

    For example: select sum(a) from table where a>0 group by b having sum(a)>10

  • Pushdown of functions is supported.

    Pushdown of lines in mathematics, time, and character string functions, such as abs(), month(), and length() are supported. In addition to the preceding built-in functions, functions supported by data sources can be added by SET command.

    For example: select sum(abs(a)) from table

  • Pushdown of LIMIT and ORDER BY after AGGREGATE is supported. However, pushdown of LIMIT and ORDER BY is not supported in Oracle, because Oracle does not support LIMIT.

    For example: select sum(a) from table where a>0 group by b order by sum(a) limit 5

projection

Only pushdown of simple PROJECTION is supported. For example: select a, b from table

  • Pushdown of complex expressions is supported

    For example: select (a+b)*c from table

  • Pushdown of functions listed in the nodes following the table is supported.

    For example: select length(a)+abs(b) from table

  • Pushdown of LIMIT and ORDER BY after PROJECTION is supported.

    For example: select a, b+c from table order by a limit 3

predicate

Only simple filter with the column name on the left of the operator and values on the right is supported. For example:

select * from table where a>0 or b in ("aaa", "bbb")

  • Pushdown of complex expressions is supported.

    For example: select * from table where a+b>c*d or a/c in (1, 2, 3)

  • Pushdown of functions listed in the nodes following the table is supported.

    For example: select * from table where length(a)>5

aggregate over inner join

Related data from the two tables must be loaded to Spark. JOIN must be performed before AGGREGATE.

Following scenarios are supported:

  • Aggregation functions including SUM, AVG, MAX, MIN, and COUNT are supported.
  • All AGGREGATE operations can be performed in a same table. GROUP BY can be performed on one or two tables and only inner join is supported.

Following scenarios are not supported:

  • AGGREGATE both the left table and right table.
  • AGGREGATE contains operations, for example, sum(a+b).
  • AGGREGATE operations, for example, sum(a)+min(b).

aggregate over union all

Related data from the two tables must be loaded to Spark. UNION must be performed before AGGREGATE.

Aggregation functions including SUM, AVG, MAX, MIN, and COUNT are supported.

Following scenarios are not supported:

  • AGGREGATE contains operations, for example, sum(a+b).
  • AGGREGATE operations, for example, sum(a)+min(b).

Precautions

  • If the external data source is Hive, query operation cannot be performed on external tables created by Spark.
  • The data source only supports MySQL and MPPDB.