Scenario
Enterprises usually store mass data, such as various databases and warehouses, for management and information collection. However, diversified data sources, hybrid dataset structures, and scattered data storage lower query efficiency.
The open-sourced Spark only supports pushdown of simple FILTER during querying of multi-sourced data. The SQL engine performance is deteriorated due of a large amount of unnecessary data transmission. The pushdown function is enhanced, so that AGGREGATE, COMPLEX PROJECTION, and COMPLEX PREDICATE can be pushed to the data source, reducing unnecessary data transmission and improving query performance.
Only the JDBC data source supports pushdown of query operations, such AS AGGREGATE, PROJECTION, PREDICATE, AGGREGATE OVER INNER JOIN, and AGGREGATE OVER UNION ALL. All pushdown operation can be enabled based on the requirements of scenarios.
Table 1 Enhanced query of multi-sourced data
Operation |
Before Enhancement |
After Enhancement |
aggregate |
The pushdown of AGGREGATE is not supported |
- Calculation of aggregation functions is supported. For example: select avg(a) + max(b) from table
- Pushdown of functions is supported.
Pushdown of lines in mathematics, time, and character string functions, such as abs(), month(), and length() are supported. In addition to the preceding built-in functions, functions supported by data sources can be added by SET command.
For example: select sum(abs(a)) from table
|
projection |
Only pushdown of simple PROJECTION is supported. For example: select a, b from table |
- Pushdown of complex expressions is supported
For example: select (a+b)*c from table
- Pushdown of functions listed in the nodes following the table is supported.
For example: select length(a)+abs(b) from table
- Pushdown of LIMIT and ORDER BY after PROJECTION is supported.
For example: select a, b+c from table order by a limit 3
|
predicate |
Only simple filter with the column name on the left of the operator and values on the right is supported. For example:
select * from table where a>0 or b in ("aaa", "bbb") |
- Pushdown of complex expressions is supported.
For example: select * from table where a+b>c*d or a/c in (1, 2, 3)
- Pushdown of functions listed in the nodes following the table is supported.
For example: select * from table where length(a)>5
|
aggregate over inner join |
Related data from the two tables must be loaded to Spark. JOIN must be performed before AGGREGATE. |
Following scenarios are supported:
- Aggregation functions including SUM, AVG, MAX, MIN, and COUNT are supported.
- All AGGREGATE operations can be performed in a same table. GROUP BY can be performed on one or two tables and only inner join is supported.
Following scenarios are not supported:
- AGGREGATE both the left table and right table.
- AGGREGATE contains operations, for example, sum(a+b).
- AGGREGATE operations, for example, sum(a)+min(b).
|
aggregate over union all |
Related data from the two tables must be loaded to Spark. UNION must be performed before AGGREGATE. |
Aggregation functions including SUM, AVG, MAX, MIN, and COUNT are supported.
Following scenarios are not supported:
- AGGREGATE contains operations, for example, sum(a+b).
- AGGREGATE operations, for example, sum(a)+min(b).
|
Precautions
- If the external data source is Hive, query operation cannot be performed on external tables created by Spark.
- The data source only supports MySQL and MPPDB.