Optimizing Subqueries
Subquery Background
Applications use a large number of subqueries when operating on databases through SQL statements. Compared with joining two tables directly, a subquery often expresses the structure and intent of a query more clearly. In complex queries in particular, subqueries have complete, self-contained semantics, which makes the SQL expression of service logic clearer and easier to understand. This is why they are so widely used.
DataArts Fabric SQL categorizes subqueries into subqueries and sublinks based on their positions within SQL statements.
- Subquery: corresponds to a range table entry (RangeTblEntry) in the query parse tree. That is, a subquery is a SELECT statement that immediately follows the FROM keyword.
- Sublink: corresponds to an expression in the query parse tree. That is, a sublink is a statement in the WHERE or ON clause or in the target list.
In summary, within the context of the query parse tree, a subquery fundamentally represents a range table, whereas a sublink embodies an expression. Given that sublinks can manifest within both constraint conditions and expressions, according to the implementation of sublinks in DataArts Fabric SQL, they can be categorized as follows:
- exist_sublink: corresponding to the EXISTS and NOT EXISTS statements.
- any_sublink: corresponding to the OP ANY(SELECT...) statement. OP can be the IN, <, >, or = operator.
- all_sublink: corresponding to the OP ALL(SELECT...) statement. OP can be the IN, <, >, or = operator.
- rowcompare_sublink: corresponding to the RECORD OP (SELECT...) statement.
- expr_sublink: corresponding to the (SELECT with a single target list item) statement.
- array_sublink: corresponding to the ARRAY(SELECT...) statement.
- cte_sublink: corresponding to the WITH(...) statement.
In OLAP and HTAP scenarios, the most commonly used sublinks are exist_sublink and any_sublink, and the optimization engine of DataArts Fabric SQL optimizes both of them through sublink pull-up. Because subqueries can be used so flexibly in SQL statements, overly complex subqueries can cause performance issues. Subqueries are broadly classified as non-correlated subqueries and correlated subqueries.
- Non-correlated subquery
The execution of a subquery is independent from any attribute of outer queries. In this way, a subquery can be executed before outer queries.
Example:
select t1.c1,t1.c2 from t1 where t1.c1 in ( select c2 from t2 where t2.c2 IN (2,3,4) );
- Correlated subquery
The execution of a subquery depends on attributes of the outer query, which are used as AND conditions of the subquery. In the following example, t1.c1 in the condition t2.c1 = t1.c1 is such a dependent attribute. Because the subquery depends on the outer query, it needs to be executed once for each row of the outer query.
Example:
select t1.c1,t1.c2 from t1 where t1.c1 in ( select c2 from t2 where t2.c1 = t1.c1 AND t2.c2 in (2,3,4) );
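The difference between the two statements above can be tried out on a few rows. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in engine. The table contents are made up for illustration, and SQLite does not reproduce how DataArts Fabric SQL plans or executes these statements; only the result sets are compared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data, chosen so the two statements differ.
conn.executescript("""
    CREATE TABLE t1 (c1 INTEGER, c2 INTEGER);
    CREATE TABLE t2 (c1 INTEGER, c2 INTEGER);
    INSERT INTO t1 VALUES (2, 10), (3, 20), (5, 30);
    INSERT INTO t2 VALUES (2, 2), (9, 3), (5, 7);
""")

# Non-correlated: the inner SELECT references only t2, so its result
# does not change from one outer row to the next.
non_correlated = conn.execute("""
    SELECT t1.c1, t1.c2 FROM t1
    WHERE t1.c1 IN (SELECT c2 FROM t2 WHERE t2.c2 IN (2, 3, 4))
    ORDER BY t1.c1
""").fetchall()

# Correlated: the inner SELECT references t1.c1, so its result
# depends on the current outer row.
correlated = conn.execute("""
    SELECT t1.c1, t1.c2 FROM t1
    WHERE t1.c1 IN (SELECT c2 FROM t2
                    WHERE t2.c1 = t1.c1 AND t2.c2 IN (2, 3, 4))
    ORDER BY t1.c1
""").fetchall()

print(non_correlated)
print(correlated)
```

With this data, the row with t1.c1 = 3 passes the non-correlated filter but is rejected by the correlated one, because t2 has no row with c1 = 3.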
Optimization of Sublinks in DataArts Fabric SQL
A subquery is pulled up to join with tables in outer queries, preventing the subquery from being converted into the combination of a subplan and broadcast. You can run the EXPLAIN statement to check whether a subquery is converted into the combination of a subplan and broadcast.
- Sublink pull-up scenarios supported by DataArts Fabric SQL:
- Pulling up the IN sublink
- The subquery cannot contain columns in the outer query (columns in more outer queries are allowed).
- The subquery cannot contain volatile functions.
- Pulling up the EXISTS sublink
The WHERE clause must contain a column in the outer query. Other parts of the subquery cannot contain the column. Other restrictions are as follows:
- The subquery must contain the FROM clause.
- The subquery cannot contain the WITH clause.
- The subquery cannot contain aggregate functions.
- The subquery cannot contain a SET, SORT, LIMIT, WindowAgg, or HAVING operation.
- The subquery cannot contain volatile functions.
- Pulling up an equivalent-correlated sublink containing aggregate functions
The WHERE condition of the subquery must contain a column from the outer query. Equivalence comparison must be performed between this column and related columns in tables of the subquery. These conditions must be connected using AND. Other parts of the subquery cannot contain the column. Other restrictions are as follows:
- The expressions in the WHERE condition of the subquery must be table columns.
- After the SELECT keyword of the subquery, there must be only one output column. The output column must be an aggregate function (for example, MAX), and the parameter (for example, t2.c2) of the aggregate function cannot be columns of a table (for example, t1) in outer queries. The aggregate function cannot be COUNT.
For example, the following subquery can be pulled up:
select * from t1 where c1 >( select max(t2.c1) from t2 where t2.c1=t1.c1 );
The following subquery cannot be pulled up because it contains no aggregate function:
select * from t1 where c1 >( select t2.c1 from t2 where t2.c1=t1.c1 );
The following subquery cannot be pulled up because the subquery has two output columns:
select * from t1 where (c1,c2) >( select max(t2.c1),min(t2.c2) from t2 where t2.c1=t1.c1 );
- The subquery must contain a FROM clause.
- The subquery cannot contain a GROUP BY, HAVING, or SET operation.
- Tables in the subquery can be joined only by inner joins.
- The target list of the subquery cannot contain set-returning functions.
- The WHERE condition of the subquery must contain a column from the outer query. Equivalence comparison must be performed between this column and related columns in tables of the subquery. These conditions must be connected using AND. Other parts of the subquery cannot contain the column. For example, the following subquery can be pulled up:
select * from t3 where t3.c1=( select t1.c1 from t1 where c1 >( select max(t2.c1) from t2 where t2.c1=t1.c1 ));
If another condition is added to the subquery in the previous example, the subquery cannot be pulled up because it then references a column of the outer query (t3.c1 in t3.c1>t2.c2). Example:
select * from t3 where t3.c1=( select t1.c1 from t1 where c1 >( select max(t2.c1) from t2 where t2.c1=t1.c1 and t3.c1>t2.c2 ));
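To make the idea behind this pull-up concrete, the sketch below compares a correlated aggregate sublink with a hand-written join against a grouped derived table. It runs on Python's built-in sqlite3 module with made-up data. Two things are assumptions made for illustration: the statement compares c2 instead of c1 (the c1 form from the example above always yields an empty result, since the maximum of the values equal to t1.c1 is t1.c1 itself), and the join form is only a picture of what a pull-up achieves, not the plan DataArts Fabric SQL actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data.
conn.executescript("""
    CREATE TABLE t1 (c1 INTEGER, c2 INTEGER);
    CREATE TABLE t2 (c1 INTEGER, c2 INTEGER);
    INSERT INTO t1 VALUES (1, 10), (2, 5), (3, 7), (4, 1);
    INSERT INTO t2 VALUES (1, 4), (1, 8), (2, 5), (2, 9), (4, 0);
""")

# Correlated sublink: one aggregate output column, correlated through
# an equality condition (t2.c1 = t1.c1).
sublink_form = conn.execute("""
    SELECT t1.c1, t1.c2 FROM t1
    WHERE t1.c2 > (SELECT max(t2.c2) FROM t2 WHERE t2.c1 = t1.c1)
    ORDER BY t1.c1
""").fetchall()

# Join form: aggregate once per correlation key, then join on that key.
# Rows of t1 without a match in t2 drop out, just as the comparison
# with a NULL sublink result filters them out above.
join_form = conn.execute("""
    SELECT t1.c1, t1.c2
    FROM t1
    JOIN (SELECT c1, max(c2) AS max_c2 FROM t2 GROUP BY c1) AS agg
      ON agg.c1 = t1.c1
    WHERE t1.c2 > agg.max_c2
    ORDER BY t1.c1
""").fetchall()

print(sublink_form)
print(join_form)
```

Both statements return the same rows on this data. The join form computes the aggregate once per value of t2.c1 instead of once per row of t1.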
- Pulling up a sublink in the OR clause
If the WHERE condition contains an EXISTS sublink connected by OR, for example:
select a, c from t1 where t1.a = (select avg(a) from t3 where t1.b = t3.b) or exists (select * from t4 where t1.c = t4.c);
The procedure for pulling up the sublinks in the OR clause is as follows:
- Extract opExpr from the OR clause in the WHERE condition. The value is t1.a = (select avg(a) from t3 where t1.b = t3.b).
- The opExpr contains a subquery. If the subquery can be pulled up, it is rewritten as select avg(a), t3.b from t3 group by t3.b, generating the NOT NULL condition t3.b is not null. The opExpr is replaced with this NOT NULL condition. In this case, the SQL statement changes to:
select a, c from t1 left join (select avg(a) avg, t3.b from t3 group by t3.b) as t3 on (t1.a = avg and t1.b = t3.b) where t3.b is not null or exists (select * from t4 where t1.c = t4.c);
- Extract the EXISTS sublink exists (select * from t4 where t1.c = t4.c) from the OR clause to check whether the sublink can be pulled up. If it can be pulled up, it is converted into select t4.c from t4 group by t4.c, generating the NOT NULL condition t4.c is not null. In this case, the SQL statement changes to:
select t1.a, t1.c from t1 left join (select avg(a) avg, t3.b from t3 group by t3.b) as t3 on (t1.a = avg and t1.b = t3.b) left join (select t4.c from t4 group by t4.c) as t4 on (t1.c = t4.c) where t3.b is not null or t4.c is not null;
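The rewrite steps above can be checked for result equivalence on a small data set. The sketch below uses Python's built-in sqlite3 module as a stand-in engine with made-up rows. The second left join is given an alias and an ON condition (t1.c = s4.c) so that the statement runs, and the derived tables are named s3 and s4 with the average aliased as avg_a. These names are illustrative choices and do not come from DataArts Fabric SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; t4 contains a duplicate on purpose.
conn.executescript("""
    CREATE TABLE t1 (a INTEGER, b INTEGER, c INTEGER);
    CREATE TABLE t3 (a INTEGER, b INTEGER);
    CREATE TABLE t4 (c INTEGER);
    INSERT INTO t1 VALUES (2, 1, 10), (5, 1, 20), (4, 2, 30),
                          (7, 3, 40), (9, 4, 50);
    INSERT INTO t3 VALUES (1, 1), (3, 1), (4, 2), (8, 4), (9, 4);
    INSERT INTO t4 VALUES (20), (20), (50);
""")

# Original statement: two sublinks connected by OR.
original = conn.execute("""
    SELECT a, c FROM t1
    WHERE t1.a = (SELECT avg(a) FROM t3 WHERE t1.b = t3.b)
       OR EXISTS (SELECT * FROM t4 WHERE t1.c = t4.c)
    ORDER BY a
""").fetchall()

# Rewritten statement: each sublink becomes a left join against a
# grouped derived table, and the OR tests whether either join matched.
rewritten = conn.execute("""
    SELECT t1.a, t1.c
    FROM t1
    LEFT JOIN (SELECT avg(a) AS avg_a, b FROM t3 GROUP BY b) AS s3
      ON (t1.a = s3.avg_a AND t1.b = s3.b)
    LEFT JOIN (SELECT c FROM t4 GROUP BY c) AS s4
      ON (t1.c = s4.c)
    WHERE s3.b IS NOT NULL OR s4.c IS NOT NULL
    ORDER BY t1.a
""").fetchall()

print(original)
print(rewritten)
```

Grouping each derived table by its join key guarantees at most one match per row of t1, so the left joins cannot duplicate outer rows even though t4 contains the value 20 twice.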
- Sublink pull-up scenarios not supported by DataArts Fabric SQL:
Sublinks other than those described above cannot be pulled up. In this case, a join subquery is planned as the combination of a subplan and broadcast. As a result, if tables in the subquery contain a large amount of data, query performance may be poor.
If a correlated subquery joins with two tables in outer queries, the subquery cannot be pulled up. You need to change the parent query into a WITH clause and then perform the join.
Example:
select distinct t1.a, t2.a from t1 left join t2 on t1.a=t2.a and not exists (select a,b from test1 where test1.a=t1.a and test1.b=t2.a);
The parent query is changed into:
with temp as ( select * from (select t1.a as a, t2.a as b from t1 left join t2 on t1.a=t2.a) ) select distinct a,b from temp where not exists (select a,b from test1 where temp.a=test1.a and temp.b=test1.b);
- The subquery (without COUNT) in the target list cannot be pulled up.
Example:
explain (costs off) select (select c2 from t2 where t1.c1 = t2.c1) ssq, t1.c2 from t1 where t1.c2 > 10;
The execution plan is as follows:
The correlated subquery appears in the target list (query return list), so a value must be returned even for rows where the condition t1.c1=t2.c1 is not met. Therefore, use a left outer join to join t1 and t2 so that the SSQ returns padding values for those rows.
ScalarSubQuery (SSQ) and Correlated-ScalarSubQuery (CSSQ) are described as follows:
- SSQ: a sublink that returns a scalar value, that is, a single row with a single column
- CSSQ: an SSQ containing correlation conditions
The preceding SQL statement can be changed into:
with ssq as ( select t2.c1, t2.c2 from t2 ) select ssq.c2, t1.c2 from t1 left join ssq on t1.c1 = ssq.c1 where t1.c2 > 10;
The execution plan after the change is as follows:
In the preceding example, the SSQ is pulled up into a right join, which avoids the poor performance caused by the combination of a subplan and broadcast when the table in the subquery (t2) is too large.
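The rewrite of a target-list SSQ into a left join can be checked on sample rows. The sketch below uses Python's built-in sqlite3 module as a stand-in engine with made-up data. It assumes that t2.c1 is unique, so that the scalar subquery returns at most one row per outer row. Without that assumption the two statements are not interchangeable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data; t2.c1 is unique by construction.
conn.executescript("""
    CREATE TABLE t1 (c1 INTEGER, c2 INTEGER);
    CREATE TABLE t2 (c1 INTEGER PRIMARY KEY, c2 INTEGER);
    INSERT INTO t1 VALUES (1, 5), (2, 15), (3, 25), (4, 35);
    INSERT INTO t2 VALUES (2, 200), (4, 400), (9, 900);
""")

# Correlated scalar subquery in the target list.
ssq_form = conn.execute("""
    SELECT (SELECT c2 FROM t2 WHERE t1.c1 = t2.c1) AS ssq, t1.c2
    FROM t1
    WHERE t1.c2 > 10
    ORDER BY t1.c2
""").fetchall()

# Left outer join form: unmatched rows of t1 are padded with NULL,
# which is what the scalar subquery returns when it finds no row.
join_form = conn.execute("""
    WITH ssq AS (SELECT t2.c1, t2.c2 FROM t2)
    SELECT ssq.c2, t1.c2
    FROM t1 LEFT JOIN ssq ON t1.c1 = ssq.c1
    WHERE t1.c2 > 10
    ORDER BY t1.c2
""").fetchall()

print(ssq_form)
print(join_form)
```

The row with t1.c1 = 3 has no partner in t2 and appears with a NULL (None in Python) first column in both results, which is why an inner join would not be a valid replacement.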
- The subquery (with COUNT) in the target list cannot be pulled up.
Example:
select (select count(*) from t2 where t2.c1=t1.c1) cnt, t1.c1, t3.c1 from t1,t3 where t1.c1=t3.c1 order by cnt, t1.c1;
The execution plan is as follows:
The correlated subquery appears in the target list (query return list), so a value must be returned even for rows where the condition t1.c1=t2.c1 is not met. Therefore, use a left outer join to join t1 and t2 so that the SSQ returns padding values for those rows. Because COUNT is used, 0 rather than NULL must be returned when the condition is not met. You can use a CASE expression of the form case when cnt is null then 0 else cnt end for this purpose.
The preceding SQL statement can be changed into:
with ssq as ( select count(*) cnt, c1 from t2 group by c1 ) select case when ssq.cnt is null then 0 else ssq.cnt end cnt, t1.c1, t3.c1 from t1 left join ssq on ssq.c1 = t1.c1,t3 where t1.c1 = t3.c1 order by cnt, t1.c1;
The execution plan after the change is as follows:
- Pulling up nonequivalent subqueries
select t1.c1, t1.c2 from t1 where t1.c1 = (select agg() from t2 where t2.c2 > t1.c2);
Nonequivalent subqueries cannot be pulled up. You can rewrite the statement using two joins: one on the correlation key and one self-join on the row ID.
You can rewrite the statement in either of the following ways:
- Subquery rewriting
select t1.c1, t1.c2 from t1, ( select t1.rowid, agg() aggref from t1,t2 where t2.c2 > t1.c2 group by t1.rowid ) dt /* derived table */ where t1.rowid = dt.rowid AND t1.c1 = dt.aggref;
- CTE rewriting
WITH dt as ( select t1.rowid, agg() aggref from t1,t2 where t2.c2 > t1.c2 group by t1.rowid ) select t1.c1, t1.c2 from t1, dt where t1.rowid = dt.rowid AND t1.c1 = dt.aggref;
- Currently, DataArts Fabric SQL cannot efficiently provide globally unique row IDs for tables and intermediate result sets, so rewrites based on row IDs are difficult in this scenario. You are advised to avoid this problem at the service layer or to use t1.xc_node_id + t1.ctid as the row ID. However, the high repetition rate of xc_node_id reduces join efficiency, and the xc_node_id+ctid type cannot be used as the join condition of a hash join.
- If the aggregate is COUNT(*), use CASE-WHEN to pad unmatched rows with 0. For any other aggregate, unmatched rows are padded with NULL.
- CTE rewriting works better using ShareScan.
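The row-ID rewrite of a nonequivalent subquery can be tried out as follows. This is a minimal sketch on Python's built-in sqlite3 module with made-up data. It substitutes max(t2.c1) for the agg() placeholder, uses the comparison t2.c2 > t1.c2 from the original statement, and relies on SQLite's implicit rowid column as a stand-in row ID. As noted above, DataArts Fabric SQL does not offer an efficient equivalent, so the sketch checks only that the rewrite returns the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data.
conn.executescript("""
    CREATE TABLE t1 (c1 INTEGER, c2 INTEGER);
    CREATE TABLE t2 (c1 INTEGER, c2 INTEGER);
    INSERT INTO t1 VALUES (9, 1), (7, 5), (3, 8), (4, 10);
    INSERT INTO t2 VALUES (9, 3), (7, 6), (4, 9);
""")

# Nonequivalent correlated subquery: the correlation is a ">" comparison.
original = conn.execute("""
    SELECT t1.c1, t1.c2 FROM t1
    WHERE t1.c1 = (SELECT max(t2.c1) FROM t2 WHERE t2.c2 > t1.c2)
    ORDER BY t1.c2
""").fetchall()

# Rewrite: first join t1 and t2 on the nonequivalent condition and
# aggregate per t1 row, then self-join t1 to that result by row ID.
rewritten = conn.execute("""
    SELECT t1.c1, t1.c2
    FROM t1,
         (SELECT t1.rowid AS rid, max(t2.c1) AS aggref
          FROM t1, t2
          WHERE t2.c2 > t1.c2
          GROUP BY t1.rowid) AS dt
    WHERE t1.rowid = dt.rid AND t1.c1 = dt.aggref
    ORDER BY t1.c2
""").fetchall()

print(original)
print(rewritten)
```

Rows of t1 with no qualifying row in t2 are absent from dt and therefore drop out of the rewritten statement, matching the original, where the subquery returns NULL and the equality comparison fails.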
More Optimization Examples
Example: To improve query performance, modify the SELECT statement either by turning the subquery into a join between the primary table and the parent query or by rewriting the subquery itself. Ensure that the rewritten statement remains semantically equivalent to the original.
explain (costs off) select * from t1 where t1.c1 in (select t2.c1 from t2 where t1.c2 = t2.c2);
The plan for the preceding statement contains a subplan. To eliminate the subplan, modify the statement as follows:
explain (costs off) select * from t1 where exists (select 1 from t2 where t1.c1 = t2.c1 and t1.c2 = t2.c2);
In this way, the subplan is replaced by the semi-join between the two tables, significantly improving the execution efficiency.
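Before making such a change, it is worth confirming that the EXISTS form returns the same rows as the IN form. The following minimal sketch does this on Python's built-in sqlite3 module with made-up data that includes duplicates and NULLs. SQLite serves only to compare result sets and says nothing about the plans produced by DataArts Fabric SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data with a duplicate row and NULL keys in t2.
conn.executescript("""
    CREATE TABLE t1 (c1 INTEGER, c2 INTEGER);
    CREATE TABLE t2 (c1 INTEGER, c2 INTEGER);
    INSERT INTO t1 VALUES (1, 1), (2, 2), (3, 3), (NULL, 4);
    INSERT INTO t2 VALUES (1, 1), (1, 1), (2, 9), (3, 3), (NULL, 4);
""")

# Correlated IN sublink, as in the first statement.
in_form = conn.execute("""
    SELECT * FROM t1
    WHERE t1.c1 IN (SELECT t2.c1 FROM t2 WHERE t1.c2 = t2.c2)
    ORDER BY t1.c1
""").fetchall()

# EXISTS form: the IN comparison becomes an extra equality condition.
exists_form = conn.execute("""
    SELECT * FROM t1
    WHERE EXISTS (SELECT 1 FROM t2
                  WHERE t1.c1 = t2.c1 AND t1.c2 = t2.c2)
    ORDER BY t1.c1
""").fetchall()

print(in_form)
print(exists_form)
```

Both forms return each qualifying row of t1 once, even though t2 contains (1, 1) twice. The row with a NULL c1 is excluded by both, because in a WHERE clause an IN result of NULL filters the row just as FALSE does.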