Help Center > > Developer Guide> Query Performance Optimization> Tuning Queries> Typical SQL Optimization Methods> Optimizing Subqueries

Optimizing Subqueries

Updated at: May 15, 2019 GMT+08:00

What Is a Subquery

A SELECT...FROM...WHERE... statement is called a query block. Nesting a query block in the FROM, WHERE, or HAVING clause of another query is called a nested query. Queries nested in other queries are nested subqueries or subqueries for short.

Subqueries can be categorized into correlated subqueries and uncorrelated subqueries. Execution of a correlated subquery depends on some attributes of its parent query and the parameters delivered by the parent query. If some of the parameters have been changed, you need to execute the subquery again to obtain a new result. Uncorrelated subqueries are executed separately, and you need to only execute each once.

Optimization Example

Uncorrelated subqueries do not affect the query performance, because each of them is executed once only. Correlated subqueries need to be executed more than once, greatly affecting the query performance. Two common examples are as follows:

1. Change the base table to a replication table and create an index on the filter column.

create table master_table (a int);
create table sub_table(a int, b int);
select a from master_table group by a having a in (select a from sub_table); 

In this example, a correlated subquery is contained. To improve the query performance, you can change sub_table to a replication table and create an index on the a column.

2. Modify the SELECT statement, change the subquery to a JOIN relationship between the primary table and the parent query, or modify the subquery to improve the query performance. Ensure that the subquery to be used is semantically correct.

explain (costs off)select * from master_table as t1 where t1.a in (select t2.a from sub_table as t2 where t1.a = t2.b);
                        QUERY PLAN
----------------------------------------------------------
 Streaming (type: GATHER)
   Node/s: All datanodes
   ->  Seq Scan on master_table t1
         Filter: (SubPlan 1)
         SubPlan 1
           ->  Result
                 Filter: (t1.a = t2.b)
                 ->  Materialize
                       ->  Streaming(type: BROADCAST)
                             Spawn on: All datanodes
                             ->  Seq Scan on sub_table t2
(11 rows)

In the preceding example, a subplan is used. To remove the subplan, you can modify the statement as follows:

explain(costs off) select * from master_table as t1 where exists (select t2.a from sub_table as t2 where t1.a = t2.b and t1.a = t2.a);
                    QUERY PLAN
--------------------------------------------------
 Streaming (type: GATHER)
   Node/s: All datanodes
   ->  Hash Semi Join
         Hash Cond: (t1.a = t2.b)
         ->  Seq Scan on master_table t1
         ->  Hash
               ->  Streaming(type: REDISTRIBUTE)
                     Spawn on: All datanodes
                     ->  Seq Scan on sub_table t2
(9 rows)

In this way, the subplan is replaced by the semi-join between the two tables, greatly improving the execution efficiency.

Did you find this page helpful?

Submit successfully!

Thank you for your feedback. Your feedback helps make our documentation better.

Failed to submit the feedback. Please try again later.

Which of the following issues have you encountered?







Please complete at least one feedback item.

Content most length 200 character

Content is empty.

OK Cancel