Help Center/ GaussDB(DWS)/ Performance Tuning/ Optimization Cases/ Case: Rewriting SQL Statements and Deleting in-clause
Updated on 2023-06-25 GMT+08:00

Case: Rewriting SQL Statements and Deleting in-clause

Before Optimization

in-clause/any-clause is a common SQL statement constraint. Sometimes, the clause following in or any is a constant. For example:

1
2
3
4
select 
count(1) 
from calc_empfyc_c1_result_tmp_t1 
where ls_pid_cusr1 in ('20120405', '20130405');

or

1
2
3
4
select 
count(1) 
from calc_empfyc_c1_result_tmp_t1 
where ls_pid_cusr1 in any('20120405', '20130405');

Some special usages are as follows:

1
2
3
4
5
SELECT 
ls_pid_cusr1,COALESCE(max(round((current_date-bthdate)/365)),0)
FROM calc_empfyc_c1_result_tmp_t1 t1,p10_md_tmp_t2 t2
WHERE t1.ls_pid_cusr1 = any(values(id),(id15))
GROUP BY ls_pid_cusr1;

Where id and id15 are columns of p10_md_tmp_t2. ls_pid_cusr1 = any(values(id),(id15)) equals t1. ls_pid_cusr1 = id or t1. ls_pid_cusr1 = id15.

Therefore, join-condition is essentially an inequality, and nestloop must be used for this join operation. The execution plan is as follows:

After Optimization

The test result shows that both result sets are too large. As a result, nestloop is time-consuming with more than one hour to return results. Therefore, the key to performance optimization is to eliminate nestloop, using more efficient hashjoin. From the perspective of semantic equivalence, the SQL statements can be written as follows:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
select
ls_pid_cusr1,COALESCE(max(round(ym/365)),0)
from
(
         (
                   SELECT 
                            ls_pid_cusr1,(current_date-bthdate) as ym
                   FROM calc_empfyc_c1_result_tmp_t1 t1,p10_md_tmp_t2 t2
                   WHERE t1.ls_pid_cusr1 = t2.id and t1.ls_pid_cusr1 != t2.id15
         )
         union all
         (
                   SELECT 
                            ls_pid_cusr1,(current_date-bthdate) as ym
                   FROM calc_empfyc_c1_result_tmp_t1 t1,p10_md_tmp_t2 t2
                   WHERE t1.ls_pid_cusr1 = id15
         )
)
GROUP BY ls_pid_cusr1;

Note: Use UNION ALL instead of UNION if possible. UNION eliminates duplicate rows while merging two result sets but UNION ALL merges the two result sets without deduplication. Therefore, replace UNION with UNION ALL if you are sure that the two result sets do not contain duplicate rows based on the service logic.

The optimized SQL queries consist of two equivalent join subqueries, and each subquery can be used for hashjoin in this scenario. The optimized execution plan is as follows:

Before the optimization, no result is returned for more than 1 hour. After the optimization, the result is returned within 7s.