SQL Statement Rewriting Rules

Based on the database SQL execution mechanism and a large number of practices, summarize finds that: using rules of a certain SQL statement, on the basis of the so that the correct test result, which can improve the SQL execution efficiency. You can comply with these rules to greatly improve service query efficiency.

Replacing UNION with UNION ALL
UNION eliminates duplicate rows while merging two result sets but UNION ALL merges the two result sets without deduplication. Therefore, replace UNION with UNION ALL if you are sure that the two result sets do not contain duplicate rows based on the service logic.
Adding NOT NULL to the join column
If there are many NULL values in the join column, adding an IS NOT NULL filter condition allows for pre-filtering of data, thereby enhancing join efficiency.
Use hashagg.
If a plan involving groupAgg and SORT operations generated by the GROUP BY statement is poor in performance, you can set work_mem to a larger value to generate a hashagg plan, which does not require sorting and improves the performance.
Replace functions with CASE statements
The performance of DataArts Fabric SQL significantly deteriorates if too many functions are called. To address this, you can change the pushdown functions to CASE statements based on site requirements.
Do not use functions or expressions for partition keys.
Using functions or expressions for partition keys impacts partition pruning and leads to an increase in the volume of data scanned.
Do not use != or <> operators, NULL, OR, or implicit parameter conversion in WHERE clauses.
Split complex SQL statements.
You can split a SQL statement and save the execution result to a temporary table if the SQL statement is too complex to be optimized using the solutions above, including but not limited to the following scenarios:
- The same subquery is involved in multiple SQL statements of a task and the subquery contains large amounts of data.
- Functions (such as subst and to_number) lead to inaccurate calculation of selectivity for subqueries with large data volumes.
- BROADCAST subqueries are performed on large tables in multi-DN environment.