DWS SQL Writing Rules

DDL

[Proposal] In DWS, you are advised to execute DDL operations, such as creating table or making comments, separately from batch processing jobs to avoid performance deterioration caused by many concurrent transactions.
[Proposal] Execute data truncation after unlogged tables are used because DWS cannot ensure the security of unlogged tables in abnormal scenarios.
[Proposal] Suggestions on the storage mode of temporary and unlogged tables are the same as those on base tables. Create temporary tables in the same storage mode as the base tables to avoid high computing costs caused by hybrid row and column correlation.
[Proposal] The total length of an index column cannot exceed 50 bytes. Otherwise, the index size will increase greatly, resulting in large storage cost and low index performance.
[Proposal] Do not delete objects using DROP...CASCADE, unless the dependency between objects is specified. Otherwise, the objects may be deleted by mistake.

Data Loading and Uninstalling

[Proposal] Provide the inserted column list in the insert statement. Example:

       
            INSERT INTO task(name,id,comment) VALUES ('task1','100','100th task');

[Proposal] After data is imported to the database in batches or the data increment reaches the threshold, you are advised to analyze tables to prevent the execution plan from being degraded due to inaccurate statistics.
[Proposal] To clear all data in a table, you are advised to use TRUNCATE TABLE instead of DELETE TABLE. DELETE TABLE is not efficient and cannot release disk space occupied by the deleted data.

Type conversion

[Proposal] Perform type coercion to convert data types. If you perform implicit conversion, the result may differ from expected.
[Proposal] During data query, explicitly specify the data type for constants, and do not attempt to perform any implicit data type conversion.
[Notice] In Oracle compatibility mode, null strings will be automatically converted to NULL during data import. If a null string needs to be reserved, you need to create a database that is compatible with Teradata.

Query Operation

[Proposal] Do not return a large number of result sets to a client except the ETL program. If a large result set is returned, consider modifying your service design.
[Proposal] Perform DDL and DML operations encapsulated in transactions. Operations like table truncation, update, deletion, and dropping, cannot be rolled back once committed. You are advised to encapsulate such operations in transactions so that you can roll back the operations if necessary.
[Proposal] During query compilation, you are advised to list all columns to be queried and avoid using *. Doing so reduces output lines, improves query performance, and avoids the impact of adding or deleting columns on front-end service compatibility.
[Proposal] During table object access, add the schema prefix to the table object to avoid accessing an unexpected table due to schema switchover.
[Proposal] The cost of joining more than eight tables or views, especially full joins, is difficult to be estimated. You are advised to use the WITH TABLE AS statement or other methods to create interim tables to improve the readability of SQL statements.
[Proposal] Do not use Cartesian products or full joins. Cartesian products and full joins will result in a sharp expansion of result sets and poor performance.
[Notice] Only IS NULL and IS NOT NULL can be used to determine NULL value comparison results. If any other method is used, NULL is returned. For example, NULL instead of expected Boolean values is returned for NULL<>NULL, NULL=NULL, and NULL<>1.
[Notice] Do not use count(col) instead of count(*) to count the total number of records in a table. count(*) counts the NULL value (actual rows) while count (col) does not.
[Notice] While executing count(col), the number of NULL record rows is counted as 0. While executing sum(col), NULL is returned if all records are NULL. If not all the records are NULL, the number of NULL record rows is counted as 0.
[Notice] To count multiple columns using count(), column names must be enclosed with parentheses. For example, count ((col1, col2, col3)). Note: When multiple columns are used to count the number of NULL record rows, a row is counted even if all the selected columns are NULL. The result is the same as that when count(*) is executed.
[Notice] Null records are not counted when count(distinct col) is used to calculate the number of non-null columns that are not repeated.
[Notice] If all statistical columns are NULL when count(distinct (col1,col2,...)) is used to count the number of unique values in multiple columns, Null records are also counted, and the records are considered the same.
[Notice] When constants are used to filter data, the system searches for functions used for calculating these two data types based on the data types of the constants and matched columns. If no function is found, the system converts the data type implicitly. Then, the system searches for a function used for calculating the converted data type.
```
SELECT * FROM test WHERE timestamp_col = 20000101;
```
In the preceding example, if timestamp_col is the timestamp type, the system first searches for the function that supports the "equal" operation of the timestamp and int types (constant numbers are considered as the int type). If no such function is found, the timestamp_col data and constant numbers are implicitly converted into the text type for calculation.

[Proposal] Do not use scalar subquery statements. A scalar subquery appears in the output list of a SELECT statement. In the following example, the part enclosed in parentheses is a scalar subquery statement:

       
            SELECT id, (SELECT COUNT(*) FROM films f WHERE f.did = s.id) FROM staffs_p1 s;

Scalar subqueries often result in query performance deterioration. During application development, scalar subqueries need to be converted into equivalent table associations based on the service logic.

[Proposal] In WHERE clauses, the filtering conditions should be sorted. The condition that few records are selected for reading (the number of filtered records is small) is listed at the beginning.

[Proposal] Filtering conditions in WHERE clauses should comply with unilateral rules. That is, when the column name is placed on one side of a comparison operator, the optimizer automatically performs pruning optimization in some scenarios. Filtering conditions in a WHERE clause will be displayed in col op expression format, where col indicates a table column, op indicates a comparison operator, such as = and >, and expression indicates an expression that does not contain a column name. For example:

       
            SELECT id, from_image_id, from_person_id, from_video_id FROM face_data WHERE current_timestamp(6) - time < '1 days'::interval;

The modification is as follows:

       
            SELECT id, from_image_id, from_person_id, from_video_id FROM face_data where time >  current_timestamp(6) - '1 days'::interval;

[Proposal] Do not perform unnecessary sorting operations. Sorting requires a large amount of memory and CPU. If service logic permits, ORDER BY and LIMIT can be combined to reduce resource overhead. By default, data in DWS is sorted by ASC & NULL LAST.
[Proposal] When the ORDER BY clause is used for sorting, specify sorting modes (ASC or DESC), and use NULL FIRST or NULL LAST for NULL record sorting.
[proposal] Do not rely on only the LIMIT clause to return the result set displayed in a specific sequence. Combine ORDER BY and LIMIT clauses for some specific result sets and use offset to skip specific results if necessary.
[Proposal] If the service logic is accurate, you are advised to use UNION ALL instead of UNION.

[Proposal] If a filtering condition contains only an OR expression, convert the OR expression to UNION ALL to improve performance. SQL statements that use OR expressions cannot be optimized, resulting in slow execution. Example:

       
            SELECT * FROM scdc.pub_menu 
WHERE (cdp= 300 AND inline=301) OR (cdp= 301 AND inline=302) OR (cdp= 302 AND inline=301);

Convert the statement to the following:

       
            SELECT * FROM scdc.pub_menu 
WHERE (cdp= 300 AND inline=301) 
union all
SELECT * FROM scdc.pub_menu 
WHERE (cdp= 301 AND inline=302) 
union all 
SELECT * FROM scdc.pub_menu 
WHERE (cdp= 302 AND inline=301);

[Proposal] If an in(val1, val2, va...) expression contains a large number of columns, you are advised to replace it with the in (values (va1), (val2),(val3...) statement. The optimizer will automatically convert the IN constraint into a non-correlated subquery to improve the query performance.

[Proposal] Replace (not) in with (not) exist when associated columns do not contain NULL values. For example, in the following query statement, if the T1.C1 column does not contain any NULL value, add the NOT NULL constraint to the T1.C1 column, and then rewrite the statements.

       
            SELECT * FROM T1 WHERE T1.C1 NOT IN (SELECT T2.C2 FROM T2);

Rewrite the statement as follows:

       
            SELECT * FROM T1 WHERE NOT EXISTS (SELECT  * FROM T1,T2 WHERE T1.C1=T2.C2);

If the value of the T1.C1 column will possibly be NULL, the preceding rewriting cannot be performed.
If T1.C1 is the output of a subquery, check whether the output is NOT NULL based on the service logic.

[Proposal] Use cursors instead of the LIMIT OFFSET syntax to perform pagination queries to avoid resource overheads caused by multiple executions. A cursor must be used in a transaction, and you must disable it and commit transaction once the query is finished.