Updated on 2024-08-20 GMT+08:00

SQL Compilation

DDL

  • In GaussDB, you are advised to perform DDL operations (such as table creation and COMMENT) in a unified manner. Prevent DDL operations in batch processing jobs so that performance is not affected.
  • Perform the TRUNCATE operation immediately after the unlogged table is used. GaussDB cannot ensure the security of unlogged tables in abnormal scenarios.
  • Suggestions on the storage mode of temporary and unlogged tables are the same as those on base tables.
  • The total length of an index column cannot exceed 50 bytes. Otherwise, the index size will increase greatly, resulting in large storage cost and low index performance.
  • Do not delete objects using DROP...CASCADE, unless the dependency between objects is specified. Otherwise, the objects may be deleted by mistake.

Data Loading and Unloading

  • Explicitly set the inserted column list in the INSERT statement. Example:
    1
    INSERT INTO task(name,id,comment) VALUES ('task1','100','100th task');
    
  • After data is imported to the database in batches or the data increment reaches the threshold, you are advised to analyze tables to prevent the execution plan from being degraded due to inaccurate statistics.
  • To clear all data in a table, you are advised to use TRUNCATE TABLE instead of DELETE TABLE. DELETE TABLE is not efficient and cannot release disk space occupied by the deleted data.

Type Conversion

  • Convert data types explicitly. If you perform implicit conversion, the result may differ from expected.
  • During data query, explicitly specify the data type for constants, and do not attempt to perform any implicit data type conversion.
  • If sql_compatibility is set to ORA, null strings will be automatically converted to NULL during data import. If null strings need to be reserved, set sql_compatibility to TD.

Query Operation

  • Do not return a large number of result sets to a client except the ETL program. If a large result set is returned, consider modifying your service design.
  • Perform DDL and DML operations encapsulated in transactions. For example, operations such as TRUNCATE TABLE, UPDATE TABLE, DELETE TABLE, and DROP TABLE cannot be restored once they are committed. You are advised to encapsulate such operations in transactions so that you can roll back the operations if necessary.
  • During query compilation, you are advised to list all columns to be queried and avoid using SELECT *. Doing so reduces output columns, improves query performance, and avoids the impact of adding or deleting columns on front-end service compatibility.
  • During table object access, add the schema prefix to the table object to avoid accessing an unexpected table due to schema switchover.
  • The cost of joining more than three tables or views, especially full joins, is difficult to be estimated. You are advised to use the WITH TABLE AS statement to create interim tables to improve the readability of SQL statements.
  • Avoid using Cartesian products or full joins. Cartesian products and full joins will result in a sharp expansion of result sets and poor performance.
  • Only IS NULL and IS NOT NULL can be used to determine NULL value comparison results. If any other method is used, NULL is returned. For example, NULL instead of expected Boolean values is returned for NULL<>NULL, NULL=NULL, and NULL<>1.
  • Do not use count(col) instead of count(*) to count the total number of records in a table. count(*) counts the NULL value (actual rows) while count(col) does not.
  • While executing count(col), the number of NULL record rows is counted as 0. While executing sum(col), NULL is returned if all records are NULL. If not all the records are NULL, the number of NULL record rows is counted as 0.
  • To count multiple columns using count(), column names must be enclosed in parentheses. For example, count ((col1, col2, col3)). Note: When multiple columns are used to count the number of NULL record rows, a row is counted even if all the selected columns are NULL. The result is the same as that when count(*) is executed.
  • NULL records are not counted when count(distinct col) is used to calculate the number of non-NULL columns that are not repeated.
  • If all statistical columns are NULL when count(distinct (col1,col2,...)) is used to count the number of unique values in multiple columns, Null records are also counted, and the records are considered the same.
  • Use the connection operator || to replace the concat function for string connection, because the output of the concat function depends on the data type of the strings to be connected. When the execution plan is generated, the value cannot be calculated in advance. As a result, the query performance deteriorates severely.
  • Use the time-related macros listed in Table 1 to replace the now function and obtain the current time, because the execution plan generated by the now function cannot be pushed down to disks. As a result, the query performance severely deteriorates.
    Table 1 Time-related macros

    Macro Name

    Description

    Example

    CURRENT_DATE

    Obtains the current date, excluding the hour, minute, and second details.

    1
    2
    3
    4
    5
    gaussdb=# SELECT CURRENT_DATE;
    date
    ------------
    2018-02-02
    (1 row)
    

    CURRENT_TIME

    Obtains the current time, excluding the year, month, and day.

    1
    2
    3
    4
    5
    gaussdb=# SELECT CURRENT_TIME;
    timetz
    --------------------
    00:39:34.633938+08
    (1 row)
    

    CURRENT_TIMESTAMP(n)

    Obtains the current date and time, including the year, month, day, hour, minute, second, and time zone.

    NOTE:

    n indicates the number of digits after the decimal point in the time string.

    1
    2
    3
    4
    5
    gaussdb=# SELECT CURRENT_TIMESTAMP(6);
    timestamptz
    -------------------------------
    2018-02-02 00:39:55.231689+08
    (1 row)
    
  • Do not use scalar subquery statements. A scalar subquery is a subquery in the output list of the SELECT statement. In the following example, "SELECT COUNT(*) FROM films f WHERE f.did = s.id" is a scalar subquery statement:
    1
    SELECT id, (SELECT COUNT(*) FROM films f WHERE f.did = s.id) FROM staffs_p1 s;
    

    Scalar subqueries often result in query performance deterioration. During application development, scalar subqueries need to be converted into equivalent table associations based on the service logic.

  • In WHERE clauses, the filter conditions should be collated. The condition that few records are selected for reading (the number of filtered records is small) is listed at the beginning.
  • Filter conditions in WHERE clauses should comply with unilateral rules, that is, to place the column name on one side of a comparison operator. In this way, the optimizer automatically performs pruning optimization in some scenarios. The format is col op expression, where col indicates a table column, op indicates a comparison operator, such as = and >, and expression indicates an expression that does not contain a column name. For example:
    1
    SELECT id, from_image_id, from_person_id, from_video_id FROM face_data WHERE current_timestamp(6) - time < '1 days'::interval;
    

    The modification is as follows:

    1
    SELECT id, from_image_id, from_person_id, from_video_id FROM face_data where time >  current_timestamp(6) - '1 days'::interval;
    
  • Do not perform unnecessary collation operations. Collation requires a large amount of memory and CPU. If service logic permits, ORDER BY and LIMIT can be combined to reduce resource overhead. By default, GaussDB performs collation by ASC & NULL LAST.
  • When the ORDER BY clause is used for collation, specify collation modes (ASC or DESC), and use NULL FIRST or NULL LAST for NULL record sorting.
  • Do not rely on only the LIMIT clause to return the result set displayed in a specific sequence. Combine ORDER BY and LIMIT clauses if some specific result sets are returned, and use OFFSET to skip specific results if necessary.
  • If the service logic is accurate, you are advised to use UNION ALL instead of UNION.
  • If a filter condition contains only an OR expression, convert the OR expression to UNION ALL to improve performance. SQL statements that use OR expressions cannot be optimized, resulting in slow execution. For example, the conversion of the following statements:
    1
    2
    SELECT * FROM scdc.pub_menu 
    WHERE (cdp= 300 AND inline=301) OR (cdp= 301 AND inline=302) OR (cdp= 302 ANDinline=301);
    

    Convert the statement to the following:

    1
    2
    3
    4
    5
    6
    7
    8
    SELECT * FROM scdc.pub_menu 
    WHERE (cdp= 300 AND inline=301) 
    union all
    SELECT * FROM scdc.pub_menu 
    WHERE (cdp= 301 AND inline=302) 
    union all 
    SELECT * FROM tablename 
    WHERE (cdp= 302 AND inline=301)
    
  • If an IN(val1, val2, val3…) expression contains a large number of columns, you are advised to replace it with the IN (VALUES (val1), (val2),(val3)…) statement. The optimizer will automatically convert the IN constraint into a non-correlated subquery to improve the query performance.
  • Use (NOT) EXIST instead of (NOT) IN when associated columns do not contain null values. For example, in a query statement, if the T1.C1 column does not contain any NULL value, add the NOT NULL constraint to the T1.C1 column, and then rewrite the statements.
    1
    SELECT * FROM T1 WHERE T1.C1 NOT IN (SELECT T2.C2 FROM T2);
    

    Rewrite the statement as follows:

    1
    SELECT * FROM T1 WHERE NOT EXISTS (SELECT  * FROM T2 WHERE T1.C1=T2.C2);
    
    • If the value of the T1.C1 column is not NOT NULL, the preceding rewriting cannot be performed.
    • If the T1.C1 column is the output of a subquery, check whether the output is NOT NULL based on the service logic.
  • Use cursors instead of the LIMIT OFFSET syntax to perform pagination queries to avoid resource overheads caused by multiple executions. A cursor must be used in a transaction, and you must disable the cursor and commit the transaction once the query is finished.