Updated on 2024-09-05 GMT+08:00

Table Diagnosis

GaussDB(DWS) provides statistics and diagnostic tools to help you learn about table status, including:

  • Skew Rate: monitors and analyzes data table statistics in the cluster, and displays information about the 50 largest tables whose skew rate is higher than 5%.
  • Dirty Page Rate: monitors and analyzes data table statistics in the cluster, and displays information about the 50 largest tables whose dirty page rate is higher than 50%.
  • DDL Audit: DDL review is a type of SQL review. To prevent improper DDL design from affecting services, this tool checks whether DDL metadata is standard, detecting potential table definition problems in advance. The check result can also be used as a reference for locating performance issues.
  • Only 8.1.1.x and later versions support the table skew rate and dirty page rate features. For earlier versions, contact technical support.
  • Only 8.1.1.300 and later versions support the DDL review feature. For earlier versions, contact technical support.
  • The collection period of the table skew rate and dirty page rate can be configured on the Monitoring Collection page of the cluster. Frequent collection may affect cluster performance. Set a proper period based on your cluster workloads.

Skew Rate

Context

Improper distribution columns can cause severe skew during operator computation or when data is spilled to disk. Workloads are then unevenly distributed across DNs, resulting in high disk usage on a single DN and degraded performance. You can query the size and skew rate of your tables and change the distribution columns of tables with severe skew. In cluster versions 8.1.0 and later, you can use the ALTER TABLE syntax to change distribution columns. In earlier versions, perform the operations described in How Do I Change Distribution Columns?.
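For reference, the following is a minimal SQL sketch of checking a table's skew and changing its distribution column on an 8.1.0 or later cluster. The table name t1 and column c2 are placeholders, and the skew-inspection function table_skewness is an assumption about the built-in tooling of your cluster version; verify the exact names in your environment.

  -- Inspect how the rows of a table are spread across DNs (function name assumed).
  SELECT table_skewness('public.t1');

  -- In 8.1.0 and later, redistribute the table on a more even column (t1 and c2 are placeholders).
  ALTER TABLE public.t1 DISTRIBUTE BY HASH (c2);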

Procedure

  1. Log in to the GaussDB(DWS) console.
  2. On the Clusters > Dedicated Clusters page, locate the cluster to be monitored.
  3. In the Operation column of the target cluster, click Monitoring Panel.
  4. In the navigation tree on the left, choose Utilities > Table Diagnosis and click the Skew Rate tab. The tables that meet the statistics collection conditions in the cluster are displayed.

Dirty Page Rate

Context

DML operations on tables may generate dirty data, which unnecessarily occupies cluster storage. You can query the dirty page rate of tables and optimize large tables and tables with a high dirty page rate. For details, see Solution to High Disk Usage and Cluster Read-Only.
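As a rough illustration, you could also inspect dirty data from the system views and reclaim the space with VACUUM FULL. The view pgxc_get_stat_all_tables and its dirty_page_rate column are assumptions about typical GaussDB(DWS) releases, and the table name t1 is a placeholder; verify the names in your cluster version.

  -- List tables with many dead tuples or a high dirty page rate (view and column names assumed).
  SELECT schemaname, relname, n_live_tup, n_dead_tup, dirty_page_rate
  FROM pgxc_get_stat_all_tables
  ORDER BY dirty_page_rate DESC
  LIMIT 50;

  -- Reclaim the space occupied by dirty data in a specific table (t1 is a placeholder).
  VACUUM FULL public.t1;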

Procedure

  1. Log in to the GaussDB(DWS) console.
  2. On the Clusters > Dedicated Clusters page, locate the cluster to be monitored.
  3. In the Operation column of the target cluster, click Monitoring Panel.
  4. In the navigation tree on the left, choose Utilities > Table Diagnosis and click the Dirty Page Rate tab. The tables that meet the statistics collection conditions in the cluster are displayed.

DDL Audit

Viewing and Exporting DDL Audit Results

  1. Log in to the GaussDB(DWS) console.
  2. On the Clusters > Dedicated Clusters page, locate the cluster to be monitored.
  3. In the Operation column of the target cluster, click Monitoring Panel.
  4. In the navigation tree on the left, choose Utilities > Table Diagnosis, and click the DDL Audit tab. The audit results are displayed.

    The selected audit items are displayed on the DDL Audit tab by default. You can configure the audit items on the Monitoring Collection tab. For more information, see Table 1. A hedged DDL example that illustrates several of these recommendations follows this procedure.

    Table 1 Audit items (each audit item is listed with its key, followed by its description)

    Number of Distribution Keys (disKeyCount)

    If there is no data skew, use no more than four distribution keys.

    Generally, using more distribution keys spreads data more evenly across the cluster, thus avoiding data skew. However, if too many distribution keys are used, storage performance and join query performance may deteriorate. You are advised to configure no more than four distribution keys.

    • Storage performance issue:

      When data is inserted, the hash function calculates a result for each distribution column, aggregates the results, and then determines where to distribute the data. A large number of distribution keys requires complex, time-consuming calculation.

    • Join query performance issue:

      During a multi-table join query, if all the distribution key columns are involved in the join condition, data does not need to be redistributed in the execution plan. If a large number of distribution keys are used, some of them may not appear in the join condition, and data redistribution may occur, which consumes many resources and takes a long time.

    Number of Index Columns/PCKs (indexKeyOrPckCount)

    It is recommended that the number of index columns or partial cluster key (PCK) columns be less than or equal to 4.

    • A large number of index columns requires many resources to maintain the index data and is likely to result in duplicate indexes.
    • While column-store data is imported, PCK columns are compared and computed to determine how data is divided into CUs. A large number of PCKs consumes many resources and much time, affecting performance. For CUs to be filtered efficiently in a query, the columns in the query conditions must form a prefix of the PCK columns (for example, if the PCK columns are a, b, and c, the query conditions should be of the form a>? and b>? and c>?). Otherwise, all CUs must be traversed, and data clustering does not contribute to query acceleration.

    Invalid PCKs (invalidPck)

    Do not create invalid PCK columns.

    In 8.1.1 and later versions, the cluster can filter and compare data of the char, int8, int2, int4, text, bpchar, varchar, date, time, timestamp, and timestamptz types. If a column of an unsupported data type is used as a PCK, the column is an invalid PCK column: it does not take effect during CU filtering but still consumes resources for its maintenance.

    numeric Data Usage (validityOfNumeric)

    For numeric data, use integer types if possible. If high precision is not required, use the fixed-length float type, which offers better computing performance than the variable-length numeric type.

    If the numeric type is used, you are advised to specify a precision and scale within 38 digits. When such numeric values are used in calculations, the underlying layer attempts to convert the operations into int and bigint operations to improve efficiency, that is, to apply large integer optimization.

    In 8.1.1 and later versions, if no precision is specified, a maximum of 131,072 digits can be placed before the decimal point and a maximum of 16,383 digits can be placed after the decimal point. That is, the maximum scale and precision are used. In this case, large integer optimization cannot be performed during calculation, and the calculation efficiency decreases accordingly.

    Index Column Width (widthOfIndexKey)

    Generally, wide index columns are character string columns, which are rarely used in comparison operations and lead to large indexes that consume unnecessary space. Keep the index column width smaller than 64 bytes.

    Replication Table Size (sizeOfCopyTable)

    Replication tables that occupy more storage space than the threshold (100 MB) on a single DN are identified. For such tables, you are advised to use commonly joined columns as the distribution key (generally a primary key).

    The cluster supports replication tables. A replication table maintains a full copy of the data on every node and is mainly used to store small, enumeration-like data. If such a table contains too much data, it occupies a large amount of space. In addition, in a join query each node traverses all of the table's data, which may take longer than the same query would after the table is changed to a distribution table. (Although data may need to be redistributed for a distribution table, the amount of data each node traverses decreases.)

    Skew Detection for Single-Distribution-Key Tables (recognitionOfDataSkew)

    Data skew of single-distribution-key tables is detected by statistics. This audit applies only to tables with one distribution key.

    Distribution Key Usage (validityOfDiskey)

    In a cluster, you are not advised to use a column of the Boolean or date type as a distribution column, because it may cause data skew.

    Number of Cached Sequence Values (cacheSizeOfSequence)

    Specify a number greater than 100.

    If a table column uses a sequence, its next value is obtained from the values cached on the local node. When the cached values are used up, a request is sent to the GTM for new values. If a large amount of data is inserted but only a few values are cached, the GTM receives many requests and may become overloaded or even break down. To avoid this, you are advised to set the cache to a value greater than 100 when creating a sequence.

    Optimizable Indexes (optimizableIndexKey)

    Scenarios where indexes can be optimized:

    • The index columns of an index are the first N columns of another index.
    • The index columns of two indexes are the same, but their order is different.

  5. If the audit result of an item is Failed, click View to go to the details page.
  6. Click Export in the upper left corner to export the audit result.
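To make several of the recommendations in Table 1 concrete, the following is a hedged DDL sketch, not a definitive template: it uses a single hash distribution key, an explicit numeric precision, a small set of PCK columns of supported types, and a sequence cache greater than 100. All object and column names are placeholders, and the exact options available depend on your cluster version.

  -- Sequence with a cache larger than 100 to reduce requests to the GTM.
  CREATE SEQUENCE seq_order_id CACHE 1000;

  -- Column-store table following the audit recommendations (names are placeholders).
  CREATE TABLE sales_order (
      order_id     bigint DEFAULT nextval('seq_order_id'),
      customer_id  int    NOT NULL,
      order_date   date   NOT NULL,
      amount       numeric(18,4),        -- explicit precision keeps large integer optimization possible
      PARTIAL CLUSTER KEY (order_date)   -- few PCK columns, of supported data types
  )
  WITH (orientation = column)
  DISTRIBUTE BY HASH (order_id);         -- a single distribution key; avoid Boolean or date columns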

Manually Auditing DDL Items

  1. Log in to the GaussDB(DWS) console.
  2. On the Clusters > Dedicated Clusters page, locate the cluster to be monitored.
  3. In the Operation column of the target cluster, click Monitoring Panel. The database monitoring page is displayed.
  4. In the navigation tree on the left, choose Utilities > Table Diagnosis, and click the DDL Audit tab. On the page that is displayed, select the items to be audited and click One-Click Audit.