Before You Start: Performance Management Requirements
Effective performance management of the GaussDB(DWS) database system is vital. To prevent frequent overload of cluster resources such as CPU, I/O, memory, and disk space, control and limit the services running in the cluster and their overall resource consumption. Regular proactive O&M and advance scale-out planning are also necessary.
Before introducing a new service, evaluate and pressure-test the existing resources to avoid excessive resource consumption and a negative impact on overall cluster performance. As the data volume of existing services grows, so do the cluster's disk space and I/O usage, so aged and unnecessary data must be cleared periodically.
This section provides an overview of the cluster's performance baseline and outlines the performance management requirements in typical service scenarios. Its purpose is to assist users and O&M personnel in evaluating the cluster's capacity in advance and preventing resource overload.
GaussDB(DWS) Cluster Performance Baseline
In this section, you will find information about the recommended values and risk values of GaussDB(DWS) resources.
When the resource watermark exceeds the recommended value, it is crucial for O&M personnel to promptly address the issue to prevent performance degradation in scenarios such as node faults and active/standby switchover.
Exceeding the risk value for the cluster resource watermark indicates potential overload. In such cases, refrain from introducing new services.
Instead, swiftly reduce the overall cluster load through service optimization or by scheduling tasks during off-peak hours. If needed, split or scale out the cluster to avoid any impact on overall performance.
| Metric | Recommended Value | Impact of Exceeding the Recommended Value | Recommended Measure | Risk Value | Impact of Exceeding the Risk Value | Recommended Measure |
| --- | --- | --- | --- | --- | --- | --- |
| CPU usage | Less than 60% | When the active/standby nodes are unbalanced or a node is faulty, the CPU of some nodes may become overloaded, causing performance degradation. | Configure a resource pool for resource isolation (see the resource pool sketch after this table). For details, see GaussDB(DWS) Resource Load Management. Use Real-Time Queries and Performance Monitoring to capture statements with high CPU usage for service optimization. For details, see Monitoring and Diagnosing Top SQL Statements in a GaussDB(DWS) Cluster. | 80% | Severe CPU contention occurs. As a result, the execution time of operators such as Stream deteriorates, and overall cluster performance is severely affected. | Reduce the CPU load during peak hours through service staggering, service splitting, service optimization, and cluster scale-out. You can also set the CPU limit and quota of the resource pool. For details, see the advanced tuning operations in Tuning Systems with High CPU Usage. |
| CPU skew | Less than 15% | Computing skew occurs. As a result, some statements cannot achieve the optimal performance of the distributed system. | Configure the rules introduced in Exception Rules and circuit breakers to intercept skewed statements in advance (see the exception rule sketch after this table). Optimize such services routinely. | 30% | During peak hours, the CPU of a single node may become overloaded. Because of the cask effect (Liebig's Law of the Minimum), overall cluster performance deteriorates and the other nodes cannot be fully utilized. | Configure the rules introduced in Exception Rules and circuit breakers to preemptively handle skewed statements, and optimize services regularly. |
| I/O usage | Less than 60% | When the active/standby status is unbalanced or a node fails, some nodes may experience I/O overload, leading to performance degradation. | Identify services with high I/O usage from the monitoring data. For details, see Performance Monitoring. You can reduce disk I/O by adding indexes, using partition pruning, and correcting improper row-store/column-store table choices. | 90% | Severe I/O contention can occur, affecting operators such as table scans and degrading overall cluster performance. | Optimize high-I/O statements and stagger peak hours to maintain I/O performance. Plan cluster scale-out in advance to reduce the I/O burden on individual nodes. |
| I/O read/write latency | Less than 400 ms | Performance fluctuates during data reads and writes, leading to unstable query times and occasional performance degradation. | Identify services with high I/O usage from the monitoring data. For details, see Performance Monitoring. You can reduce disk I/O, and thereby the read/write latency, by adding indexes, using partition pruning, and correcting improper row-store/column-store table choices. | 1000 ms | Data read/write performance deteriorates significantly, which can cause real-time data ingestion to back up and affect overall performance. | Optimize statements with high I/O, heavy disk usage, and high concurrency to stagger service peaks and distribute the load more evenly. |
| Dynamic memory usage | Less than 80% | When service traffic increases sharply or complex flexible queries are executed, an error may be reported due to insufficient memory. | Configure exception rules and the memory circuit breaker. Optimize memory-intensive services by referring to Real-Time Queries and Monitoring and Diagnosing Top SQL Statements in a GaussDB(DWS) Cluster. For how to reduce memory usage, see Reducing Memory Usage. | 90% | CCN queuing occurs, errors indicating insufficient memory are reported, and there is a risk of process OOM. | Configure exception rules and the memory circuit breaker. Optimize memory-intensive services by referring to Real-Time Queries and Monitoring and Diagnosing Top SQL Statements in a GaussDB(DWS) Cluster. |
| Disk space usage | Less than 70% | When SQL statements spill to disk, disk usage may exceed 90%, increasing the risk of the cluster entering read-only mode. | Set thresholds for triggering disk flushing, clear data and dirty pages during off-peak hours, and plan for scale-out in advance. For details, see Solution to High Disk Usage and Cluster Read-Only. | 80% | The risk of the cluster entering read-only mode rises further when SQL statements spill to disk. | Set thresholds for triggering disk flushing, clear data and dirty pages during off-peak hours, and plan for scale-out in advance. |
| Disk space skew | Less than 15% | Severe skew occurs during operator computation or when data spills to disk. Workloads are unevenly distributed across DNs, resulting in high disk usage on a single DN and degraded performance. | Check for and handle table skew by referring to Table Diagnosis (see also the inspection queries after this table). | 20% | Disk skew causes CPU, I/O, and memory skew, which affects overall cluster performance and may fill up the disk of a single DN. | Handle table skew by referring to Table Diagnosis. |
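For reference, the following is a minimal sketch of the resource pool isolation recommended above. The pool name `pool_report`, the user `report_user`, and all threshold values are illustrative, and the supported parameters (such as `cpu_limit` and `mem_percent`) vary by GaussDB(DWS) version; check the CREATE RESOURCE POOL section of your version's SQL reference before applying it.

```sql
-- Minimal sketch: isolate a workload in its own resource pool.
-- pool_report / report_user are example names; cpu_limit, mem_percent,
-- and active_statements are assumed to be supported by your version.
CREATE RESOURCE POOL pool_report WITH (
    active_statements = 10,  -- limit concurrent statements in the pool
    mem_percent       = 20,  -- cap the pool's share of dynamic memory
    cpu_limit         = 30   -- hard-limit the pool's CPU usage (%)
);

-- Bind the service account so its statements run under the pool's limits.
ALTER USER report_user RESOURCE POOL 'pool_report';
```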
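Similarly, the sketch below circuit-breaks long-running or heavily skewed statements with an exception rule. It assumes a GaussDB(DWS) version that supports CREATE EXCEPT RULE (8.2.1 or later); the rule name, the thresholds, and the pool binding are examples only.

```sql
-- Minimal sketch: circuit-break statements that run too long or skew heavily.
-- Assumes CREATE EXCEPT RULE is available (GaussDB(DWS) 8.2.1+);
-- rule_skew and all thresholds are illustrative.
CREATE EXCEPT RULE rule_skew WITH (
    ELAPSEDTIME = 1800,       -- abort statements running longer than 30 min
    CPUSKEWPERCENT = 30,      -- treat >= 30% CPU skew as abnormal ...
    QUALIFICATIONTIME = 300,  -- ... once it has persisted for 5 minutes
    ACTION = 'abort'          -- terminate the offending statement
);

-- Attach the rule to a resource pool so it applies to that pool's statements.
ALTER RESOURCE POOL pool_report WITH (EXCEPT_RULE = 'rule_skew');
```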
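The baseline metrics above can also be spot-checked from system views. The queries below are a sketch: the views `pgxc_get_table_skewness` and `pgxc_total_memory_detail` exist in recent GaussDB(DWS) versions, but column names may differ in yours, so verify them first.

```sql
-- Sketch: spot-check disk-space skew and dynamic memory from system views.
-- View and column availability varies by GaussDB(DWS) version; verify first.

-- Largest tables and their skew across DNs (candidates for redistribution).
SELECT schemaname, tablename, totalsize, skewsize, skewratio
FROM pgxc_get_table_skewness
ORDER BY totalsize DESC
LIMIT 10;

-- Dynamic memory usage per node, for comparison against the 80%/90% baseline.
SELECT nodename, memorytype, memorymbytes
FROM pgxc_total_memory_detail
WHERE memorytype IN ('max_dynamic_memory', 'dynamic_used_memory');
```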
GaussDB(DWS) Performance Management Scenarios and Suggestions
This section introduces common performance management scenarios and offers suggestions. During service rollout and routine O&M, you need to thoroughly assess the performance capacity to avoid overloading the cluster.
| Scenario | Performance Risk | Evaluation Method | Suggestion |
| --- | --- | --- | --- |
| New cluster rollout | Before the service rollout, the performance and capacity of the new cluster are uncertain and may not meet the requirements. | Before launching the service, run a pressure test on the cluster and keep the new and old clusters running in parallel for at least one full service period. Thoroughly test key services and links for performance metrics such as QPS, latency, maximum concurrency, and maximum response time to comprehensively evaluate the performance and capacity of the new cluster. | Implement dynamic resource management and allocate service resource pools accordingly by referring to GaussDB(DWS) Resource Load Management. Configure exception rules and circuit breaker parameters in advance. |
| New service rollout | Resource preemption may arise, impacting existing services in the cluster. If new services run concurrently and consume resources improperly, resources can become overloaded and overall performance can decline. | Thoroughly test the new service in a test environment. Based on the test results, estimate the CPU usage, execution time, and number of concurrent services. Analyze the execution plans of the new services to ensure optimal performance. | Roll out a new service only when the cluster's performance capacity is sufficient. Isolate new services with resource pools. Configure circuit breakers appropriately according to the test results, and prepare a rollback solution to swiftly revert services in the event of a fault. |
| Flexible query performance management | Flexible queries can take many SQL forms, and their execution efficiency and resource consumption vary significantly. In extreme cases, one slow SQL statement can degrade the performance of the entire cluster. | Gather statistics on CPU usage, memory usage, execution time, and the number of concurrent queries. For details, see Real-Time Queries. | For users who frequently run flexible queries, allocate separate resource pools that are independent of other services so that CPU and memory can be managed per pool. Configure exception rules and circuit breakers to promptly handle slow SQL statements. Follow the principle of least privilege when granting permissions to these users, and do not use the administrator account as the primary account for flexible queries. |
| Existing service growth | As services grow and more data is generated, the cluster's resource usage increases. If cluster resources are not managed promptly, the cluster may become overloaded. | Regularly collect statistics on metrics such as the dirty page rate, skew rate, ANALYZE time, number of partitions, and resource consumption of existing services. | Inspect the cluster weekly: clear dirty data from tables with a high dirty page rate, and run ANALYZE on tables whose statistics have not been collected in time (see the inspection sketch after this table). |
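As a starting point for the weekly inspection, the following sketch flags tables with a high dirty page rate and tables with stale statistics. The views and columns reflect common GaussDB(DWS)/PostgreSQL catalogs, and the 30% and 7-day thresholds are assumptions to adapt to your environment.

```sql
-- Sketch for a weekly inspection; verify view and column names in your
-- GaussDB(DWS) version, and tune the thresholds to your workload.

-- Tables whose dirty page rate suggests a VACUUM FULL during off-peak hours.
SELECT schemaname, relname, n_dead_tup, dirty_page_rate
FROM pgxc_get_stat_all_tables
WHERE dirty_page_rate > 30
ORDER BY dirty_page_rate DESC;

-- Tables whose statistics have not been refreshed in the last 7 days.
SELECT schemaname, relname, last_analyze
FROM pg_stat_user_tables
WHERE last_analyze IS NULL
   OR last_analyze < now() - interval '7 days';

-- For each flagged table (my_schema.my_table is a placeholder):
-- VACUUM FULL my_schema.my_table;
-- ANALYZE my_schema.my_table;
```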