DWS_2000000018 Queue Congestion in the Cluster Default Resource Pool
Description
GaussDB(DWS) uses resource pools to control memory, I/O, and CPU resources, allocate resources based on task priorities, and manage user service loads. For details about resource pools, see Resource Pool. When resources are insufficient, some SQL statements must queue and wait for other statements to finish before they can run. For details, see CCN Queuing Under Dynamic Load Management.
GaussDB(DWS) checks the queue in the default resource pool default_pool every 5 minutes. This alarm is generated when an SQL statement has been queued for a long time (20 minutes by default; configurable). The alarm is automatically cleared when the threshold is no longer met.
If blocked SQL statements that can trigger the alarm persist, the alarm is generated again after 24 hours (configurable).
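For a quick manual spot check, the following is a minimal sketch that lists statements currently waiting in the CCN queue for longer than the default 20-minute threshold. It assumes the pgxc_stat_activity columns shown here are available on your cluster version; verify before relying on it.

```sql
-- Hedged sketch: statements queued in the CCN queue for longer than the
-- default 20-minute alarm threshold.
SELECT pid, usename, query_start,
       now() - query_start AS queued_for,
       substr(query, 1, 80) AS query
FROM pgxc_stat_activity
WHERE enqueue = 'waiting in ccn queue'
  AND now() - query_start > interval '20 minutes'
ORDER BY query_start;
```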
Attributes
Alarm ID | Alarm Category | Alarm Severity | Alarm Type | Service Type | Auto Cleared
---|---|---|---|---|---
DWS_2000000018 | Management plane alarm | Critical | Operation alarm | GaussDB(DWS) | Yes
Parameters
Category | Name | Description
---|---|---
Location information | Name | Queue Congestion in the Cluster Default Resource Pool
Location information | Type | Operation alarm
Location information | Generation time | Time when the alarm is generated
Other information | Cluster ID | Cluster details such as resourceId and domain_id
Impact on the System
When the default resource pool is blocked, all complex queries (estimated memory greater than or equal to 32 MB) associated with the default resource pool in the cluster may also be blocked. The queries in the queue are woken up only when the running queries are complete.
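To see which running statements fall into the complex-query lane, the sketch below lists active statements in default_pool at or above the 32 MB threshold. It uses only columns that appear in the handling-procedure query further down, and it assumes statement_mem is reported in MB; confirm the unit on your cluster.

```sql
-- Hedged sketch: statements in default_pool whose estimated memory meets the
-- 32 MB complex-query threshold (assumes statement_mem is reported in MB).
SELECT s.threadid, s.lane, s.statement_mem
FROM pgxc_session_wlmstat s
WHERE s.resource_pool = 'default_pool'
  AND s.statement_mem >= 32
  AND s.attribute != 'Internal';
```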
Possible Causes
- The estimated query memory usage is too large. As a result, the accumulated estimated memory usage exceeds the upper limit of the dynamic available memory, causing CCN queuing.
- Competition for public resources such as CPU and I/O degrades the performance of running queries.
Handling Procedure
- Check whether the queuing is caused by an excessively large estimated memory usage.
Rectify the fault by referring to CCN Queuing Under Dynamic Load Management.
- Check whether the available memory of the cluster is normal.
- Log in to the GaussDB(DWS) console.
- In the navigation pane on the left, choose Management > Alarms, select the current cluster from the cluster drop-down list in the upper right corner, and view the cluster's alarms for the last seven days. Locate the name of the cluster that triggered the alarm based on the location information.
- On the Clusters > Dedicated Clusters page, locate the row that contains the cluster for which the alarm is generated and click Monitoring Panel in the Operation column.
- Choose Monitoring > Node Monitoring > Overview to view the memory usage of each node in the current cluster. To view a node's historical memory usage, use the time range options on the right to display the usage over the last 1, 3, 12, or 24 hours.
If the cluster memory usage is low (for example, below 50%), the alarm may have been generated because the estimated memory usage of the queries is too large. In this case, run ANALYZE on the related tables to refresh optimizer statistics, as shown in the sketch below.
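A minimal example of the ANALYZE operation; sales_orders is a hypothetical table involved in the queued queries.

```sql
-- Refresh optimizer statistics so memory estimation reflects current data;
-- sales_orders is a placeholder for a table used by the queued queries.
ANALYZE VERBOSE sales_orders;
```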
- Check the competition of other resources.
- Check the CPU, I/O, and network usage of the cluster on the monitoring panel described in the previous step.
- If the database is fully loaded, query Real-Time Top SQL and kill the statements that occupy a large amount of resources (see the sketch below).
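As an SQL alternative to the console, the following hedged sketch finds the longest-running active statements via pgxc_stat_activity and then terminates one by pid; pg_terminate_backend is the standard termination function, and 139876 is a placeholder pid.

```sql
-- Hedged sketch: list the longest-running active statements.
SELECT pid, usename, now() - query_start AS runtime,
       substr(query, 1, 80) AS query
FROM pgxc_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC
LIMIT 10;

-- Terminate an offender; 139876 is a placeholder pid from the result above.
SELECT pg_terminate_backend(139876);
```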
- Check whether too many queries are submitted in a short period of time.
- Run the following SQL statement to query the task execution status:
```sql
SELECT s.resource_pool AS rpname,
       s.node_group,
       count(1) AS session_cnt,
       SUM(CASE WHEN a.enqueue = 'waiting in global queue' THEN 1 ELSE 0 END) AS global_wait,
       SUM(CASE WHEN s.lane = 'fast' AND a.state = 'active' AND (a.enqueue IS NULL OR a.enqueue = 'no waiting queue') THEN 1 ELSE 0 END) AS fast_run,
       SUM(CASE WHEN s.lane = 'fast' AND a.enqueue = 'waiting in respool queue' THEN 1 ELSE 0 END) AS fast_wait,
       SUM(CASE WHEN s.lane = 'slow' AND a.state = 'active' AND (a.enqueue IS NULL OR a.enqueue = 'no waiting queue') THEN 1 ELSE 0 END) AS slow_run,
       SUM(CASE WHEN s.lane = 'slow' AND (a.enqueue = 'waiting in ccn queue' OR a.enqueue = 'waiting in respool queue') THEN 1 ELSE 0 END) AS slow_wait,
       SUM(CASE WHEN (a.enqueue IS NULL OR a.enqueue = 'no waiting queue') AND a.state = 'active' THEN statement_mem ELSE 0 END) AS est_mem
FROM pgxc_session_wlmstat s, pgxc_stat_activity a
WHERE s.threadid = a.pid(+)
  AND s.attribute != 'Internal'
GROUP BY 1, 2;
```
The following is an example of a possible execution result:

```
    rpname    |  node_group  | session_cnt | global_wait | fast_run | fast_wait | slow_run | slow_wait | est_mem
--------------+--------------+-------------+-------------+----------+-----------+----------+-----------+---------
 default_pool | installation |           6 |           0 |        0 |         0 |        0 |         0 |       0
 root         | installation |           1 |           0 |        0 |         0 |        0 |         0 |       0
(2 rows)
```
- In the query result, if the slow_wait value for default_pool is not 0, the cluster is fully loaded because too many jobs were submitted, which triggers the alarm. In this case, locate the row that contains the cluster on the console and click Monitoring Panel in the Operation column. On the displayed page, choose Monitoring > Real-Time Queries, find the task with the longest execution time, and kill it.
- If the alarm is generated frequently, you are advised to schedule services during off-peak hours or to create new resource pools for more fine-grained management of system resources. For details, see Creating a Resource Pool; a hedged SQL sketch follows this list.
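The sketch below carves out a dedicated pool and binds a user to it so that user's statements no longer queue in default_pool. The parameter names follow the GaussDB(DWS) CREATE RESOURCE POOL syntax as understood here, and report_pool and report_user are hypothetical; verify against Creating a Resource Pool before use.

```sql
-- Hedged sketch: a dedicated pool with its own concurrency and memory share.
CREATE RESOURCE POOL report_pool WITH (active_statements = 10, mem_percent = 20);

-- Bind a user to the pool (report_user is a placeholder).
ALTER USER report_user RESOURCE POOL 'report_pool';
```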
Alarm Clearance
This alarm is automatically cleared when the resource pool congestion is relieved.
To view historically blocked SQL statements, locate the row that contains the target cluster on the console and click Monitoring Panel in the Operation column. On the displayed page, choose Monitoring > History to query the execution time of historical SQL statements (an equivalent SQL sketch follows).
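As an SQL alternative to the console history page, the following hedged sketch lists recently completed statements ordered by execution time. It assumes the pgxc_wlm_session_info view is available on your cluster and that duration is reported in milliseconds; verify both before use.

```sql
-- Hedged sketch: recent completed statements ordered by execution time
-- (assumes pgxc_wlm_session_info exists and duration is in milliseconds).
SELECT username, start_time, finish_time, duration,
       substr(query, 1, 80) AS query
FROM pgxc_wlm_session_info
ORDER BY duration DESC
LIMIT 10;
```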