
DWS_2000000018 Queue Congestion in the Cluster Default Resource Pool

Description

GaussDB(DWS) uses resource pools to control memory, I/O, and CPU resources, manages and allocates resources based on task priorities, and manages user service loads. For details about resource pools, see Resource Pool. When resources are insufficient, some SQL statements have to queue up and wait for other statements to finish. For details, see CCN Queuing Under Dynamic Load Management.

GaussDB(DWS) checks the queue in the default resource pool default_pool every 5 minutes. This alarm is generated when there are SQL statements that are queued up for a long time (20 minutes by default and configurable). This alarm is automatically cleared when the alarm threshold is no longer met.

If blocked SQL statements that can trigger the alarm persist, the alarm is generated again after 24 hours (configurable).
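
To check whether any statements are currently queuing, you can query the activity view that is also used in the handling procedure below; the enqueue values shown are the ones referenced later on this page:

    -- Statements currently waiting in a queue (CCN, global, or resource pool queue).
    SELECT pid, usename, query_start, enqueue, substr(query, 1, 60) AS query
    FROM pgxc_stat_activity
    WHERE enqueue IN ('waiting in ccn queue', 'waiting in global queue', 'waiting in respool queue');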

Attributes

Alarm ID: DWS_2000000018
Alarm Severity: Critical
Auto Clear: Yes

Parameters

Source: Name of the system for which the alarm is generated and the detailed alarm type.

Cluster Name: Name of the cluster for which the alarm is generated.

Location Information: ID and name of the cluster for which the alarm is generated.

Alarm Information: Cloud service for which the alarm is generated, including the service name, resource ID, resource name, first alarm time, and formatted alarm text. Example: CloudServiceDWS, resourceId=xxxx-xxxx-xxxx-xxxx, resourceIdName=test_dws, first_alarm_time:2023-01-11:19:02:09. The default resource pool of cluster test_dws has been blocked in the past 20 minutes.

Time: Time when the alarm was generated.

Status: Current status of the alarm.

Impact on the System

When the default resource pool is blocked, all complex queries (estimated memory greater than or equal to 32 MB) associated with the default resource pool in the cluster may also be blocked. The queries in the queue are woken up only when the running queries are complete.
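
To see which statements fall into this complex-query category, you can list the estimated memory (statement_mem) of the statements currently attached to default_pool, using the same view that is queried in the handling procedure:

    -- Active statements in the default pool, largest estimated memory first.
    SELECT threadid, resource_pool, lane, statement_mem
    FROM pgxc_session_wlmstat
    WHERE attribute != 'Internal' AND resource_pool = 'default_pool'
    ORDER BY statement_mem DESC;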

Possible Causes

  • The estimated query memory usage is too large. As a result, the accumulated estimated memory usage exceeds the upper limit of the dynamic available memory, causing CCN queuing.
  • Contention for shared resources such as CPU and I/O degrades the performance of running queries, so they finish later and keep queued statements waiting.
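
For the first cause, a quick way to gauge how close the cluster is to its dynamic memory limit is to compare the used and maximum dynamic memory per node; this is a sketch assuming the pgxc_total_memory_detail view is available in your version:

    -- Used vs. maximum dynamic memory on each node (values in MB).
    SELECT nodename, memorytype, memorymbytes
    FROM pgxc_total_memory_detail
    WHERE memorytype IN ('dynamic_used_memory', 'max_dynamic_memory')
    ORDER BY nodename, memorytype;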

Handling Procedure

  1. Check whether the queue is caused by too large estimated memory.

    Rectify the fault by referring to CCN Queuing Under Dynamic Load Management.

  2. Check whether the available memory of the cluster is normal.

    1. Log in to the GaussDB(DWS) console.
    2. On the Alarms page, select the current cluster from the cluster selection drop-down list in the upper right corner and view the alarm information of the cluster in the last seven days. Locate the name of the cluster that triggers the alarm based on the location information.
    3. On the Clusters > Dedicated Clusters page, locate the row that contains the cluster for which the alarm is generated and click Monitoring Panel in the Operation column.
    4. Choose Monitoring > Node Monitoring > Overview to view the memory usage of each node in the current cluster. To view the historical memory usage of a node, click the icon on the right to view the usage in the last 1, 3, 12, or 24 hours.

      If the cluster memory usage is low (for example, lower than 50%), the alarm may be generated because the estimated memory usage of queries is too large. In this case, perform the Analyze operation on related tables.
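
      A minimal example of refreshing statistics so that memory estimates become realistic (the table name is a placeholder for the tables involved in the queued queries):

      -- Re-collect optimizer statistics for a table suspected of stale statistics.
      ANALYZE my_schema.my_table;
      -- Or, if the scope is unclear, analyze the whole database (may take longer).
      ANALYZE;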

  3. Check the competition of other resources.

    1. Check the CPU, I/O, and network usage of the cluster by referring to step 2.
    2. If the database is fully loaded, query Real-Time Top SQL and terminate the statements that consume excessive resources, for example with the SQL sketch below.
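
      If you prefer SQL to the console, the sketch below uses pgxc_stat_activity (also queried in step 4) together with the standard pg_terminate_backend function to find and end long-running statements; confirm a statement is safe to end before terminating it:

      -- Longest-running active statements first.
      SELECT pid, usename, query_start, substr(query, 1, 60) AS query
      FROM pgxc_stat_activity
      WHERE state = 'active'
      ORDER BY query_start
      LIMIT 10;

      -- Terminate one statement after confirming it can be safely ended
      -- (replace 139818324021000 with the pid returned above).
      SELECT pg_terminate_backend(139818324021000);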

  4. Check whether too many queries are submitted in a short period of time.

    1. Run the following SQL statement to query the task execution status:
      SELECT 
         s.resource_pool AS rpname, s.node_group, 
         count(1) AS session_cnt, 
         SUM(CASE WHEN a.enqueue = 'waiting in global queue' THEN 1 ELSE 0 END) AS global_wait, 
         SUM(CASE WHEN s.lane= 'fast' AND a.state = 'active' AND (a.enqueue IS NULL OR a.enqueue = 'no waiting queue') THEN 1 ELSE 0 END) AS fast_run, 
         SUM(CASE WHEN s.lane= 'fast' AND a.enqueue = 'waiting in respool queue' THEN 1 ELSE 0 END) AS fast_wait, 
         SUM(CASE WHEN s.lane= 'slow' AND a.state = 'active' AND (a.enqueue IS NULL OR a.enqueue = 'no waiting queue') THEN 1 ELSE 0  END) AS slow_run, 
         SUM(CASE WHEN s.lane= 'slow' AND (a.enqueue = 'waiting in ccn queue' OR a.enqueue = 'waiting in respool queue') THEN 1 ELSE 0  END) AS slow_wait, 
         SUM(CASE WHEN (a.enqueue IS NULL OR a.enqueue = 'no waiting queue') AND a.state = 'active' THEN statement_mem ELSE 0 END) AS est_mem 
      FROM pgxc_session_wlmstat s,pgxc_stat_activity a 
      WHERE s.threadid=a.pid(+) AND s.attribute != 'Internal' 
      GROUP BY 1,2;
      
      The following is an example of the possible execution result of the SQL statement:
          rpname    |  node_group  | session_cnt | global_wait | fast_run | fast_wait | slow_run | slow_wait | est_mem 
      --------------+--------------+-------------+-------------+----------+-----------+----------+-----------+---------
       default_pool | installation |           6 |           0 |        0 |         0 |        0 |         0 |       0
       root         | installation |           1 |           0 |        0 |         0 |        0 |         0 |       0
      (2 rows)
      
    • In the query result, if the value of slow_wait for default_pool is not 0, the cluster is fully loaded because too many jobs were submitted, which triggers the alarm. In this case, locate the row that contains the cluster on the console and click Monitoring Panel in the Operation column. On the displayed page, choose Monitoring > Real-Time Queries to find the query with the longest execution time and terminate it.
    • If the alarm is frequently generated, you are advised to schedule services in off-peak hours or create new resource pools to manage system resources in a more refined manner. For details, see Creating a Resource Pool.
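
      For reference, a resource pool can also be defined in SQL; the pool name, memory share, and concurrency values below are illustrative assumptions, and the console-based procedure in Creating a Resource Pool remains the documented path:

      -- Create a pool with its own memory share and concurrency limit, then bind a user to it.
      CREATE RESOURCE POOL report_pool WITH (mem_percent=20, active_statements=10);
      ALTER USER report_user RESOURCE POOL 'report_pool';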

Alarm Clearance

This alarm is automatically cleared when the resource pool blocking is relieved.

To view historically blocked SQL statements, locate the row that contains the target cluster on the console and click Monitoring Panel in the Operation column. On the displayed page, choose Monitoring > History to query the execution time of historical SQL statements.
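
If top SQL recording is enabled in the cluster (controlled by the enable_resource_record parameter), historical statements can also be queried with SQL; a sketch assuming the pgxc_wlm_session_info view is available:

    -- Finished statements with the longest execution time first.
    SELECT start_time, finish_time, duration, substr(query, 1, 60) AS query
    FROM pgxc_wlm_session_info
    ORDER BY duration DESC
    LIMIT 10;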