Updated on 2024-08-26 GMT+08:00

DWS_2000000018 Queue Congestion in the Cluster Default Resource Pool

Description

GaussDB(DWS) uses resource pools to control memory, I/O, and CPU resources, manages and allocates resources based on task priorities, and manages user service loads. For details about resource pools, see Resource Pool. When resources are insufficient, some SQL statements have to queue and wait for other statements to finish executing. For details, see CCN Queuing Under Dynamic Load Management.

GaussDB(DWS) checks the queue of the default resource pool default_pool every 5 minutes. This alarm is generated when SQL statements have been queued for longer than a threshold (20 minutes by default, configurable). This alarm is automatically cleared when the alarm threshold is no longer met.

If blocked SQL statements that can trigger the alarm persist, the alarm is generated again after 24 hours (configurable).

Attributes

 Alarm ID       | Alarm Category         | Alarm Severity | Alarm Type      | Service Type | Auto Cleared
----------------+------------------------+----------------+-----------------+--------------+--------------
 DWS_2000000018 | Management plane alarm | Critical       | Operation alarm | GaussDB(DWS) | Yes

Parameters

 Category             | Name            | Description
----------------------+-----------------+--------------------------------------------------------
 Location information | Name            | Queue Congestion in the Default Cluster Resource Pool
 Location information | Type            | Operation alarm
 Location information | Generation time | Time when the alarm is generated
 Other information    | Cluster ID      | Cluster details such as resourceId and domain_id

Impact on the System

When the default resource pool is blocked, all complex queries (estimated memory greater than or equal to 32 MB) associated with the default resource pool in the cluster may also be blocked. The queries in the queue are woken up only when the running queries are complete.

Possible Causes

  • The estimated query memory usage is too large. As a result, the accumulated estimated memory usage exceeds the upper limit of the dynamic available memory, causing CCN queuing.
  • Competition for public resources such as CPU and I/O deteriorates the performance of running queries.

Handling Procedure

  1. Check whether the queue is caused by too large estimated memory.

    Rectify the fault by referring to CCN Queuing Under Dynamic Load Management.
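
    For a quick check from the database side, the following sketch lists the statements currently queued in the CCN together with their estimated memory. It is only a sketch that reuses the views and columns from the query in step 4; adjust it to your cluster version if needed.

      -- Statements waiting in the CCN queue and their estimated memory (statement_mem).
      -- If statement_mem is far larger than the queries actually need,
      -- over-estimation is the likely cause of the queuing.
      SELECT s.resource_pool, s.lane, s.statement_mem, a.pid, a.enqueue
      FROM pgxc_session_wlmstat s, pgxc_stat_activity a
      WHERE s.threadid = a.pid
        AND a.enqueue = 'waiting in ccn queue';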

  2. Check whether the available memory of the cluster is normal.

    1. Log in to the GaussDB(DWS) console.
    2. In the navigation pane on the left, choose Management > Alarms. Select the current cluster from the cluster drop-down list in the upper right corner and view the cluster's alarms in the last seven days. Based on the location information, identify the name of the cluster that triggered the alarm.
    3. On the Clusters > Dedicated Clusters page, locate the row that contains the cluster for which the alarm is generated and click Monitoring Panel in the Operation column.
    4. Choose Monitoring > Node Monitoring > Overview to view the memory usage of each node in the current cluster. To view a node's historical memory usage, click the icon on the right and display the usage in the last 1, 3, 12, or 24 hours.

      If the cluster memory usage is low (for example, lower than 50%), the alarm may have been generated because the estimated memory usage of queries is too large. In this case, run ANALYZE on the related tables to refresh their statistics.
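
      For example, a minimal sketch (the table name below is a placeholder for the tables referenced by the queued queries):

        -- Refresh optimizer statistics so that memory estimation becomes accurate again.
        ANALYZE fact_sales;   -- placeholder table name; repeat for each related table
        -- Or collect statistics for every table in the current database:
        ANALYZE;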

  3. Check the competition of other resources.

    1. Check the CPU, I/O, and network usage of the cluster by referring to step 2.
    2. If the database is fully loaded, query Real-Time Top SQL and kill the statements that occupy a large number of resources.
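
      If you prefer to do this from the database, the sketch below first lists active statements and then terminates one session by its pid. The pid value shown is a placeholder; pg_cancel_backend() cancels only the running statement, while pg_terminate_backend() ends the whole session.

        -- List active statements and their pids (columns as used elsewhere in this section).
        SELECT pid, state, enqueue
        FROM pgxc_stat_activity
        WHERE state = 'active' AND pid <> pg_backend_pid();

        -- Terminate the chosen session by its pid (the value below is only a placeholder).
        SELECT pg_terminate_backend(139974249367296);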

  4. Check whether too many queries are submitted in a short period of time.

    1. Run the following SQL statement to query the task execution status:
      -- For each resource pool: session count, sessions waiting in the global
      -- queue, fast/slow-lane sessions running or queued, and the estimated
      -- memory (statement_mem) of running statements.
      SELECT 
         s.resource_pool AS rpname, s.node_group, 
         count(1) AS session_cnt, 
         SUM(CASE WHEN a.enqueue = 'waiting in global queue' THEN 1 ELSE 0 END) AS global_wait, 
         SUM(CASE WHEN s.lane= 'fast' AND a.state = 'active' AND (a.enqueue IS NULL OR a.enqueue = 'no waiting queue') THEN 1 ELSE 0 END) AS fast_run, 
         SUM(CASE WHEN s.lane= 'fast' AND a.enqueue = 'waiting in respool queue' THEN 1 ELSE 0 END) AS fast_wait, 
         SUM(CASE WHEN s.lane= 'slow' AND a.state = 'active' AND (a.enqueue IS NULL OR a.enqueue = 'no waiting queue') THEN 1 ELSE 0  END) AS slow_run, 
         SUM(CASE WHEN s.lane= 'slow' AND (a.enqueue = 'waiting in ccn queue' OR a.enqueue = 'waiting in respool queue') THEN 1 ELSE 0  END) AS slow_wait, 
         SUM(CASE WHEN (a.enqueue IS NULL OR a.enqueue = 'no waiting queue') AND a.state = 'active' THEN statement_mem ELSE 0 END) AS est_mem 
      FROM pgxc_session_wlmstat s,pgxc_stat_activity a 
      WHERE s.threadid=a.pid(+) AND s.attribute != 'Internal' 
      GROUP BY 1,2;
      
      The following is an example of the possible execution result of the SQL statement:
          rpname    |  node_group  | session_cnt | global_wait | fast_run | fast_wait | slow_run | slow_wait | est_mem 
      --------------+--------------+-------------+-------------+----------+-----------+----------+-----------+---------
       default_pool | installation |           6 |           0 |        0 |         0 |        0 |         0 |       0
       root         | installation |           1 |           0 |        0 |         0 |        0 |         0 |       0
      (2 rows)
      
    • In the query result, if the value of slow_wait for default_pool is not 0, too many jobs have fully loaded the cluster and triggered the alarm. In this case, locate the row that contains the cluster on the console and click Monitoring Panel in the Operation column. On the displayed page, choose Monitoring > Real-Time Queries, find the query with the longest execution time, and kill it.
    • If the alarm is frequently generated, you are advised to schedule services during off-peak hours or to create new resource pools for more fine-grained management of system resources. For details, see Creating a Resource Pool.
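
      The following is a minimal SQL sketch of creating such a resource pool and binding a user to it. The pool name, user name, and limits are placeholders; the console procedure in Creating a Resource Pool is the documented way to do this.

        -- Create a resource pool with its own memory and concurrency limits.
        CREATE RESOURCE POOL batch_pool WITH (mem_percent=20, active_statements=10);
        -- Route the jobs of a user to the new pool instead of default_pool.
        ALTER USER batch_user RESOURCE POOL 'batch_pool';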

Alarm Clearance

This alarm is automatically cleared when the resource pool blocking is relieved.

To view historically blocked SQL statements, locate the row that contains the target cluster on the console and click Monitoring Panel in the Operation column. On the displayed page, choose Monitoring > History to query the execution time of historical SQL statements.
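
If you also want to check from the database side, the following sketch lists the longest-running recorded statements. It assumes that Top SQL recording is enabled and that the pgxc_wlm_session_history view is available in your cluster version; column names may differ.

  -- Ten longest historical statements recorded by the workload manager.
  SELECT start_time, finish_time, duration, query
  FROM pgxc_wlm_session_history
  ORDER BY duration DESC
  LIMIT 10;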