
DWS_2000000018 Queue Congestion in the Cluster Default Resource Pool

Description

GaussDB(DWS) uses resource pools to control memory, I/O, and CPU resources, manages and allocates resources based on task priorities, and manages user service loads. For details about resource pools, see Resource Pool. When resources are insufficient, some SQL statements have to queue up and wait for other statements to finish. For details, see CCN Queuing Under Dynamic Load Management.

GaussDB(DWS) checks the queue in the default resource pool default_pool every 5 minutes. This alarm is generated when there are SQL statements that are queued up for a long time (20 minutes by default and configurable). This alarm is automatically cleared when the alarm threshold is no longer met.

If blocked SQL statements that can trigger the alarm persist, the alarm is generated again after 24 hours (configurable).
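
To check whether any statements are currently queuing, you can query the activity view that is also used in the handling procedure below; the enqueue values shown are the ones referenced later on this page:

    -- Statements currently waiting in a queue (CCN, global, or resource pool queue).
    SELECT pid, usename, query_start, enqueue, substr(query, 1, 60) AS query
    FROM pgxc_stat_activity
    WHERE enqueue IN ('waiting in ccn queue', 'waiting in global queue', 'waiting in respool queue');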

Attributes

Alarm ID: DWS_2000000018
Alarm Severity: Critical
Auto Clear: Yes

Parameters

Source: Name of the system for which the alarm is generated and the detailed alarm type.

Cluster Name: Name of the cluster for which the alarm is generated.

Location Information: ID and name of the cluster for which the alarm is generated.

Alarm Information: Cloud service for which the alarm is generated, including the service name, resource ID, resource name, first alarm time, and formatted alarm text. Example: CloudServiceDWS, resourceId=xxxx-xxxx-xxxx-xxxx, resourceIdName=test_dws, first_alarm_time:2023-01-11:19:02:09. The default resource pool of cluster test_dws has been blocked in the past 20 minutes.

Time: Time when the alarm was generated.

Status: Current status of the alarm.

Impact on the System

When the default resource pool is blocked, all complex queries (estimated memory greater than or equal to 32 MB) associated with the default resource pool in the cluster may also be blocked. The queries in the queue are woken up only when the running queries are complete.
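
To see which statements fall into this complex-query category, you can list the estimated memory (statement_mem) of the statements currently attached to default_pool, using the same view that is queried in the handling procedure:

    -- Active statements in the default pool, largest estimated memory first.
    SELECT threadid, resource_pool, lane, statement_mem
    FROM pgxc_session_wlmstat
    WHERE attribute != 'Internal' AND resource_pool = 'default_pool'
    ORDER BY statement_mem DESC;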

Possible Causes

  • The estimated query memory usage is too large. As a result, the accumulated estimated memory usage exceeds the upper limit of the dynamic available memory, causing CCN queuing.
  • Contention for shared resources such as CPU and I/O degrades the performance of running queries, so they finish later and keep queued statements waiting.
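
For the first cause, a quick way to gauge how close the cluster is to its dynamic memory limit is to compare the used and maximum dynamic memory per node; this is a sketch assuming the pgxc_total_memory_detail view is available in your version:

    -- Used vs. maximum dynamic memory on each node (values in MB).
    SELECT nodename, memorytype, memorymbytes
    FROM pgxc_total_memory_detail
    WHERE memorytype IN ('dynamic_used_memory', 'max_dynamic_memory')
    ORDER BY nodename, memorytype;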

Handling Procedure

  1. Check whether the queue is caused by too large estimated memory.

    Rectify the fault by referring to CCN Queuing Under Dynamic Load Management.

  2. Check whether the available memory of the cluster is normal.

    1. Log in to the GaussDB(DWS) console.
    2. On the Alarms page, select the current cluster from the cluster selection drop-down list in the upper right corner and view the alarm information of the cluster in the last seven days. Locate the name of the cluster that triggers the alarm based on the location information.
    3. On the Clusters > Dedicated Clusters page, locate the row that contains the cluster for which the alarm is generated and click Monitoring Panel in the Operation column.
    4. Choose Monitoring > Node Monitoring > Overview to view the memory usage of each node in the current cluster. To view the historical memory usage of a node, click the icon on the right to view the usage in the last 1, 3, 12, or 24 hours.

      If the cluster memory usage is low (for example, lower than 50%), the alarm may be generated because the estimated memory usage of queries is too large. In this case, perform the Analyze operation on related tables.
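
      A minimal example of refreshing statistics so that memory estimates become realistic (the table name is a placeholder for the tables involved in the queued queries):

      -- Re-collect optimizer statistics for a table suspected of stale statistics.
      ANALYZE my_schema.my_table;
      -- Or, if the scope is unclear, analyze the whole database (may take longer).
      ANALYZE;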

  3. Check the competition of other resources.

    1. Check the CPU, I/O, and network usage of the cluster by referring to step 2.
    2. If the database is fully loaded, query Real-Time Top SQL and terminate the statements that consume excessive resources, for example with the SQL sketch below.
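
      If you prefer SQL to the console, the sketch below uses pgxc_stat_activity (also queried in step 4) together with the standard pg_terminate_backend function to find and end long-running statements; confirm a statement is safe to end before terminating it:

      -- Longest-running active statements first.
      SELECT pid, usename, query_start, substr(query, 1, 60) AS query
      FROM pgxc_stat_activity
      WHERE state = 'active'
      ORDER BY query_start
      LIMIT 10;

      -- Terminate one statement after confirming it can be safely ended
      -- (replace 139818324021000 with the pid returned above).
      SELECT pg_terminate_backend(139818324021000);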

  4. Check whether too many queries are submitted in a short period of time.

    1. Run the following SQL statement to query the task execution status:
      SELECT 
         s.resource_pool AS rpname, s.node_group, 
         count(1) AS session_cnt, 
         SUM(CASE WHEN a.enqueue = 'waiting in global queue' THEN 1 ELSE 0 END) AS global_wait, 
         SUM(CASE WHEN s.lane= 'fast' AND a.state = 'active' AND (a.enqueue IS NULL OR a.enqueue = 'no waiting queue') THEN 1 ELSE 0 END) AS fast_run, 
         SUM(CASE WHEN s.lane= 'fast' AND a.enqueue = 'waiting in respool queue' THEN 1 ELSE 0 END) AS fast_wait, 
         SUM(CASE WHEN s.lane= 'slow' AND a.state = 'active' AND (a.enqueue IS NULL OR a.enqueue = 'no waiting queue') THEN 1 ELSE 0  END) AS slow_run, 
         SUM(CASE WHEN s.lane= 'slow' AND (a.enqueue = 'waiting in ccn queue' OR a.enqueue = 'waiting in respool queue') THEN 1 ELSE 0  END) AS slow_wait, 
         SUM(CASE WHEN (a.enqueue IS NULL OR a.enqueue = 'no waiting queue') AND a.state = 'active' THEN statement_mem ELSE 0 END) AS est_mem 
      FROM pgxc_session_wlmstat s,pgxc_stat_activity a 
      WHERE s.threadid=a.pid(+) AND s.attribute != 'Internal' 
      GROUP BY 1,2;
      
      The following is an example of the possible execution result of the SQL statement:
          rpname    |  node_group  | session_cnt | global_wait | fast_run | fast_wait | slow_run | slow_wait | est_mem 
      --------------+--------------+-------------+-------------+----------+-----------+----------+-----------+---------
       default_pool | installation |           6 |           0 |        0 |         0 |        0 |         0 |       0
       root         | installation |           1 |           0 |        0 |         0 |        0 |         0 |       0
      (2 rows)
      
    • In the query result, if the value of slow_wait for default_pool is not 0, the cluster is fully loaded because too many jobs were submitted, which triggers the alarm. In this case, locate the row that contains the cluster on the console and click Monitoring Panel in the Operation column. On the displayed page, choose Monitoring > Real-Time Queries to find the query with the longest execution time and terminate it.
    • If the alarm is frequently generated, you are advised to schedule services in off-peak hours or create new resource pools to manage system resources in a more refined manner. For details, see Creating a Resource Pool.
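
      For reference, a resource pool can also be defined in SQL; the pool name, memory share, and concurrency values below are illustrative assumptions, and the console-based procedure in Creating a Resource Pool remains the documented path:

      -- Create a pool with its own memory share and concurrency limit, then bind a user to it.
      CREATE RESOURCE POOL report_pool WITH (mem_percent=20, active_statements=10);
      ALTER USER report_user RESOURCE POOL 'report_pool';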

Alarm Clearance

This alarm is automatically cleared when the resource pool blocking is relieved.

To view historically blocked SQL statements, locate the row that contains the target cluster on the console and click Monitoring Panel in the Operation column. On the displayed page, choose Monitoring > History to query the execution time of historical SQL statements.
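
If top SQL recording is enabled in the cluster (controlled by the enable_resource_record parameter), historical statements can also be queried with SQL; a sketch assuming the pgxc_wlm_session_info view is available:

    -- Finished statements with the longest execution time first.
    SELECT start_time, finish_time, duration, substr(query, 1, 60) AS query
    FROM pgxc_wlm_session_info
    ORDER BY duration DESC
    LIMIT 10;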