Updated on 2025-09-25 GMT+08:00

Alarm Management

Overview

Alarm management lets you view and configure alarm rules and subscribe to alarm notifications. The Alarms page displays alarm statistics and details for the past week so that you can review tenant alarms. DWS provides a set of default alarm rules, and you can modify the alarm thresholds to suit your own services. DWS alarm notifications are sent through the SMN service.

  • This feature is supported only in cluster version 8.1.1.200 and later.
  • Currently, alarms cannot be categorized and managed by enterprise project.

Visiting the Alarms Page

  1. Log in to the DWS console.
  2. In the navigation tree on the left, choose Monitoring > Alarm.
  3. On the page that is displayed:

    • Existing Alarm Statistics

      Statistics of the existing alarms in the past seven days are displayed by alarm severity in a bar chart. In this way, you can see clearly the number and category of the alarms generated in the past week.

    • Today's Alarms

      Statistics of the existing alarms on the current day are displayed by alarm severity in a list. In this way, you can see clearly the number and category of the unhandled alarms generated on the day.

    • Alarm Details

      Details about all alarms in the past seven days, both handled and unhandled, are displayed in a table to help you quickly locate faults. The details include the alarm name, alarm severity, alarm source, cluster name, location, description, generation date, and status.

    The alarm data displayed (a maximum of 30 days) is supported by the Event Service microservice.

Alarms

The alarm policy is triggered based on the current configuration.

Table 1 Alarms

Alarm Name

Alarm Severity

Default Alarm Threshold

Alarm Description (Calculation Method)

Active Session Usage in a DWS Cluster Exceeds the Threshold

Major

80%

This major alarm is generated by the DMS alarm module if the session usage (number of active SQL statements in real-time top SQL statements/max_active_statements) in the cluster goes beyond 80% (configurable) within a specific period and the suppression conditions are not met. The alarm will be cleared once the session usage in the cluster drops below the threshold.

Active Session Usage in a DWS Cluster Exceeds the Threshold

Critical

90%

This critical alarm is generated by the DMS alarm module if the session usage (number of active SQL statements in real-time top SQL statements/max_active_statements) in the cluster goes beyond 90% (configurable) within a specific time frame and the suppression conditions are not met. The alarm will be cleared once the session usage in the cluster drops below the threshold.

DWS Audit Log Dump Exception

Warning

-

After the log dump function is enabled in security settings, this alarm is generated when an exception occurs during periodic log dump.

Schema Usage of the DWS Cluster Exceeds the Threshold

Warning

-

This warning alarm is generated by the DMS alarm module if the schema usage (obtained from pgxc_total_schema_info) in the cluster goes beyond the set threshold within a specific time frame and the suppression conditions are not met. The alarm will be cleared once the schema usage in the cluster drops below the threshold.

Schema Usage of the DWS Cluster Exceeds the Threshold

Minor

-

This minor alarm is generated by the DMS alarm module if the schema usage (obtained from pgxc_total_schema_info) in the cluster goes beyond the set threshold within a specific time frame and the suppression conditions are not met. The alarm will be cleared once the schema usage in the cluster drops below the threshold.

Schema Usage of the DWS Cluster Exceeds the Threshold

Major

-

This major alarm is generated by the DMS alarm module if the schema usage (obtained from pgxc_total_schema_info) in the cluster goes beyond the set threshold within a specific time frame and the suppression conditions are not met. The alarm will be cleared once the schema usage in the cluster drops below the threshold.

Schema Usage of the DWS Cluster Exceeds the Threshold

Critical

80%

This critical alarm is generated by the DMS alarm module if the schema usage (obtained from pgxc_total_schema_info) in the cluster goes beyond 80% (configurable) within a specific time frame and the suppression conditions are not met. The alarm will be cleared once the schema usage in the cluster drops below the threshold.

Remaining DWS Database Disk Capacity Is Insufficient

Critical

90%

This alarm is generated when the disk or inode usage (set by parameter datastorage_threshold_value_check) of a cluster instance reaches 90%, and the cluster is marked as read-only. This alarm is cleared when the disk or inode usage of the cluster instance is lower than 90%.

Disk Usage of a DWS Cluster Resource Pool Exceeds the Threshold

Major

80%

This major alarm is generated by the DMS alarm module if the disk usage of the cluster resource pool (obtained from pg_resource_pool and disk_usage) goes beyond 80% (configurable) within a specific time frame and the suppression conditions are not met. The alarm will be cleared once the disk usage of the cluster resource pool drops below the threshold.

Disk Usage of a DWS Cluster Resource Pool Exceeds the Threshold

Critical

90%

This critical alarm is generated by the DMS alarm module if the disk usage of the cluster resource pool (obtained from pg_resource_pool and disk_usage) goes beyond 90% (configurable) within a specific time frame and the suppression conditions are not met. The alarm will be cleared once the disk usage of the cluster resource pool drops below the threshold.

Number of Database Deadlocks in a DWS Cluster Exceeds the Threshold

Major

1

This major alarm is generated by the DMS alarm module if the number of deadlocks in the cluster database (obtained from global_stat_database and deadlocks) exceeds 1 (configurable) within a specific time frame and the suppression conditions are not met. The alarm will be cleared once the number of deadlocks in the cluster database drops below the threshold.

Number of Database Deadlocks in a DWS Cluster Exceeds the Threshold

Critical

10

This critical alarm is generated by the DMS alarm module if the number of deadlocks in the cluster database (obtained from global_stat_database and deadlocks) exceeds 10 (configurable) within a specific time frame and the suppression conditions are not met. The alarm will be cleared once the number of deadlocks in the cluster database drops below the threshold.

Imbalanced DWS Cluster Load

Critical

-

This alarm is generated when the primary/standby relationship of instances in a cluster changes to be different from that during initial installation of the cluster.

SQL on Hudi Tasks in a DWS Cluster Fail

Critical

-

gs_scheduler in DWS periodically starts SQL on Hudi tasks to synchronize data between user internal tables and Hudi foreign tables. gs_scheduler reads the what field in scheduler.pg_task and executes the SQL statement as super administrator No. 10. This alarm is generated when the SQL statement fails more than three consecutive times, and is automatically cleared when the SQL statement is executed successfully.

Node Data Disk I/O Usage Exceeds the Threshold

Critical

90%

DWS collects the data disk I/O usage of each cluster node every 30 seconds. This alarm is generated when the average usage of a data disk on a node exceeds 90% (configurable) in the last 10 minutes (configurable), and is automatically cleared when the average usage drops below 85% (alarm threshold minus 5%).

Node Data Disk Usage Exceeds the Threshold

Major

80%

DWS collects the usage of all disks on each node in a cluster every 30 seconds.

If the maximum disk usage in the last 10 minutes (configurable) exceeds 80% (configurable), a major alarm is reported. If the average disk usage is lower than 75% (that is, the alarm threshold minus 5%), this major alarm is cleared.

Node Data Disk Usage Exceeds the Threshold

Critical

88%

DWS collects the usage of all disks on each node in a cluster every 30 seconds.

If the maximum disk usage in the last 10 minutes (configurable) exceeds 88% (configurable), a critical alarm is reported. If the average disk usage is lower than 83% (that is, the alarm threshold minus 5%), this critical alarm is cleared.

CN Disk Capacity in a DWS Cluster Exceeds the Threshold

Critical

5000 MB

This alarm is generated when the DMS alarm module detects that the amount of data written by a CN instance (with a specified ID) in a DWS cluster to disks exceeds the threshold in a specified period and the suppression conditions are not met. This alarm is cleared when the DMS alarm module detects that the amount of data written by a CN instance in a DWS cluster to disks is lower than the threshold.

Session Usage in a DWS Cluster Exceeds the Threshold

Major

80%

This major alarm is generated by the DMS alarm module if the session usage (number of non-system query SQL statements in real-time top SQL statements/max_connections) in the cluster goes beyond 80% (configurable) within a specific time frame and the suppression conditions are not met. The alarm will be cleared once the session usage in the cluster drops below the threshold.

Session Usage in a DWS Cluster Exceeds the Threshold

Critical

90%

This critical alarm is generated by the DMS alarm module if the session usage (number of non-system query SQL statements in top SQL statements/max_connections) in the cluster goes beyond 90% (configurable) within a specific time frame and the suppression conditions are not met. The alarm will be cleared once the session usage in the cluster drops below the threshold.

Instance Memory Usage of a Cluster Node Exceeds the Threshold

Major

80%

DWS collects the instance memory usage of each node in a cluster every 60 seconds. If a node's instance memory usage (100 x process_used_memory/max_process_memory, in which process_used_memory and max_process_memory can be obtained from PV_TOTAL_MEMORY_DETAIL) exceeds 80% (configurable), an alarm is reported indicating that the threshold has been surpassed. The alarm will be cleared if the average memory usage falls below 75% (5% below the reporting threshold).

Instance Memory Usage of a Cluster Node Exceeds the Threshold

Critical

90%

DWS collects the instance memory usage of each node in a cluster every 60 seconds. If a node's instance memory usage (100 x process_used_memory/max_process_memory, in which process_used_memory and max_process_memory can be obtained from PV_TOTAL_MEMORY_DETAIL) exceeds 90% (configurable), an alarm is reported indicating that the threshold has been surpassed. The alarm will be cleared if the average memory usage falls below 85% (5% below the reporting threshold).

Remaining DWS Database Disk Capacity Warning

Major

80%

This alarm is generated when the disk usage or inode usage of a cluster instance is greater than or equal to 80%. This alarm is cleared when the disk usage or inode usage of a cluster instance is lower than 80%.

Abnormal DWS Cluster

Critical

-

This alarm is generated when DWSHAMonitor detects that the cluster status is abnormal three consecutive times.

Long SQL Probe Execution Duration in a Cluster

Critical

-

DWS collects the execution status of the SQL probe on each node in the cluster every 30 seconds. If the execution duration of an SQL probe on a server in a cluster exceeds twice the threshold (or another user-defined value), a critical alarm is generated. If the execution duration of all SQL probes falls below the threshold, the critical alarm is cleared.

DWS Cluster Restoration Failure

Critical

-

The kernel reports the restoration result each time a restoration is complete. This alarm is generated when a restoration failure is reported, and is cleared when the system detects the next backup and restoration.

Abnormal Node

Critical

-

This alarm is generated when DWSHAMonitor detects that the node status is abnormal three consecutive times.

Node Data Disk Latency Exceeds the Threshold

Major

400 ms

DWS collects the data disk latency of each node in the cluster every 30 seconds. This alarm is generated when the average latency of a data disk on a node exceeds 400 ms (configurable) in the last 10 minutes (configurable), and is automatically cleared when the average latency drops below 400 ms.

A Vacuum Full Operation That Holds a Table Lock for A Long Time Exists in the Cluster

Major

20 minutes

VACUUM FULL holds a level-8 lock on a table. If it holds the lock on a table for longer than 20 minutes (or another user-defined value), a major alarm is reported, indicating that the VACUUM FULL operation holds a lock for too long in the cluster. This major alarm is cleared when VACUUM FULL is complete.

Remaining DWS Database Disk Capacity Is Severely Insufficient

Critical

-

This alarm is generated when the disk or inode usage of a cluster instance is greater than or equal to 95%. This alarm is cleared when the disk or inode usage of the cluster instance is lower than 95%.

Pre-occupied Nodes Are Not Deleted During the Rollback After Restoration of Segment-based Warm Backup of a Yearly/Monthly Cluster

Major

-

After the segment-based warm backup of a yearly/monthly cluster is restored, a rollback will be performed. The pre-occupied idle nodes need to be deleted on the console. This alarm is generated if these nodes are not deleted.

Queue Congestion in the Default Cluster Resource Pool

Critical

-

DWS checks the queue in the default resource pool default_pool every 5 minutes. This alarm is generated when there are SQL statements that are queued for 20 minutes (default value, which is configurable) (real-time top SQL statements whose BLOCK_TIME exceeds 10,000). This alarm is automatically cleared when the alarm threshold is no longer met.

Node CPU Usage Exceeds the Threshold

Critical

95%

DWS collects the CPU usage of each node in a cluster every 30 seconds. If the average CPU usage of a node in the last 10 minutes (configurable) exceeds 95% (configurable), an alarm is reported indicating that the node CPU usage exceeds the threshold. If the average usage is lower than 90% (that is, the reporting threshold minus 5%), the alarm is cleared.

DWS Data Instance Connections Exceed the Threshold

Major

90%

This alarm is generated when the number of connections to a CN divided by max_connections is greater than connection_alarm_rate. max_connections and connection_alarm_rate are DWS parameters, and connection_alarm_rate is 0.9 by default.

Dynamic Memory Usage of a Cluster Node Exceeds the Threshold

Major

80%

DWS collects the dynamic memory usage of each node in a cluster every 60 seconds. If a node's dynamic memory usage (100 x dynamic_used_memory/max_dynamic_memory, in which dynamic_used_memory and max_dynamic_memory can be obtained from PV_TOTAL_MEMORY_DETAIL) exceeds 80% (configurable), an alarm is reported indicating that the threshold has been surpassed. The alarm will be cleared if the average memory usage falls below 75% (5% below the reporting threshold).

Dynamic Memory Usage of a Cluster Node Exceeds the Threshold

Critical

90%

DWS collects the dynamic memory usage of each node in a cluster every 60 seconds. If a node's dynamic memory usage (100 x dynamic_used_memory/max_dynamic_memory, in which dynamic_used_memory and max_dynamic_memory can be obtained from PV_TOTAL_MEMORY_DETAIL) exceeds 90% (configurable), an alarm is reported indicating that the threshold has been surpassed. The alarm will be cleared if the average memory usage falls below 85% (5% below the reporting threshold).

DWS Cluster Backup Failure

Major

-

The kernel reports the backup result after each backup. This alarm is generated when a backup failure is detected, and is cleared when the next backup is successful.

Data Spilled to Disks of the Query Statement Exceeds the Threshold

Critical

5 GB

This alarm is generated when the DMS alarm module detects that the amount of data written by SQL statements to disks (MAX_SPILL_SIZE of real-time top SQL statements/1,024) exceeds 5 GB (configurable) in 10 minutes (configurable). This alarm is cleared when there are no SQL statements that meet the alarm condition in the cluster. For details about how to modify alarm configurations, see Modifying Alarm Rules.

Failed to Invoke DWS OpenAPI

Critical

-

This alarm is generated when an unknown exception occurs during invocation of DMS OpenAPI.

Database Session Usage of the DWS Cluster Exceeds the Threshold

Major

80%

This major alarm is generated by the DMS alarm module if the database session usage (number of sessions of real-time top SQL statements/datconnlimit) in the cluster goes beyond 80% (configurable) within a specific time frame and the suppression conditions are not met. The alarm will be cleared once the database session usage in the cluster drops below the threshold.

Database Session Usage of the DWS Cluster Exceeds the Threshold

Critical

90%

This critical alarm is generated by the DMS alarm module if the database session usage (number of sessions of real-time top SQL statements/datconnlimit) in the cluster goes beyond 90% (configurable) within a specific time frame and the suppression conditions are not met. The alarm will be cleared once the database session usage in the cluster drops below the threshold.

Failed to Create Nodes After Cluster Restoration

Major

-

This alarm is generated when the nodes fail to be added to a restored cluster.

Number of Queuing Query Statements Exceeds the Threshold

Critical

10

This alarm is generated when the number of queuing SQL statements in the cluster (number of real-time top SQL statements whose BLOCK_TIME exceeds 5,000) exceeds 10 (configurable) within 10 minutes (configurable), and is automatically cleared when the number of queuing SQL statements drops below 10.

Failed to Back Up Some Tables in a DWS Cluster

Minor

-

The current version does not support fine-grained backup and restoration for online services. If a table is part of online services and is modified during a backup, the backup of that table may fail while the backups of other tables remain unaffected. A minor alarm is generated in such cases.

For example, if the ALTER TABLE A ADD COLUMN a INT statement is executed on table A during the backup, the definition of table A changes. In this case, the definition of table A that is backed up may be inconsistent with the data of table A. For security purposes, the backup cannot be used to restore table A.

However, subsequent successful backups can be used to restore tables. Failure to back up certain historical tables does not impact future backups.
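
Several rows in Table 1 (for example, Node Data Disk I/O Usage Exceeds the Threshold) describe the same pattern: a metric is sampled every 30 seconds, the alarm is raised when the average over the last 10 minutes exceeds the threshold, and the alarm is cleared only when the average drops below the threshold minus 5%. The following Python sketch illustrates that pattern only. The class WindowedAlarm and its parameters are hypothetical and are not part of DWS or DMS. Suppression conditions are not modeled, and averaging over a partially filled window is an assumption.

```python
from collections import deque


class WindowedAlarm:
    """Hypothetical model of a windowed threshold alarm (not a DWS or DMS API)."""

    def __init__(self, threshold=90.0, window_seconds=600,
                 sample_interval=30, clear_margin=5.0):
        self.threshold = threshold
        # Clearing level: alarm threshold minus 5 percentage points.
        self.clear_threshold = threshold - clear_margin
        # 600 s window / 30 s sampling = the 20 most recent samples.
        self.samples = deque(maxlen=window_seconds // sample_interval)
        self.active = False

    def observe(self, usage_percent):
        """Record one sample and return True while the alarm is active."""
        self.samples.append(usage_percent)
        # Assumption: average over the samples collected so far,
        # even before the window is full.
        average = sum(self.samples) / len(self.samples)
        if not self.active and average > self.threshold:
            self.active = True
        elif self.active and average < self.clear_threshold:
            self.active = False
        return self.active


alarm = WindowedAlarm()
for usage in [95.0] * 20 + [87.0] * 20:
    alarm.observe(usage)
# The window average is now 87%, which lies between the clearing level (85%)
# and the threshold (90%), so the alarm raised earlier stays active.
```

The gap between the reporting level and the clearing level keeps an alarm from being repeatedly raised and cleared when the metric hovers around the threshold.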
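
Many of the calculation methods in Table 1 are simple ratios compared against a threshold. As a rough illustration, the following Python sketch reproduces three of them: instance memory usage (100 x process_used_memory/max_process_memory), the CN connection check against connection_alarm_rate, and the spill size check (MAX_SPILL_SIZE/1,024 compared with 5 GB). The function names are hypothetical, the input values are assumed to have been read from the database beforehand, and treating MAX_SPILL_SIZE as a value in MB is an assumption based on the division by 1,024.

```python
def instance_memory_usage(process_used_memory, max_process_memory):
    """Return 100 x process_used_memory / max_process_memory as a percentage."""
    if max_process_memory <= 0:
        raise ValueError("max_process_memory must be positive")
    return 100.0 * process_used_memory / max_process_memory


def connections_exceed(connections, max_connections, connection_alarm_rate=0.9):
    """True when connections / max_connections is greater than the alarm rate."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return connections / max_connections > connection_alarm_rate


def spill_exceeds(max_spill_size_mb, threshold_gb=5.0):
    """True when the spilled data, converted from MB to GB, exceeds the threshold."""
    return max_spill_size_mb / 1024 > threshold_gb


# Hypothetical sample values, for illustration only.
memory_percent = instance_memory_usage(8192, 10240)   # 80.0
too_many_connections = connections_exceed(901, 1000)  # True
large_spill = spill_exceeds(6000)                     # True
```

Each result would then be compared with the configured threshold for the corresponding severity, for example 80% for a major instance memory alarm and 90% for a critical one.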
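
The SQL on Hudi Tasks in a DWS Cluster Fail alarm in Table 1 is based on consecutive results rather than on a usage ratio: it is generated when the SQL statement fails more than three consecutive times and is cleared by the next successful execution. The following Python sketch shows one way to model that counting rule. The class ConsecutiveFailureAlarm is hypothetical and does not describe how gs_scheduler is implemented.

```python
class ConsecutiveFailureAlarm:
    """Hypothetical counter for alarms based on consecutive failures."""

    def __init__(self, limit=3):
        self.limit = limit
        self.failures = 0

    def record(self, succeeded):
        """Record one execution result and return True while the alarm is active."""
        # A success resets the streak and clears the alarm.
        self.failures = 0 if succeeded else self.failures + 1
        # "More than three consecutive times" means the fourth failure in a row.
        return self.failures > self.limit


task_alarm = ConsecutiveFailureAlarm()
states = [task_alarm.record(False) for _ in range(4)]
# states == [False, False, False, True]
cleared = task_alarm.record(True)
# cleared == False, because the successful execution clears the alarm
```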