Updated on 2024-03-14 GMT+08:00

DWS_2000000006 Node Data Disk Usage Exceeds the Threshold

Description

GaussDB(DWS) collects the usage of all disks on each node in a cluster every 30 seconds.

  • If the maximum disk usage in the last 10 minutes (configurable) exceeds 80% (configurable), a major alarm is reported. If the average disk usage is lower than 75% (that is, the alarm threshold minus 5%), this major alarm is cleared.
  • If the maximum disk usage in the last 10 minutes (configurable) exceeds 85% (configurable), a critical alarm is reported. If the average disk usage is lower than 80% (that is, the alarm threshold minus 5%), this critical alarm is cleared.

If the maximum disk usage stays above the alarm threshold, the system reports the alarm again after 24 hours (configurable).
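The raise and clear rules above can be sketched as a small state function. This is only an illustration of the documented thresholds, not the service's actual implementation; the function name `next_alarm_state` and its parameters are hypothetical.

```python
from statistics import mean

CLEAR_MARGIN = 5.0  # from the text: an alarm clears at its threshold minus 5 points


def next_alarm_state(window, threshold, active):
    """Return True if the alarm should be active after this evaluation.

    window    -- disk usage samples (percent) collected in the detection period
    threshold -- alarm threshold in percent (default 80 for major, 85 for critical)
    active    -- whether the alarm is currently raised
    """
    if not window:
        return active  # nothing sampled yet, keep the current state
    if not active:
        # Raise when the maximum usage in the window exceeds the threshold.
        return max(window) > threshold
    # Stay raised until the average usage drops below threshold - 5.
    return mean(window) >= threshold - CLEAR_MARGIN


# A 10-minute window sampled every 30 seconds holds 20 samples.
spike = [72.0] * 19 + [86.0]
print(next_alarm_state(spike, 80.0, active=False))  # major alarm is raised
print(next_alarm_state(spike, 85.0, active=False))  # critical alarm is raised
```

Because an alarm is raised on the window maximum but cleared on the window average, a single short spike is enough to raise it, while clearing requires usage to stay low across the window.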

Attributes

 Alarm ID       | Alarm Severity | Auto Clear
----------------+----------------+------------
 DWS_2000000006 | Critical/Major | Yes

Parameters

 Parameter            | Description
----------------------+-------------
 Source               | Name of the system for which the alarm is generated, for example, GaussDB(DWS).
 Cluster Name         | Cluster for which the alarm is generated.
 Location Information | IDs and names of the cluster and instance for which the alarm is generated, for example, cluster_id: xxxx-xxxx-xxxx-xxxx, cluster_name: test_dws, instance_id: xxxx-xxxx-xxxx-xxxx, instance_name: test_dws-dws-cn-cn-1-1.
 Detail Information   | Detailed information about the alarm, including the cluster, instance, disk, and threshold information. Example: CloudService=DWS, resourceId: xxxx-xxxx-xxxx-xxxx, resourceIdName: test_dws, instance_id: xxxx-xxxx-xxxx-xxxx, instance_name: test_dws-dws-cn-cn-2-1, host_name: host-192-168-1-122, disk_name: /dev/vdb, first_alarm_time: 2022-11-26 11:14:58; The average data disk usage of the node within 10 minutes is 84%, which exceeds the threshold 80%.
 Generated            | Time when the alarm is generated.
 Status               | Status of the current alarm.

Impact on the System

If the cluster data volume or temporary data spill size increases and the usage of any single disk exceeds 90%, the cluster becomes read-only, affecting customer services.

Possible Causes

  • The service data volume increases rapidly, and the cluster disk capacity configuration cannot meet service requirements.
  • Dirty data is not cleared in a timely manner.
  • Some tables are skewed, so their data is distributed unevenly and some disks fill faster than others.

Handling Procedure

  1. Check the disk usage of each node.

    1. Log in to the GaussDB(DWS) console.
    2. On the Alarms page, select the current cluster from the cluster selection drop-down list in the upper right corner and view the alarm information of the cluster in the last seven days. Locate the name of the node for which the alarm is generated and the disk information based on the location information.

    3. On the Cluster > Dedicated Cluster page, locate the row that contains the cluster for which the alarm is generated and click Monitoring Panel in the Operation column.

    4. Choose Monitoring > Node Monitoring > Disks to view the usage of each disk on the current cluster node. To view the historical monitoring information about a disk on a node, click the icon on the right to view the disk performance metrics in the last 1, 3, 12, or 24 hours.
      • If the data disk usage frequently rises and then returns to normal within a short period, the spikes are temporary and caused by service execution. In this case, adjust the alarm threshold as described in step 2 to reduce the number of reported alarms.
      • If the usage of a data disk exceeds 90%, the cluster becomes read-only and write operations fail with the error "cannot execute INSERT in a read-only transaction". In this case, delete unnecessary data as described in step 3.
      • If the usage of more than half of the data disks in the cluster exceeds 70%, the data volume in the cluster is large. In this case, clear data as described in step 4 or perform Disk Capacity Expansion.
      • If the difference between the highest and lowest data disk usage in the cluster exceeds 10%, handle data skew as described in step 5.
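The usage-based rules in step 1 can be summarized in a short sketch. It covers only the three rules that depend on current usage values (the transient-spike rule needs historical data), and the function name `triage` is hypothetical.

```python
def triage(disk_usages):
    """Map current data disk usage (percent, one value per disk) to handling steps.

    Returns the step numbers of this procedure that apply, in order.
    """
    steps = []
    if any(u > 90 for u in disk_usages):
        steps.append(3)  # a disk is above 90%: the cluster becomes read-only
    over_70 = sum(1 for u in disk_usages if u > 70)
    if over_70 > len(disk_usages) / 2:
        steps.append(4)  # more than half of the disks are above 70%
    if disk_usages and max(disk_usages) - min(disk_usages) > 10:
        steps.append(5)  # highest and lowest usage differ by more than 10 points
    return steps


print(triage([92, 75, 74, 60]))  # all three rules apply
print(triage([65, 66, 64, 67]))  # no rule applies
```

More than one rule can apply at the same time, which is why the function returns a list rather than a single step.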

  2. Check whether the alarm configuration is proper.

    1. Return to the GaussDB(DWS) management console and choose Alarms > Alarm Rule.

    2. Locate the row that contains Node Data Disk Usage Exceeds the Threshold and click Modify in the Operation column. On the Modifying an Alarm Rule page, view the configuration parameters of the current alarm.

    3. Adjust the alarm threshold and detection period. A higher alarm threshold and a longer detection period make the alarm less sensitive. For details about the GUI configuration, see Alarm Rules.
    4. If the data disks are large, you are advised to raise the threshold based on historical disk monitoring metrics. Otherwise, continue with the other steps. If the problem persists, you are advised to perform Disk Capacity Expansion.

  3. Check whether the cluster is in the read-only state.

    1. When a cluster is in read-only state, stop the write tasks to prevent data loss caused by disk space exhaustion.
    2. Return to the GaussDB(DWS) management console, click Clusters > Dedicated Clusters, locate the row that contains the abnormal cluster, and choose More > Cancel Read-Only in the Operation column.

    3. In the displayed dialog box, confirm the information and click OK to cancel the read-only state for the cluster. For details, see Removing the Read-only Status.
    4. After the read-only state is canceled, connect to the database using a client and run DROP or TRUNCATE statements to delete unnecessary data.

      You are advised to lower the disk usage to below 70%. Check whether other tables need to be handled by referring to steps 4 and 5.

  4. Check whether the usage of more than half of the data disks in the cluster exceeds 70%.

    1. Run the VACUUM FULL command to clear data. For details, see Solution to High Disk Usage and Cluster Read-Only. Connect to the database, run the following SQL statement to query tables whose dirty page rate exceeds 30%, and sort the tables by size in descending order:
      SELECT schemaname AS schema, relname AS table_name, n_live_tup AS analyze_count, pg_size_pretty(pg_table_size(relid)) AS table_size, dirty_page_rate
      FROM PGXC_GET_STAT_ALL_TABLES
      WHERE schemaname NOT IN ('pg_toast', 'pg_catalog', 'information_schema', 'cstore', 'pmk')
      AND dirty_page_rate > 30
      -- Sort by the numeric size: the table_size alias is formatted text and would sort alphabetically.
      ORDER BY pg_table_size(relid) DESC, dirty_page_rate DESC;
      
      The following is an example of the possible execution result of the SQL statement (the dirty page rate of a table is high):
       schema | table_name | analyze_count | table_size | dirty_page_rate 
      --------+------------+---------------+------------+-----------------
       public | test_table |          4333 | 656 KB     |           71.11
      (1 row)
      
    2. If the query returns any rows, run VACUUM FULL on the tables with a high dirty page rate one at a time:
      VACUUM FULL ANALYZE schema.table_name;
      

      The VACUUM FULL operation needs extra working space of about Table size x (1 – Dirty page rate), so the disk usage temporarily increases before it decreases. Before you run VACUUM FULL, ensure that the remaining space of the cluster is sufficient and that the operation will not trigger the read-only state. You are advised to start with small tables. In addition, VACUUM FULL holds an exclusive lock, which blocks access to the table while it runs. Schedule the operation at a time that does not affect services.

    3. If the query returns no rows, no table has a high dirty page rate. In this case, expand the node or disk capacity of the cluster according to the data warehouse type, so that a further increase in disk usage does not trigger the read-only state and interrupt services.
      1. Standard data warehouse + SSD cloud disk, stream data warehouse, and hybrid data warehouse: See Disk Capacity Expansion.
      2. Standard data warehouse + SSD local disk and old standard data warehouse (disk scale-out is not supported): See Scaling Out a Cluster.
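As a worked example of the VACUUM FULL space note in step 4, the following sketch estimates the temporary extra space per table and orders the tables from smallest to largest, as the note recommends. It assumes that dirty_page_rate is a percentage, as in the sample output (71.11), and the helper names and the second table are hypothetical.

```python
def vacuum_full_extra_space(table_size, dirty_page_rate_pct):
    """Estimate the temporary space VACUUM FULL needs, in the unit of table_size.

    Applies Table size x (1 - Dirty page rate), with the rate given in percent.
    """
    return table_size * (1 - dirty_page_rate_pct / 100)


def smallest_first(tables):
    """Order (name, size, dirty_page_rate_pct) tuples so small tables run first."""
    return sorted(tables, key=lambda t: t[1])


# Sizes in KB. public.test_table comes from the sample query output above;
# public.big_table is a made-up entry for illustration.
tables = [
    ("public.big_table", 4096, 40.0),
    ("public.test_table", 656, 71.11),
]
for name, size_kb, rate in smallest_first(tables):
    extra_kb = vacuum_full_extra_space(size_kb, rate)
    print(f"{name}: about {extra_kb:.1f} KB of temporary extra space")
```

For the sample table this gives about 189.5 KB, that is, 656 KB x (1 - 0.7111). The estimate is for the whole table; how it is split across nodes depends on the table's distribution.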

  5. Check whether the difference between the highest and lowest data disk usages in the cluster exceeds 10%.

    1. If the data disk usage differs greatly between disks, connect to the database and run the following SQL statement to check whether there are skewed tables in the cluster:
      SELECT schemaname, tablename, pg_size_pretty(totalsize), skewratio FROM pgxc_get_table_skewness WHERE skewratio > 0.05 ORDER BY totalsize desc;
      
      The following is an example of the possible execution result of the SQL statement:
       schemaname |      tablename      | pg_size_pretty | skewratio 
      ------------+---------------------+----------------+-----------
       scheduler  | workload_collection | 428 MB         |      .500
       public     | test_table          | 672 KB         |      .429
       public     | tbl_col             | 104 KB         |      .154
       scheduler  | scheduler_storage   | 32 KB          |      .250
      (4 rows)
      
    2. If the query returns any rows, select another distribution column for the severely skewed tables, prioritizing them by table size and skew ratio. For 8.1.0 and later versions, use the ALTER TABLE syntax to change the distribution column. For other versions, see How Do I Adjust Distribution Columns?

Alarm Clearance

After the disk usage decreases, the alarm is automatically cleared.