Updated on 2024-10-21 GMT+08:00

DWS_2000000006 Node Data Disk Usage Exceeds the Threshold

Description

GaussDB(DWS) collects the usage of all disks on each node in a cluster every 30 seconds.

  • If the maximum disk usage in the last 10 minutes (configurable) exceeds 80% (configurable), a major alarm is reported. If the average disk usage is lower than 75% (that is, the alarm threshold minus 5%), this major alarm is cleared.
  • If the maximum disk usage in the last 10 minutes (configurable) exceeds 85% (configurable), a critical alarm is reported. If the average disk usage is lower than 80% (that is, the alarm threshold minus 5%), this critical alarm is cleared.

If the maximum disk usage is always greater than the alarm threshold, the system generates an alarm again 24 hours later (configurable).
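
The reporting and clearing rules above can be sketched as follows. This is an illustration only, not GaussDB(DWS) code: the function name `evaluate_alarm`, its parameters, and the sample data are hypothetical, and the sketch assumes one sample every 30 seconds, so a 10-minute window holds 20 samples.

```python
from statistics import mean


def evaluate_alarm(samples, threshold, active, clear_margin=5.0):
    """Return True if the alarm should be active after this window.

    samples: disk usage percentages collected in the detection window
             (hypothetical input; 20 samples for 10 minutes at 30 s each).
    threshold: alarm threshold in percent (80 for major, 85 for critical).
    active: whether the alarm is currently reported.
    clear_margin: the alarm clears below threshold minus this margin.
    """
    if not active:
        # Report when the maximum usage in the window exceeds the threshold.
        return max(samples) > threshold
    # Keep the alarm until the average usage drops below threshold - margin.
    return not (mean(samples) < threshold - clear_margin)


spike = [78.0] * 19 + [81.5]
print(evaluate_alarm(spike, 80, active=False))  # True: maximum 81.5 > 80

calm = [74.0] * 20
print(evaluate_alarm(calm, 80, active=True))    # False: average 74 < 75
```

Because reporting uses the maximum and clearing uses the average minus a margin, a disk that hovers around the threshold does not flip the alarm on and off with every window.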

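Attributes

Alarm ID       | Alarm Category         | Alarm Severity                | Alarm Type      | Service Type | Auto Cleared
---------------+------------------------+-------------------------------+-----------------+--------------+-------------
DWS_2000000006 | Management plane alarm | Critical: > 85%; major: > 80% | Operation alarm | GaussDB(DWS) | Yes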

Parameters

Category             | Name            | Description
---------------------+-----------------+--------------------------------------------------
Location information | Name            | Node Data Disk Usage Exceeds the Threshold
                     | Type            | Operation alarm
                     | Generation time | Time when the alarm is generated
Other information    | Cluster ID      | Cluster details such as resourceId and domain_id

Impact on the System

If the cluster data volume or temporary data spill size increases and the usage of any single disk exceeds 90%, the cluster becomes read-only, affecting customer services.

Possible Causes

  • The service data volume increases rapidly, and the cluster disk capacity configuration cannot meet service requirements.
  • Dirty data is not cleared in a timely manner.
  • There are skew tables.

Handling Procedure

  1. Check the disk usage of each node.

    1. Log in to the GaussDB(DWS) console.
    2. On the Alarms page, select the current cluster from the cluster selection drop-down list in the upper right corner and view the alarm information of the cluster in the last seven days. Locate the name of the node for which the alarm is generated and the disk information based on the location information.
    3. On the Cluster > Dedicated Clusters page, locate the row that contains the cluster for which the alarm is generated and click Monitoring Panel in the Operation column.
    4. Choose Monitoring > Node Monitoring > Disks to view the usage of each disk on the current cluster node. To view the historical monitoring information about a disk on a node, click the icon on the right to view the disk performance metrics in the last 1, 3, 12, or 24 hours.
      • If the data disk usage frequently increases and then returns to normal within a short period, the disk usage is spiking temporarily due to service execution. In this case, adjust the alarm threshold as described in step 2 to reduce the number of reported alarms.
      • If the usage of a data disk exceeds 90%, the cluster becomes read-only and write operations fail with the error "cannot execute INSERT in a read-only transaction". In this case, refer to step 3 to delete unnecessary data.
      • If the usage of more than half of the data disks in the cluster exceeds 70%, the data volume in the cluster is large. In this case, refer to step 4 to clear data or perform Disk Capacity Expansion.
      • If the difference between the highest and lowest data disk usage in the cluster exceeds 10%, refer to step 5 to handle data skew.
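
The checks in step 1 can be summarized in a short sketch. It is illustrative only: the function `triage` and its input (a list of data disk usage percentages read from the Disks page) are hypothetical, and the temporary-spike case is left out because it depends on historical data rather than a single reading.

```python
def triage(disk_usages):
    """Map current data disk usage percentages to the follow-up steps."""
    actions = []
    # Any single disk above 90% triggers the read-only state.
    if max(disk_usages) > 90:
        actions.append("step 3: remove read-only state and delete unnecessary data")
    # More than half of the disks above 70% means the data volume is large.
    over_70 = sum(1 for usage in disk_usages if usage > 70)
    if over_70 > len(disk_usages) / 2:
        actions.append("step 4: clear data or expand disk capacity")
    # A spread of more than 10 percentage points suggests data skew.
    if max(disk_usages) - min(disk_usages) > 10:
        actions.append("step 5: check for skew tables")
    return actions


for action in triage([72.0, 74.5, 91.2, 68.0]):
    print(action)
```

With the sample readings, all three conditions hold, so the sketch lists steps 3, 4, and 5. Several steps can apply to the same cluster at once.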

  2. Check whether the alarm configuration is proper.

    1. Return to the GaussDB(DWS) management console, choose Management > Alarms and click View Alarm Rule.
    2. Locate the row that contains Node Data Disk Usage Exceeds the Threshold and click Modify in the Operation column. On the Modifying an Alarm Rule page, view the configuration parameters of the current alarm.
    3. Adjust the alarm threshold and detection period. A higher alarm threshold and a longer detection period make the alarm less sensitive. For details about the GUI configuration, see Alarm Rules.
    4. If the data disks have a large capacity, you are advised to increase the threshold based on historical disk monitoring metrics. Otherwise, continue with the following steps. If the problem persists, you are advised to perform Disk Capacity Expansion.

  3. Check whether the cluster is in the read-only state.

    1. When a cluster is in read-only state, stop the write tasks to prevent data loss caused by disk space exhaustion.
    2. Return to the GaussDB(DWS) console and choose Clusters > Dedicated Clusters. In the row of the abnormal cluster whose cluster status is Read-only, click Cancel Read-only.
    3. In the displayed dialog box, confirm the information and click OK to cancel the read-only state for the cluster. For details, see Removing the Read-only Status.
    4. After the read-only state is canceled, use a client to connect to the database and run DROP or TRUNCATE statements to delete unnecessary data.

      You are advised to lower the disk usage to below 70%. Check whether other tables need to be handled by referring to steps 4 and 5.

  4. Check whether the usage of more than half of the data disks in the cluster exceeds 70%.

    1. Run the VACUUM FULL command to clear data. For details, see Solution to High Disk Usage and Cluster Read-Only. Connect to the database and run the following SQL statement to query tables whose dirty page rate exceeds 30%, sorted by size in descending order:
      SELECT schemaname AS schema, relname AS table_name, n_live_tup AS analyze_count, pg_size_pretty(pg_table_size(relid)) AS table_size, dirty_page_rate
      FROM PGXC_GET_STAT_ALL_TABLES
      WHERE schemaname NOT IN ('pg_toast', 'pg_catalog', 'information_schema', 'cstore', 'pmk')
      AND dirty_page_rate > 30
      -- Sort by the numeric size; the formatted table_size column is text.
      ORDER BY pg_table_size(relid) DESC, dirty_page_rate DESC;
      
      The following is an example of the possible execution result of the SQL statement (the dirty page rate of a table is high):
       schema | table_name | analyze_count | table_size | dirty_page_rate 
      --------+------------+---------------+------------+-----------------
       public | test_table |          4333 | 656 KB     |           71.11
      (1 row)
      
    2. If the query returns any rows, run VACUUM FULL on the tables with a high dirty page rate one at a time.
      VACUUM FULL ANALYZE schema.table_name;
      

      The VACUUM FULL operation occupies extra defragmentation space, which is Table size x (1 – Dirty page rate). As a result, the disk usage temporarily increases and then decreases. Ensure that the remaining space of the cluster is sufficient and will not trigger read-only when the VACUUM FULL operation is performed. You are advised to start from small tables. In addition, the VACUUM FULL operation holds an exclusive lock, during which access to the operated table is blocked. You need to properly arrange the execution time to avoid affecting services.

    3. If the query returns no rows, no table has a high dirty page rate. Expand the node or disk capacity of the cluster based on the data warehouse type, so that a further increase in disk usage does not trigger the read-only state and interrupt services.
      1. Standard data warehouse using SSD cloud disk and hybrid data warehouse: See Disk Capacity Expansion.
      2. Standard data warehouse + SSD local disk and old standard data warehouse (disk scale-out is not supported): See Scaling Out a Cluster.
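
The space estimate for VACUUM FULL in this step, Table size x (1 – Dirty page rate), can be worked through in a short sketch. The helper names and the capacity figures are hypothetical. The safety check also makes a simplifying assumption that the whole table sits on one disk, which overstates the per-disk impact for a table distributed across nodes.

```python
def vacuum_full_extra_space(table_size, dirty_page_rate_pct):
    """Estimate the temporary space that VACUUM FULL needs.

    Applies Table size x (1 - Dirty page rate). The rate is given in
    percent, as in the dirty_page_rate column of the query result.
    The result has the same unit as table_size.
    """
    return table_size * (1 - dirty_page_rate_pct / 100.0)


def stays_below_read_only(used, capacity, extra, read_only_pct=90.0):
    """Check that the temporary peak stays below the read-only threshold."""
    return (used + extra) / capacity * 100.0 < read_only_pct


# The sample row from the query result: 656 KB with a 71.11% dirty page rate.
extra_kb = vacuum_full_extra_space(656, 71.11)
print(round(extra_kb, 1))  # 189.5

# Hypothetical disk: 800 GB used of 1000 GB, and a table needing 50 GB extra.
print(stays_below_read_only(800, 1000, 50))  # True: the peak is 85%
```

A table with a high dirty page rate needs little extra space, which is one more reason to start with small, heavily fragmented tables.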

  5. Check whether the difference between the highest and lowest data disk usages in the cluster exceeds 10%.

    1. If the data disk usage differs greatly, connect to the database and run the following SQL statement to check whether there are skew tables in the cluster:
      SELECT schemaname, tablename, pg_size_pretty(totalsize), skewratio FROM pgxc_get_table_skewness WHERE skewratio > 0.05 ORDER BY totalsize DESC;
      
      The following is an example of the possible execution result of the SQL statement:
       schemaname |      tablename      | pg_size_pretty | skewratio 
      ------------+---------------------+----------------+-----------
       scheduler  | workload_collection | 428 MB         |      .500
       public     | test_table          | 672 KB         |      .429
       public     | tbl_col             | 104 KB         |      .154
       scheduler  | scheduler_storage   | 32 KB          |      .250
      (4 rows)
      
    2. If the query returns any rows, select another distribution column for each severely skewed table, prioritizing tables by size and skew ratio. For 8.1.0 and later versions, use the ALTER TABLE syntax to adjust the distribution column. For other versions, see How Do I Adjust Distribution Columns?

Alarm Clearance

After the disk usage decreases, the alarm is automatically cleared.