Help Center/ GaussDB(DWS)/ Troubleshooting/ Cluster Performance/ Solution to High Disk Usage and Cluster Read-Only
Updated on 2024-03-13 GMT+08:00

Solution to High Disk Usage and Cluster Read-Only

Checking the Disk Usage

GaussDB(DWS) disk space is a high-value resource for users. It is closely related to cluster availability. Therefore, you need to pay close attention to the disk space and make response in a timely manner (in this section, disks refer to data disks).

To check the disk space, perform the following steps:

  1. Log in to the GaussDB(DWS) management console. Click Clusters on the left navigation pane. In the cluster list, click Monitoring on the right of the row that contains the desired cluster. The Monitoring page is displayed.
  2. Choose Monitoring > Node Monitoring. On the displayed page, choose the Disks tab. You can click on the right to sort the records based on disk usage.

    To identify a data disk, check the disk capacity. If the disk capacity equals to the purchased capacity, the disk is a data disk.

Fault Scenarios

  • Scenario 1: High Disk Usage: The usage of all disks or more than half of the disks in the current cluster is greater than or equal to 70%.
  • Scenario 2: Disk Skew: The difference between the highest usage and the lowest usage is greater than or equal to 10%.
  • Scenario 3: Cluster Read-only. (The current read-only threshold is that the usage of a single data disk is greater than or equal to 90%.)

During routine processing, the database administrator can use the Space Management and Control to identify and block abnormal workloads to avoid the preceding scenarios.

  1. Read-only is a self-protection mechanism of the GaussDB(DWS) system. It prevents GaussDB(DWS) instance startup failures caused by 100% disk usage.
  2. In the preceding scenarios, the DMS sends alarm notifications.
  3. In scenario 1, you can use the alarm subscription function to receive SMS messages or emails when the disk usage exceeds 70% or 75%. This allows you to clear data in advance. For details, see Subscribing to Disk Space Alarms.

Scenario 1: Clearing Data When the Disk Usage Is High

Periodically delete dirty data based on the query results. The method varies according to the cluster version.

  • 8.1.3 and later versions: Use the Intelligent O&M function on the management console to automatically clear dirty data.
  1. Log in to the GaussDB(DWS) management console.
  2. Click the name of the target cluster.
  3. In the navigation pane, choose Intelligent O&M.
  4. Click the O&M Plan tab. Click Add O&M Task.

  5. The Add O&M Task page is displayed.

    • Select Vacuum for O&M Task.
    • Set Scheduling Mode to Auto. GaussDB(DWS) automatically scans tables that require VACUUM operation.
    • Select System catalogs or User tables for Autovacuum.
      • If there are a large number of UPDATE and DELETE operations, select the User tables.
      • If there are a large number of CREATE and DELETE operations, select System catalogs.

  6. Click Next: Configure Schedule to configure the schedule and Vacuum type. You are advised to select Periodic for Task Type. The GaussDB(DWS) automatically executes VACUUM in your selected time windows.

    For automatic Vacuum O&M tasks, the system uses the VACUUM FULL operation to process user tables. VACUUM FULL holds a level 8 lock, which blocks other transactions. Other transactions will be in lock waiting during VACUUM FULL execution. After 20 minutes, a timeout error is reported. Therefore, do not perform other transactions in the configured time window.

  7. After confirming that the information is correct, click Next to complete the configuration.

  • 8.1.2 and earlier versions: Run the VACUUM FULL command to clear data.
    1. VACUUM FULL locks a table. During VACUUM FULL, all accesses to the table are blocked and wait until VACUUM FULL ends. Please schedule properly to prevent table locking from affecting services.
    2. VACUUM FULL is used to extract valid data from the current table and delete dirty data. This operation temporarily occupies extra space (the space will be released after the data is deleted). Therefore, the space increases and then decreases during VACUUM FULL, calculate the space required by VACUUM FULL in advance. (Extra space = Table size x (1 – Dirty page rate))
  1. Connect to the database, run the following SQL statement to query large tables whose dirty page rate exceeds 30%, and sort the tables by size in descending order:

    1
    2
    3
    4
    5
    SELECT schemaname AS schema, relname AS table_name, n_live_tup AS analyze_count, pg_size_pretty(pg_table_size(relid)) as table_size, dirty_page_rate
    FROM PGXC_GET_STAT_ALL_TABLES
    WHERE schemaName NOT IN ('pg_toast', 'pg_catalog', 'information_schema', 'cstore', 'pmk')
    AND dirty_page_rate > 30
    ORDER BY table_size DESC, dirty_page_rate DESC;
    

  2. Check whether any command output is displayed.

    • If yes, run the commands in 3 for a table larger than 10 GB.
    • If no, no further action is required.

  3. Run the VACUUM FULL command to clear the top 5 tables with the most dirty pages. If the maximum disk usage is greater than 70%, clear the tables one by one.

    1
    VACUUM FULL ANALYZE schema.table_name;
    

  4. If there is no table with a high dirty page rate and the disk usage is close to or exceeds 75%, expand the node or disk capacity of the cluster based on the following data warehouse types to prevent cluster read-only.

Scenario 2: Disk Skew and Clearing Skew Tables

If the skew rate of a table on a single DN is greater than or equal to 5%, you are advised to reselect a distribution key for the table and redistribute data.

  1. Impacts of skewed tables: Skew tables may cause severe skew in operator computing or spilling. As a result, some DNs may be overloaded, and the advantages of GaussDB(DWS) distributed computing cannot be fully utilized, affecting system performance. Besides, they are easily to cause disk space exhaustion on a single DN.
  2. In 8.1.3 and later versions, tables are created in polling (RoundRobin) mode by default (see RoundRobin). If you are not familiar with distribution keys, you can use the ROUNDROBIN keyword to create tables to simplify the service development (see Application Scenarios of Polling Tables and Hash Tables).
  1. Connect to the database and run the following SQL statement to query the skew tables:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    WITH skew AS
    (
            SELECT
                    schemaname,
                    tablename,
                    pg_catalog.sum(dnsize) AS totalsize,
                    pg_catalog.avg(dnsize) AS avgsize,
                    pg_catalog.max(dnsize) AS maxsize,
                    pg_catalog.min(dnsize) AS minsize,
                    (pg_catalog.max(dnsize) - pg_catalog.min(dnsize)) AS skewsize,
                    pg_catalog.stddev(dnsize) AS skewstddev
            FROM pg_catalog.pg_class c
            INNER JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
            INNER JOIN pg_catalog.gs_table_distribution() s ON s.schemaname = n.nspname AND s.tablename = c.relname
            INNER JOIN pg_catalog.pgxc_class x ON c.oid = x.pcrelid AND x.pclocatortype IN('H', 'N')
            GROUP BY schemaname,tablename
    )
    SELECT
            schemaname,
            tablename,
            totalsize,
            avgsize::numeric(1000),
            (maxsize/totalsize)::numeric(4,3)  AS maxratio,
            (minsize/totalsize)::numeric(4,3)  AS minratio,
            skewsize,
            (skewsize/avgsize)::numeric(4,3)  AS skewratio,
            skewstddev::numeric(1000)
    FROM skew
    WHERE totalsize > 0;
    

    For details about how to query skew tables, see section Data Skew Causes Slow SQL Statement Execution and Operations Fail on Large Tables.

  2. Select another distribution column for the table with severe skew based on the table size and skew rate. In 8.1.0 and later versions, use the ALTER TABLE syntax to change the distribution column. For other versions, see How Do I Change Distribution Columns.

Scenario 3: Cluster Read-only.

When the usage of a single disk in the cluster exceeds 90%, the system automatically triggers read-only for the cluster. In this case, if you continue to execute write services (DML or DDL), an error message similar to "ERROR: cannot execute xxx in a read-only transaction." is displayed.

Read-only is a mechanism of GaussDB(DWS) to protect user data. It prevents instance startup failures caused by 100% disk usage.

After the cluster is read-only, you need to perform operations such as DROP and TRUNCATE to delete unnecessary data as soon as possible to reduce the space usage to less than 80%. Then handle tables mentioned in scenarios 1 and 2 to prevent the disk usage increase caused by VACUUM FULL.
  • If the cluster version is 8.1.3 or later, perform the following steps:
  1. When a cluster is in read-only status, stop the write tasks to prevent data loss caused by disk space exhaustion.
  2. Use the client to connect to the database, disable read-only through START TRANSACTION, and run the DROP/TRUNCATE TABLE command to clear unnecessary data to ensure that the disk usage is less than 80%.

    Data clearing method 1:
    1
    2
    3
    START TRANSACTION READ WRITE;
    drop/truncate table table_name;
    COMMIT;
    
    Data clearing method 2:
    1
    2
    3
    4
    START TRANSACTION;
    SET transaction_read_only=off;
    drop/truncate table table_name;
    COMMIT;
    

    After the clearing is complete, the system automatically cancels the read-only mode.

  3. Check the tables mentioned in scenarios 1 and 2 to see whether there are tables that need to be handled. If no, you are advised to scale out the cluster as soon as possible. Based on the data warehouse type, scale-out is classified into node scale-out and disk scale-out.

  • For cluster 8.1.2 and earlier versions, perform the following steps:
  1. When a cluster is in read-only status, stop the write tasks to prevent data loss caused by disk space exhaustion.
  2. Log in to the GaussDB(DWS) management console and cancel the read-only state of the cluster.

    1. Log in to the GaussDB(DWS) management console. Click Clusters. All clusters will be displayed by default.
    2. In the Operation column of the target cluster, choose More > Cancel Readonly.

    3. In the dialog box that is displayed, click OK to confirm and cancel the read-only status for the cluster.

  3. After the read-only state is canceled, run the DROP/TRUNCATE command to delete unnecessary data. Ensure that the disk usage is less than 80%.
  4. Check the tables mentioned in scenarios 1 and 2 to see whether there are tables that need to be handled. If no, you are advised to scale out the cluster as soon as possible. Based on the data warehouse type, scale-out is classified into node scale-out and disk scale-out.

Space Management and Control

GaussDB(DWS) supports the sql_use_spacelimit and temp_file_limit parameters for statement disk space control. These parameters prevent disk usage spikes that can cause alarms or read-only mode during service operation. They also help detect services that exchange or import too much data to the database.

  1. Log in to the GaussDB(DWS) console, click Clusters on the left, and click the desired cluster. The cluster details page is displayed.
  2. Click Parameters, search for sql_use_spacelimit and temp_file_limit (see Disk Space) in the search box, and adjust them.

    You are advised to set sql_use_spacelimit to 10% of the total capacity. For example, if the purchased space is 100 GB per node, set sql_use_spacelimit to 10 GB.
    After the preceding configuration takes effect, if the space usage of a SQL statement exceeds the value of this parameter, the SQL statements will be terminated. To temporarily disable the function, run the following statement in the session:
    1
    SET sql_use_spacelimit=0;
    

Subscribing to Disk Space Alarms

To reduce customers' O&M pressure, GaussDB(DWS) provides the alarm subscription function. When the disk usage of a cluster is greater than the preset threshold, the system notifies you through SMS messages or emails.

  1. Set an alarm threshold:

    1. Log in to the GaussDB(DWS) console, choose Alarms on the left, and click Alarm Rule Management.

    2. In the alarm rule list, click Modify on the right of Node Data Disk Usage Exceeds the Threshold.
    3. In the cluster list, click the name of the target cluster. The Cluster Information page is displayed. Set the alarm policy as follows:
      • Associated Cluster: all clusters
      • Alarm Policy: Set the threshold of important alarms to 70% and the duration to Last 10 minutes, and set the threshold of urgent alarms to 75% and the duration to Last 10 minutes. Set the parameters as follows:

  2. Create a topic on the Simple Message Notification (SMN) console.

    1. Switch to the SMN console and click Create Topic. Set Topic Name and Enterprise Project as follows.

    2. After the topic is created, click Add Subscription on the right. Select SMS or Email, and enter the mobile number or email address in the Endpoint text box.

    3. The entered mobile number or email address will receive a confirmation SMS message or email. Click OK to complete the subscription.

  3. Add an alarm subscription.

    1. Return to the GaussDB(DWS) console, choose Alarms on the left, click Subscriptions, and click Create Subscription.
    2. Set Subscription Name to dms_alarm, Alarm Severity to Urgent and Important, and SMN Topic to dms_alarm created in the previous step.

    3. Click OK. After the alarm subscription is configured, a notification will be sent when the disk usage exceeds 70% or 75%.