
Reducing I/O Usage

Symptom

In real-world GaussDB(DWS) service scenarios, there are numerous performance issues caused by high I/O and I/O bottlenecks. These problems mostly stem from improper application service design. This document provides guidance on optimizing services to enhance I/O efficiency and minimize slow I/O operations in common SQL scenarios.

Determining I/O Bottlenecks and Identifying Statements with High I/O

  1. This section describes the basic knowledge of SQL-level I/O problem analysis.

  2. Check and determine the I/O bottleneck using the pgxc_thread_wait_status view. For more statuses, see PG_THREAD_WAIT_STATUS.

    SELECT wait_status, wait_event, count(*) AS cnt
    FROM pgxc_thread_wait_status
    WHERE wait_status <> 'wait cmd'
      AND wait_status <> 'synchronize quit'
      AND wait_status <> 'none'
    GROUP BY 1, 2
    ORDER BY 3 DESC
    LIMIT 50;

    The table below illustrates the status when an I/O bottleneck arises.

    Table 1 Common I/O statuses

    Waiting Status: wait io (waiting for I/O completion)
    Waiting Event:
    • BufFileRead: reading data from a temporary file into a specified buffer
    • BufFileWrite: writing the content of a specified buffer to a temporary file
    • DataFileRead: synchronously reading a table data file
    • DataFileWrite: writing content to a table data file
    • ......

    Waiting Status: acquire lwlock (waiting to acquire a lightweight lock)
    Waiting Event: WALWriteLock, which prevents concurrent WAL writes to disks

    Waiting Status: wait wal sync (waiting for the WAL of a specific LSN to be synchronized to the standby node)
    Waiting Event: N/A

    Waiting Status: wait data sync (waiting for data page synchronization to the standby node)
    Waiting Event: N/A

    Waiting Status: Material | Material - write file (the current operator is Material; "write file" indicates that the Material operator is writing data to disks)
    Waiting Event: N/A

  3. Obtain SQL statements with high I/O consumption.

    Run OS commands to identify the threads with high I/O usage, and then find the service SQL statements with high I/O consumption based on the GaussDB(DWS) thread ID. The iowatcher tool can be used for this purpose.
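
    Once the bottleneck is confirmed, the statements currently waiting on I/O can also be listed from within the database. The following is a minimal sketch; it assumes that the PGXC_THREAD_WAIT_STATUS and PGXC_STAT_ACTIVITY views are available in your cluster version and joins them by query_id.

    -- List statements whose threads are currently in I/O waits
    -- (assumes PGXC_THREAD_WAIT_STATUS and PGXC_STAT_ACTIVITY are available).
    SELECT a.query_id,
           a.query,
           count(*) AS io_wait_threads
    FROM pgxc_thread_wait_status s
    JOIN pgxc_stat_activity a ON s.query_id = a.query_id
    WHERE s.wait_status = 'wait io'
    GROUP BY a.query_id, a.query
    ORDER BY io_wait_threads DESC;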

Scenario 1: Small CU Bloat in Column Storage

A SQL statement takes 43248 ms to query 390871 records. Analysis shows that most of the time is consumed by Cstore Scan.

In the Cstore Scan details, only about 20,000 records are scanned on each DN, yet 155079 CUs containing data (CUSome) and 156375 CUs containing no data (CUNone) are scanned. This indicates CU bloat: there are many small CUs and many CUs that do not match the filter.

Trigger condition: High-frequency small-batch import to column-store tables (especially partitioned tables) causes CU bloat.

Solution

  1. Use large batches to import column-store data. It is recommended that the amount of data to be imported to a single partition in a single batch be greater than the number of DNs multiplied by 60,000.
  2. If data cannot be imported using large batches, perform periodic VACUUM FULL on column-store tables where data is frequently imported in small batches.
  3. When small CUs bloat rapidly, frequent VACUUM FULL operations consume a large amount of I/O and may aggravate the I/O bottleneck of the entire system. In this case, replace the affected tables with row-store tables. (If CUs bloat severely for a long time, column-store tables lose their advantages in storage space and sequential scan performance.)
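
For example, measure 2 above can be implemented as a periodic maintenance job. The following is a minimal sketch (the table name is hypothetical); schedule it during off-peak hours:

    -- Periodically merge small CUs and reclaim space in a column-store table
    -- that receives frequent small-batch imports (hypothetical table name).
    VACUUM FULL sales_fact_col;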

Scenario 2: Dirty Data Cleanup

A SQL statement takes 2.519 seconds to execute, of which 2.516 seconds are spent on scanning. However, none of the 20,480 scanned records match the filter conditions, so all of them are filtered out. The scanning time is too long for the amount of data actually returned, indicating that scanning and I/O efficiency are significantly reduced by a large amount of dirty data.

The dirty page rate of the table is 99%. After the VACUUM FULL operation is performed, the performance is optimized to about 100 ms.
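
A minimal sketch for checking the dirty page rate before deciding to run VACUUM FULL, assuming the PGXC_GET_STAT_ALL_TABLES view is available in your cluster version (the table name is hypothetical):

    -- Check the dirty page rate of a suspect table.
    SELECT schemaname, relname, n_live_tup, n_dead_tup, dirty_page_rate
    FROM pgxc_get_stat_all_tables
    WHERE relname = 'orders';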

Trigger condition: Frequent UPDATE or DELETE operations on tables without timely VACUUM FULL operations leave a large amount of dirty data.

Solution

  1. To address this issue, it is recommended to periodically perform VACUUM FULL on tables that are frequently updated or deleted to remove dirty data. Since VACUUM FULL is I/O-intensive for large tables, it should be scheduled during off-peak hours to avoid increasing I/O pressure during peak hours.
  2. If dirty data accumulates rapidly, excessive VACUUM FULL operations can consume a significant amount of I/Os and worsen the overall I/O bottleneck of the system. In such cases, it is important to investigate and identify the source of the dirty data. For scenarios where data is frequently deleted, the following solutions can be used:

    1. Use TRUNCATE instead of DELETE and use temporary tables.
    2. If data generated in a specific period needs to be deleted regularly, use partitioned tables and replace the DELETE with TRUNCATE or DROP on the corresponding partitions, as shown in the sketch below.
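
The following is a minimal sketch of the two measures above; all table and partition names are hypothetical:

    -- 1. Replace a whole-table DELETE with TRUNCATE (for example, on a staging table).
    TRUNCATE stage_orders;

    -- 2. For period-based cleanup, partition the table by time and drop or truncate
    --    whole partitions instead of running DELETE.
    ALTER TABLE orders_part DROP PARTITION p_202401;
    ALTER TABLE orders_part TRUNCATE PARTITION p_202402;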

Scenario 3: Table Storage Skew

For instance, during a table scan, the maximum A-time across DNs is 6554 ms while the minimum is 0 s. The scan time differs between DNs by more than a factor of 10, which indicates table storage skew.

Run table_distribution. It is found that all data is skewed to dn_6009. After the distribution column is modified to evenly distribute the table storage, the maximum and minimum DN A-time remain at the same level (about 400 ms), and the scan time is reduced from 6554 ms to 431 ms.

Trigger condition: Improper selection of the distribution column during data distribution can lead to storage skew and unbalanced pressure between DNs. This results in high I/O pressure on a single DN and overall decreased I/O efficiency.

Solution: Modify the distribution key of the table to ensure even distribution. For how to select a distribution key, see Selecting a Distribution Key.
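
A minimal sketch of checking the distribution and changing the distribution key follows; the schema, table, and column names are hypothetical, and whether ALTER TABLE ... DISTRIBUTE BY is supported depends on the cluster version (otherwise, recreate the table with the new distribution key and reload the data):

    -- Check how the table data is spread across DNs.
    SELECT * FROM table_distribution('public', 'orders');

    -- Redistribute by a high-cardinality, evenly distributed column.
    ALTER TABLE public.orders DISTRIBUTE BY HASH (order_id);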

Scenario 4: No Indexes or Indexes Are Not Used

The Seq Scan for a point query takes 3767 ms. It scans 4096000 records to find only 8240 matches, making the query well suited to Index Scan. However, even after an index is added to the filter column, the plan still uses Seq Scan instead of Index Scan.

After the target table is analyzed, the plan automatically selects the index, improving performance from more than 3 seconds to about 2 milliseconds and significantly reducing I/O consumption.

Typical cases: When querying a large row-store table, only a small portion of the data is required, but a sequential scan is used instead of an index scan, which lowers I/O efficiency. This typically happens in two cases:

  • No index is created for the filter column.
  • An index exists, but Index Scan is not used.

Trigger conditions:

  • No index is created for the frequently used filter columns.
  • Data in the table is not analyzed promptly due to data changes after DML operations are performed. As a result, the optimizer cannot select an index scan plan. For details about ANALYZE, see ANALYZE | ANALYSE.

Solution:

  1. Add indexes to frequently used filter columns in row-store tables. The basic index design principles are as follows:

    • Choose columns that have many distinct values and are frequently used as filters. For multiple filter columns, use a composite index and put the column with more distinct values first. Try to limit the number of indexes to 3 or fewer.
    • Importing a large amount of data into a table with indexes generates a significant amount of extra I/O. Therefore, carefully manage the number of indexes when importing large amounts of data; it is advised to delete the indexes before the import and recreate them after the import is complete.

  2. Perform ANALYZE regularly on tables where DML operations are frequently performed. The main scenarios are as follows:

    • Table data is generated for the first time.
    • Frequent INSERT, UPDATE, and DELETE operations are performed on a table.
    • The newly inserted data needs to be accessed immediately.
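
A minimal sketch of solution items 1 and 2; the table and column names are hypothetical:

    -- 1. Index the frequently used filter column of the row-store table.
    CREATE INDEX idx_orders_customer_id ON orders (customer_id);

    -- 2. Refresh statistics after heavy DML so the optimizer can choose the index scan.
    ANALYZE orders;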

Scenario 5: No Partition or No Pruning for Partitions

For example, a service table is filtered by the createtime column to retrieve data for a specific time. The table is partitioned, but partition pruning is not performed because a large number of partitions are selected, so the scan takes 701,785 milliseconds and I/O efficiency is poor.

By adding the partition key createtime as a filter, partition pruning is enabled during the partitioned scan (given that the number of selected partitions is small). This improves performance from 700 seconds to 10 seconds, significantly boosting the I/O efficiency.

Common scenarios: For large time-based tables, most queries only access data for the current day or the last few days. To improve I/O efficiency, partition pruning should restrict the scan to the relevant partitions based on the partition key. However, there are certain cases where partition pruning does not occur:

  • The table is not designed as a partitioned table.
  • The partition key is not used as a filter.
  • The partition key is used as a filter, but a function or expression is applied to the column, so pruning cannot take effect.

Possible cause: Tables are not partitioned or partitions are not pruned, causing low scan efficiency.

Solution:

  • Design time-based large tables as partition tables.
  • Select a column that has many distinct values and is frequently used as a filter, such as the time column, as the partition key.
  • Use a partition interval close to the one used by frequent queries. For column-store tables, a short partition interval (such as hourly) creates many small files, so the interval should be one day or longer.
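
A minimal sketch of the points above: a daily range-partitioned table with the time column as the partition key, queried with a filter directly on that key (all names and boundaries are hypothetical):

    -- Range-partition a time-based table by day.
    CREATE TABLE user_log (
        id         bigint,
        createtime timestamp,
        detail     text
    )
    WITH (orientation = column)
    DISTRIBUTE BY HASH (id)
    PARTITION BY RANGE (createtime)
    (
        PARTITION p20250101 VALUES LESS THAN ('2025-01-02 00:00:00'),
        PARTITION p20250102 VALUES LESS THAN ('2025-01-03 00:00:00')
    );

    -- Filter directly on the partition key so that pruning can take effect.
    SELECT count(*) FROM user_log
    WHERE createtime >= '2025-01-01 00:00:00' AND createtime < '2025-01-02 00:00:00';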

Scenario 6: Calculating the Count Value in a Row-Store Table

For example, in a large row-store table, counting the entire table with no filter, or with a filter that excludes only a small amount of data, is a typical scenario. The scan takes 43 seconds and consumes a significant amount of I/O resources. When multiple count jobs run concurrently in the system, the system's I/O stays at 100%, resulting in an I/O bottleneck and decreased performance.

Compared to column-store tables with the same data volume (A-rows is 40960000), scanning column-store tables only takes 14 milliseconds and has very low I/O usage.

Possible cause: The storage mode of row-store tables makes full table scans inefficient, resulting in constant I/O usage when scanning large tables.

Solution:

  • Perform only necessary table counts and lower the concurrency of count jobs.
  • Use column-store tables to improve the I/O efficiency.
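
A minimal sketch of the second measure, keeping a column-store copy for count-style analytics; the table names are hypothetical:

    -- Create a column-store copy of the row-store table.
    CREATE TABLE sales_col
    WITH (orientation = column)
    AS SELECT * FROM sales_row;

    -- Counting the column-store copy touches far less I/O than a row-store full scan.
    SELECT count(*) FROM sales_col;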

Scenario 7: Calculating the Max Value in a Row-Store Table

For example, retrieving the maximum value of a column in a row-store table takes 26,772 milliseconds. When such tasks are executed concurrently, the system's I/O remains at 100%, causing an I/O bottleneck and decreased performance.

However, after adding an index to the max column, the execution time of the statement is reduced from 26 seconds to 32 milliseconds, significantly reducing I/O consumption.

Possible cause: To find the maximum value in a row-store table, every value in the column must be scanned. Scanning this much data keeps the I/O constantly busy.

Solution: Create a B-tree index on the column whose maximum value is queried. Because the index keeps the values ordered, the maximum can be located without scanning the whole table, which speeds up the query and lowers I/O usage.
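
A minimal sketch; the table and column names are hypothetical:

    -- Index the column whose maximum value is frequently queried.
    CREATE INDEX idx_meter_read_time ON meter_data (read_time);

    -- The maximum can now be located through the index instead of a full table scan.
    SELECT max(read_time) FROM meter_data;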

Scenario 8: Importing a Large Amount of Data with Indexes

Synchronizing customer data to GaussDB(DWS) takes a long time and puts the cluster under heavy I/O pressure.

The wait view shows many threads in the wait wal sync and WALWriteLock states, indicating that Xlog synchronization is in progress.

Possible cause: Importing a large amount of data (through INSERT, COPY, or MERGE) into a table with multiple indexes generates a large number of Xlogs. This slows down synchronization between the active and standby nodes, keeps the standby node in the Catchup state for an extended period, and causes I/O usage to spike.

Solution:

  • Strictly control the number of indexes in each table. It is recommended that the number of indexes be less than or equal to 3.
  • Before importing a large amount of data, delete the indexes. After the data is imported, create the indexes again.
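
A minimal sketch of dropping and recreating indexes around a bulk load; all names are hypothetical, and the INSERT stands in for the actual COPY or MERGE statement:

    -- Drop secondary indexes before the bulk load.
    DROP INDEX IF EXISTS idx_orders_customer_id;

    -- Bulk-load the data, for example with INSERT ... SELECT or COPY.
    INSERT INTO orders SELECT * FROM orders_stage;

    -- Recreate the indexes after the load completes.
    CREATE INDEX idx_orders_customer_id ON orders (customer_id);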

Scenario 9: Querying a Large Row-Store Table for the First Time

A customer's standby DN is continuously in the catchup state, the I/O pressure is high, and an SQL statement stays in the wait wal sync status.

A check of the service shows that a query statement has been running for a long time. After the statement is killed, the fault is cleared.

Possible cause: When a large volume of data is imported into a row-store table in the database, the first query triggers the generation of a large number of Xlogs due to page hints. This further hampers the synchronization between the active and standby nodes and consumes a significant amount of I/O resources.

Solution:

  • If a large volume of newly imported data needs to be accessed at once, import the data into a column-store table.
  • Disable the wal_log_hints and enable_crc_check parameters. (This method is not recommended because data may be lost during the fault period.)
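
If the second option is ever considered, the current values can be checked first. This is a minimal sketch and assumes both parameters exist in your cluster version:

    -- Check the current settings before deciding whether to change them.
    SHOW wal_log_hints;
    SHOW enable_crc_check;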

Scenario 10: Multiple Small Files and High IOPS

During the execution of a service, the IOPS of the entire cluster increases sharply. In addition, when a cluster fault occurs, rebuilding is slow and the IOPS surges. The table information is queried as follows:

    SELECT relname, reloptions, partcount
    FROM pg_class c
    INNER JOIN (
        SELECT parentid, count(*) AS partcount
        FROM pg_partition
        GROUP BY parentid
    ) s ON c.oid = s.parentid
    ORDER BY partcount DESC;

Possible cause: A service database has a large number of column-store multi-partition tables (over 3000). This creates a huge number of small files (over 20 million per DN), lowers access efficiency, and slows down cluster rebuilding. In addition, rebuilding consumes a large amount of IOPS, hurting service performance.

Solution:

  • Reduce the number of partitions in the column-store table by modifying the partition interval to reduce the number of files.
  • Change the column-store tables to row-store tables, which do not generate as many small files.
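
A minimal sketch of the first measure, recreating a table with a coarser (daily) partition interval; all names, columns, and boundaries are hypothetical:

    -- Recreate the log table with day-level instead of hour-level partitions.
    CREATE TABLE svc_log_daily (
        id       bigint,
        log_time timestamp,
        msg      text
    )
    WITH (orientation = column)
    DISTRIBUTE BY HASH (id)
    PARTITION BY RANGE (log_time)
    (
        PARTITION p20250101 VALUES LESS THAN ('2025-01-02 00:00:00'),
        PARTITION p20250102 VALUES LESS THAN ('2025-01-03 00:00:00')
    );

    -- Move the data; after verifying it, drop the old table and rename the new one.
    INSERT INTO svc_log_daily SELECT * FROM svc_log;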

Summary

Based on the previous scenarios, there are two ways to enhance I/O usage efficiency: improving I/O storage and enhancing computing/access efficiency.

  • Improving storage efficiency includes merging small CUs, reducing dirty data, and eliminating storage skews.
  • Improving computing efficiency includes partition pruning and using index scan. You can choose an improvement method based on actual scenarios.