DWS 3.0 Decoupled Storage and Compute Usage Suggestions and Performance Optimization
Scenarios
The newly released DWS 3.0 version provides resource pooling, massive storage, and the MPP architecture with decoupled compute and storage. This enables high elasticity, real-time data import and sharing, and lake warehouse integration.
For more information about decoupled compute and storage, see What Is GaussDB(DWS)?.
This document describes the performance optimization and precautions of the decoupled storage-compute version.
Purchasing a Cluster
- EVS disk space
In version 9.1.0.x, only column-store user data is stored on OBS; other data remains on local disks. Consequently, even in the storage-compute decoupled architecture, EVS disks still need to be configured, and they cannot be minimally sized. The storage-compute decoupling capability will continue to evolve in future versions.
Table 1 Storage of different table types

| Table Type | Storage Location | Scenario |
|---|---|---|
| Row-store table, temporary table, and column-store index | Local, no compression | Point query, real-time small-batch import, and frequent update |
| Column-store table 2.0 | Local, compressed | Batch import, query, real-time small-batch import, point query, and update |
| Column-store table 3.0 | OBS, compressed | Batch import, query, and low-frequency batch update |
EVS storage holds row-store tables, column-store metadata (min/max), indexes, delta tables, WALs, the OBS data cache, and temporary files generated during computation (sort/hash). The size can be specified when you purchase a GaussDB(DWS) cluster.
Formula for calculating the required total EVS storage size (a worked example follows the notes below):
(2 copies x (Row-store table size + Index size + Delta table size) + OBS hot data cache size)/0.8 (reserved)
When EVS storage usage exceeds 90%, the cluster becomes read-only. The reserved 10% of space is used for storing WALs and temporary files.
- OBS hot data cache size: if the size of the hot data is clearly known, use it directly. If the data volume is unknown, use 30% of the total OBS data volume.
- Size of a column-store index = Size of the original uncompressed data x Index column width x 3 (bloat rate) / Total column width
- Column-store data is assumed to compress at a 3x ratio. For example, if a table has 20 columns and two of them form the primary key, the index amounts to about 30% of the uncompressed data, which is roughly the size of the compressed data.
- Delta size of a table (or partition): max(10 GB, compressed table size/10)
- Row-store indexes are estimated at 30% of the row-store data size.
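For reference, a worked example of the formula above, using purely illustrative figures: assume 1,000 GB of row-store tables, 300 GB of indexes, 100 GB of delta tables, and a 500 GB OBS hot data cache.

Required EVS storage = (2 x (1,000 GB + 300 GB + 100 GB) + 500 GB)/0.8 = 3,300 GB/0.8 = 4,125 GB (about 4.1 TB)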
Recommended EVS disk space: Because the calculation above is complex, a simpler rule is to set the total EVS disk space to the total data volume after compression, assuming a 5x compression ratio. If the table has no index (EVS mainly serves as cache), set the total EVS disk space to 30% to 50% of the total data volume (excluding archived data) and increase the disk cache size accordingly (refer to the following section for details).
Minimum capacity
- Performance-sensitive customers: Ensure that each primary/standby DN has a minimum disk capacity of 500 GB to achieve a throughput of 350 MB/s per disk. For instance, if two primary and two standby DNs are deployed on an ECS, at least four 500 GB disks must be mounted to the ECS.
- Cost-sensitive customers: Mount at least 200 GB (160 MB/s per disk) to each primary/standby DN.
- OBS configuration
OBS must be deployed in three AZs and support parallel file systems.
OBS performance
If you need to adjust the OBS performance specifications, contact technical support. In public cloud scenarios, OBS metrics do not need to be adjusted for a cluster with six or fewer DNs.
- CPU configuration
In a production environment, it is advised that each node has a minimum of 16 vCPUs. Configurations of 4 vCPUs or 8 vCPUs are typically used only for experimental or testing purposes.
- Migration scenario: Keep the configuration consistent with that of the objects being migrated.
- New deployment scenario: Calculate the number of required CPU cores by dividing the total data volume by 100 GB; for example, 10 TB of data corresponds to roughly 100 vCPUs in total. The number of CPU cores required can vary with the specific use case; if the computational load is high, increase the number of CPU cores accordingly.
Table Design Optimization
Table creation statements
By default, DWS creates row-store tables. In OLAP analysis scenarios, you need to explicitly set ORIENTATION to column-store when creating tables.
For details, see CREATE TABLE in SQL Syntax Reference.
CREATE TABLE public.t1(id integer) WITH (ORIENTATION = COLUMN);
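To confirm the storage mode of an existing table, you can check its storage parameters in the system catalog. This is a minimal sketch that relies on the PostgreSQL-compatible pg_class and pg_namespace catalogs; public.t1 is the example table created above:

```sql
-- Show the storage parameters (including orientation) of the example table.
SELECT c.relname, c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relname = 't1';
```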
Table design optimization
In a storage-compute decoupled architecture, data is stored in the Object Storage Service (OBS). To optimize performance, it is crucial to use filtering methods that avoid unnecessary reads of remote compression unit (CU) data.
Filtering methods
GaussDB(DWS) is compatible with the PostgreSQL ecosystem. It uses both row storage with B-tree indexes, similar to PostgreSQL, and self-developed column storage with its own indexing system. When creating a table, select an appropriate storage mode, distribution column, partition key, and index so that data can be accessed quickly during SQL execution, reducing I/O consumption. The following describes the process from initiating an SQL statement to obtaining data, so that you can understand the role of each technique in performance optimization. A sample table definition follows the list.
- When an SQL statement is executed on a partitioned table, the partition column is used to prune partitions and pinpoint the target partition.
- The distribution column is used by the distributed hash table to quickly identify the data shard where the data resides. In the storage-compute coupled architecture, a data shard corresponds to a DN; in the storage-compute decoupled architecture, it corresponds to a bucket.
- In row-store mode, the B-tree index is used to quickly locate the data page. In column-store mode, the min-max index is used to quickly locate the CU data blocks that may contain matching data. This index is particularly effective when filtering on the partial cluster key (PCK) column.
- The system automatically maintains the min-max index for all columns in column-store mode, so there is no need to define it manually. The min-max index is a coarse filter: CU data blocks that pass the min-max check may still contain no rows that satisfy the filter condition. If a bitmap column is defined, the bitmap index can be used to quickly locate, within the CU, the row numbers of the data that satisfies the filter condition. For ordered CUs, binary search can also be used to quickly find the row numbers.
- Column storage also supports B-tree and GIN indexes, which can quickly locate the CU and row number of data that meets the conditions. However, these indexes are costly to maintain. Unless point queries have high performance requirements, use bitmap indexes instead of B-tree or GIN indexes.
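The sketch below ties these filtering methods to a concrete table definition. It is illustrative only: the table name, column names, and partition bounds (sales, sale_date, order_id, and so on) are assumptions, and PARTIAL CLUSTER KEY corresponds to the PCK mentioned above.

```sql
-- Illustrative column-store table that enables the filtering methods above:
-- partition pruning (sale_date), hash distribution (order_id),
-- and min-max filtering accelerated by a partial cluster key (sale_date).
CREATE TABLE public.sales
(
    order_id   bigint,
    region_id  integer,
    sale_date  date,
    amount     numeric(18,2),
    PARTIAL CLUSTER KEY (sale_date)
)
WITH (ORIENTATION = COLUMN)
DISTRIBUTE BY HASH (order_id)
PARTITION BY RANGE (sale_date)
(
    PARTITION p202406 VALUES LESS THAN ('2024-07-01'),
    PARTITION p202407 VALUES LESS THAN ('2024-08-01')
);

-- A query that filters on the partition column (which is also the PCK here)
-- only scans one partition and the CUs whose min-max range matches the predicate.
SELECT region_id, sum(amount)
FROM public.sales
WHERE sale_date >= '2024-07-01' AND sale_date < '2024-07-08'
GROUP BY region_id;
```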
Optimization methods
The following uses a table creation statement to describe the existing optimization methods of GaussDB(DWS). For details, see CREATE TABLE in SQL Syntax Reference.
| No. | Optimization Item | Suggestion | Example SQL | Modifiable After Creation |
|---|---|---|---|---|
| 1 | String type | - | - | Yes. Modification rewrites existing data. |
| 2 | Numeric type | Specify precision for the numeric type, which doubles the performance. Do not use numeric types without precision. | - | Yes. Modification rewrites existing data. |
| 3 | Partition by Column | - | - | No. If modification is needed, create a table again. |
| 4 | secondary_part_column | - | - | No. If modification is needed, create a table again. |
| 5 | Distribute by Column | Define this field. It is applicable to the join field frequently used for GROUP BY or multi-table join operations. Local joins reduce data shuffling and are suitable for equivalent queries. | - | No. If modification is needed, create a table again. |
| 6 | Bitmap_columns | Create an adaptive bitmap index (for cardinality ≤ 32) or bloom filter (for cardinality > 32) based on repeated values in the CU. This is applicable to equivalent query scenarios for VARCHAR or TEXT columns. It is recommended to define the columns involved in the WHERE condition. | - | Yes. Modification does not rewrite existing data. Only the new data is affected. |
| 7 | Min-max index | - | - | The PCK can be modified. Modification does not rewrite existing data. Only the new data is affected. |
| 8 | Primary key (btree index) | - | - | Yes. After modification, the index will be recreated. |
|   | GIN index | - | - | Yes. After modification, the index will be recreated. |
| 9 | Orientation=column/row | Specify whether a table is stored in row or column mode. Row-store tables are uncompressed, ideal for point queries and frequent updates. Column-store tables are compressed and are best suited for analysis scenarios. | - | No. If modification is needed, create a table again. |
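A minimal sketch of two of the suggestions above (item 2: numeric columns with explicit precision; item 8: a primary key backed by a B-tree index). Table and column names are assumptions, and the example assumes your version supports primary keys on column-store tables, as item 8 suggests:

```sql
-- Item 2: declare numeric columns with explicit precision instead of bare numeric.
-- Item 8: a primary key creates a B-tree index that accelerates point queries.
CREATE TABLE public.orders
(
    order_id bigint PRIMARY KEY,
    amount   numeric(18,2)    -- preferred over "numeric" without precision
)
WITH (ORIENTATION = COLUMN)
DISTRIBUTE BY HASH (order_id);
```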
Disk Cache
DWS caches frequently accessed data on local EVS disks to minimize direct reads from OBS and enhance data query performance. The disk cache is exclusive to the DN compute nodes and is not present on the CNs.
Cache size
The default cache size (disk_cache_max_size) of the cluster is half of the EVS capacity.
The EVS capacity is split into two parts: half stores local persistent data, such as column-store indexes, row-store tables, and local column-store tables, and the other half is reserved for the cache. GaussDB(DWS) indexes differ from Redshift indexes: Redshift indexes serve only as optimizer hints and do not contain actual index data, whereas GaussDB(DWS) indexes resemble Oracle indexes and store index data.
If no index is created for a column-store table, increase the cache size by adjusting the value of disk_cache_max_size on the GaussDB(DWS) management console.
Cache status
When a user queries data, the system first checks if the data exists in the local disk cache. If the data is not present, the system reads it from OBS and caches it to the local disk for future access. This caching mechanism can significantly enhance the speed of OBS data queries.
By default, the disk cache uses two disks in active/standby mode as the cache media. You can query the following parameters to view related information:
- Use the disk_cache_base_paths parameter to view, add, or delete cache disk paths.
- Use the disk_cache_max_size parameter to view and adjust the disk cache size.
- Use the pgxc_disk_cache_all_stats view to check the current cache hit ratio and the disk usage of each DN, as shown in the example below.
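For example, the following statements show the current cache configuration and usage. The parameter and view names are those listed above; SHOW is the standard way to display a parameter value:

```sql
-- Current cache disk paths and cache size limit.
SHOW disk_cache_base_paths;
SHOW disk_cache_max_size;

-- Cache hit ratio and disk usage of each DN.
SELECT * FROM pgxc_disk_cache_all_stats;
```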
Cache dual-write
Enabling cache dual-write can improve the performance of the first query on newly written data: when data is written to the remote OBS, it is also written to the local disk cache, which significantly improves read efficiency during the initial data access. Use the disk_cache_dual_write_option parameter to configure whether to enable cache dual-write (see the example after the option list). The options are as follows:
- none: Disable cache dual-write.
- hstore_only (default value): Enable cache dual-write only for the Hstore opt table during delta merge.
- all: Enable cache dual-write for both common v3 tables and hstore opt tables.
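A hedged sketch of switching the option. Whether disk_cache_dual_write_option can be changed with SET at the session level, rather than only on the management console, depends on the parameter's scope in your version, so treat this as illustrative:

```sql
-- Enable cache dual-write for both common v3 tables and hstore opt tables
-- (assuming the parameter is settable at the session level).
SET disk_cache_dual_write_option = 'all';
```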
Clearing the Cache
Use the pgxc_clear_disk_cache() function to clear all disk caches.
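For example (pgxc_clear_disk_cache() is the function named above; after the cache is cleared, subsequent queries read from OBS again until the cache is repopulated):

```sql
-- Clear all data currently held in the disk cache.
SELECT pgxc_clear_disk_cache();
```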
Insufficient cluster space and disk cache space adjustment
To address disk space shortages in a cluster, consider reducing the disk cache space, especially for clusters that have already consumed a large amount of disk cache. Reduce the disk_cache_max_size parameter to shrink the actual disk cache and free up space.
For example, assume the total disk capacity is 1,000 GB, disk_cache_max_size is set to 500 GB, and the actual cache usage queried from the pgxc_disk_cache_all_stats view is 450 GB. If the total disk space usage reaches 900 GB, the ThresholdReadRisk issue is triggered, indicating insufficient remaining resources. If there are no column-store 2.0 tables or index resources that can be cleared, change disk_cache_max_size to 300 GB or less to alleviate the space shortage. Note that reducing the available disk cache may degrade query performance.
Disk usage alarms are determined as follows:
- Capacity warning: Disk space usage or file descriptor usage exceeds the ThresholdReadOnly value (80% by default). The log will contain "Disk usage on the node %u has reached the risky threshold 80%."
- Insufficient capacity: Disk space usage or file descriptor usage exceeds the ThresholdReadRisk value (90% by default), making the cluster read-only. The log will display "Disk usage on the node %u has reached the read-only threshold 90%."
- Severely insufficient capacity: Disk space usage or file descriptor usage exceeds the ThresholdReadDanger value (95% by default), causing the standby and secondary data nodes (DNs) to terminate and the primary DN to restart. The log will display "Disk usage on the node %u has reached the dangerous threshold 95%."
Insertion Performance
Bucket storage
Bucket storage is a method of data sharding that, similar to partitioning technology, groups data with the same attribute values together. This approach facilitates the adjustment of the mapping between storage and computing, enabling a separation of storage and computing resources, and allowing for elastic scaling and on-demand allocation of computing resources.
For instance, if there are eight buckets and two DNs, each DN would be responsible for four buckets. Conversely, if there are four DNs, each DN would be responsible for two buckets.
Import optimization
Data needs to be saved to the database in batches, with asynchronous I/O.
Batching: This technique avoids small compression units (CUs) and improves subsequent query performance.
Asynchronous I/O: Following the decoupling of storage and compute, the latency for writing data to OBS is approximately ten times higher than writing data to EVS. Asynchronous I/O optimizes read and write performance.
- For partitioned tables, 2.0 tables require only partition batching. 3.0 tables, in comparison, require bucket batching (equivalent to level-2 partitions), which may consume more memory and disk space.
- Only hash-distributed tables necessitate bucket batching.
Batching overhead and suggestions
Overhead
- Number of partitions: #Np
- Number of buckets on each node: #Nb
- RowGroup size before compression: #Nr
- Maximum size of a single bucket: #Mb = max(partition_max_cache_size/partition_mem_batch, 16 MB) = 16 MB (default configuration)
- Single-concurrency batching consumption: #Np * #Nb * #Nr
- Single-concurrency batching memory consumption: partition_max_cache_size. The default value is 2 GB.
- Single-concurrency batching disk consumption: #Np * #Nb * #Nr * 1.2 (bloat rate)
- Memory consumption example: Assume data is imported with a single COPY, 1,000 partitions are involved, #Nb ≈ 10, the size of a single record is 1 KB, and each batch contains 10,000 rows. Single-concurrency batching consumption: 1000 * 10 * 1 KB * 10000 * 1.2 = 120 GB
Suggestions
- Application layer optimization: The key factor is the number of partitions involved in a single load. It is recommended to import data into a single partition at a time. In the example above, this reduces the space consumed by one concurrent batching operation from 120 GB to 120 MB, so batching can be done entirely in memory.
- Database kernel optimization: Modify the min_batch_rows parameter to adjust the batch size. Execute the SET statement to apply the change to the current session, or modify the configuration to make it take effect immediately (see the sketch after this list).
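A hedged sketch of both suggestions, reusing the illustrative sales table from the table design section. The staging table, partition bounds, and the value chosen for min_batch_rows are assumptions, not recommendations:

```sql
-- Kernel side: adjust the batch size for the current session before loading
-- (assuming min_batch_rows is settable at the session level).
SET min_batch_rows = 60000;

-- Application layer: load one partition's worth of data per statement so that
-- batching only needs to buffer a single partition.
INSERT INTO public.sales
SELECT * FROM staging.sales_raw
WHERE sale_date >= '2024-07-01' AND sale_date < '2024-08-01';
```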