Updated on 2023-12-28 GMT+08:00

Table Structure Design

Before you optimize a table, you need to understand the structure of the table. During database design, some key factors about table design will greatly affect the subsequent query performance of the database. Table design affects data storage as well. Scientific table design reduces I/O operations and minimizes memory usage, improving the query performance.

This section describes how to optimize table performance in GaussDB(DWS) by properly designing the table structure (for example, by configuring the table storage mode, compression level, distribution mode, distribution column, partitioned tables, and local clustering).

Selecting a Storage Mode

Selecting a model for table storage is the first step of table definition. Select a proper storage model for your service based on the table below.

Generally, if a table contains many columns (called a wide table) and its query involves only a few columns, column storage is recommended. If a table contains only a few columns and a query involves most of the columns, row storage is recommended.

Storage Model

Application Scenario

Row storage

Point query (simple index–based query that returns only a few records).

Query involving many INSERT, UPDATE, and DELETE operations.

Column storage

Statistics analysis query, in which operations, such as group and join, are performed many times.

The row/column storage of a table is specified by the orientation attribute in the table definition. The value row indicates a row-store table and column indicates a column-store table. The default value is row.

Table Compression

Table compression can be enabled when a table is created. Table compression enables data in the table to be stored in compressed format to reduce memory usage.

In scenarios where I/O is large (much data is read and written) and CPU is sufficient (little data is computed), select a high compression ratio. In scenarios where I/O is small and CPU is insufficient, select a low compression ratio. Based on this principle, you are advised to select different compression ratios and test and compare the results to select the optimal compression ratio as required. Specify a compressions ratio using the COMPRESSION parameter. The supported values are as follows:

  • The valid value of column-store tables is YES, NO, LOW, MIDDLE, or HIGH, and the default value is LOW.
  • The valid values of row-store tables are YES and NO, and the default is NO. (The row-store table compression function is not put into commercial use. To use this function, contact technical support.)

The service scenarios applicable to each compression level are described in the following table.

Compression Level

Application Scenario

LOW

The system CPU usage is high and the disk storage space is sufficient.

MIDDLE

The system CPU usage is moderate and the disk storage space is insufficient.

HIGH

The system CPU usage is low and the disk storage space is insufficient.

Selecting a Distribution Mode

GaussDB(DWS) supports the following distribution modes: replication, hash, and Round-robin.

Round-robin is supported in cluster 8.1.2 and later.

Policy

Description

Application Scenario

Advantages/disadvantages

Replication

Full data in a table is stored on each DN in the cluster.

Small tables and dimension tables

  • The advantage of replication is that each DN has full data of the table. During the join operation, data does not need to be redistributed, reducing network overheads and reducing plan segments (each plan segment starts a corresponding thread).
  • The disadvantage of replication is that each DN retains the complete data of the table, resulting in data redundancy. Generally, replication is only used for small dimension tables.

Hash

Table data is distributed on all DNs in the cluster.

Fact tables containing a large amount of data

  • The I/O resources of each node can be used during data read/write, greatly improving the read/write speed of a table.
  • Generally, a large table (containing over 1 million records) is defined as a hash table.

Polling (Round-robin)

Each row in the table is sent to each DN in turn. Data can be evenly distributed on each DN.

Fact tables that contain a large amount of data and cannot find a proper distribution key in hash mode

  • Round-robin can avoid data skew, improving the space utilization of the cluster.
  • Round-robin does not support local DN optimization like a hash table does, and the query performance of Round-robin is usually lower than that of a hash table.
  • If a proper distribution key can be found for a large table, use the hash distribution mode with better performance. Otherwise, define the table as a round-robin table.

Selecting a Distribution Key

If the hash distribution mode is used, a distribution key must be specified for the user table. If a record is inserted, the system performs hash computing based on values in the distribute column and then stores data on the related DN.

Select a hash distribution key based on the following principles:

  1. The values of the distribution key should be discrete so that data can be evenly distributed on each DN. You can select the primary key of the table as the distribution key. For example, for a person information table, choose the ID number column as the distribution key.
  2. Do not select the column where a constant filter exists. For example, if a constant constraint (for example, zqdh= '000001') exists on the zqdh column in some queries on the dwcjk table, you are not advised to use zqdh as the distribution key.
  3. With the above principles met, you can select join conditions as distribution keys, so that join tasks can be pushed down to DNs for execution, reducing the amount of data transferred between the DNs.

    For a hash table, an improper distribution key may cause data skew or poor I/O performance on certain DNs. Therefore, you need to check the table to ensure that data is evenly distributed on each DN. You can run the following SQL statements to check for data skew:

    1
    2
    3
    4
    5
    select 
    xc_node_id, count(1) 
    from tablename 
    group by xc_node_id 
    order by xc_node_id desc;
    

    xc_node_id corresponds to a DN. Generally, over 5% difference between the amount of data on different DNs is regarded as data skew. If the difference is over 10%, choose another distribution key.

  4. You are not advised to add a column as a distribution key, especially add a new column and use the SEQUENCE value to fill the column. (Sequences may cause performance bottlenecks and unnecessary maintenance costs.)

Using Partitioned Tables

Partitioning refers to splitting what is logically one large table into smaller physical pieces based on specific schemes. The table based on the logic is called a partitioned table, and a physical piece is called a partition. Data is stored on these smaller physical pieces, namely, partitions, instead of the larger logical partitioned table. A partitioned table has the following advantages over an ordinary table:

  1. High query performance: The system queries only the concerned partitions rather than the whole table, improving the query efficiency.
  2. High availability: If a partition is faulty, data in the other partitions is still available.
  3. Easy maintenance: You only need to fix the faulty partition.

The partitioned tables supported by GaussDB(DWS) include range partitioned tables and list partitioned tables. (List partitioned tables are supported only in cluster 8.1.3).

Using Partial Clustering

Partial Cluster Key is the column-based technology. It can minimize or maximize sparse indexes to quickly filter base tables. Partial cluster key can specify multiple columns, but you are advised to specify no more than two columns. Use the following principles to specify columns:

  1. The selected columns must be restricted by simple expressions in base tables. Such constraints are usually represented by Col, Op, and Const. Col specifies the column name, Op specifies operators, (including =, >, >=, <=, and <) Const specifies constants.
  2. Select columns that are frequently selected (to filter much more undesired data) in simple expressions.
  3. List the less frequently selected columns on the top.
  4. List the columns of the enumerated type at the top.

Selecting a Data type

You can use data types with the following features to improve efficiency:

  1. Data types that boost execution efficiency

    Generally, the calculation of integers (including common comparison calculations, such as =, >, <, ≥, ≤, and ≠ and GROUP BY) is more efficient than that of strings and floating point numbers. For example, if you need to perform a point query on a column-store table whose NUMERIC column is used as a filter criterion, the query will take over 10 seconds. If you change the data type from NUMERIC to INT, the query takes only about 1.8 seconds.

  2. Selecting data types with a short length

    Data types with short length reduce both the data file size and the memory used for computing, improving the I/O and computing performance. For example, use SMALLINT instead of INT, and INT instead of BIGINT.

  3. Same data type for a join

    You are advised to use the same data type for a join. To join columns with different data types, the database needs to convert them to the same type, which leads to additional performance overheads.