Table Design Specifications
- [Rule] View nesting is not recommended.
- [Recommendation] Try to avoid collation operations when defining a view.
ORDER BY inside a view definition does not take effect. If you need to sort the output data, apply ORDER BY in the query that calls the view.
- [Description] Ensure that queries scan each DN evenly. Otherwise, the DNs scanned most frequently become the performance bottleneck. For example, an equality filter condition on the distribution key of a fact table concentrates the scan on a single DN, so the nodes are not evenly scanned.
- [Recommendation] Minimize random I/Os. Through clustering, you can sequentially store hot data, converting random I/O to sequential I/O to reduce the cost of I/O scanning.
- [Recommendation] Try to avoid data shuffling. To shuffle data is to physically transfer it from one node to another. This unnecessarily occupies many network resources. To reduce network pressure, locally process data, and improve cluster performance and concurrency, you can minimize data shuffling by using proper join and grouping conditions.
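As a sketch of the join advice above (table and column names are illustrative, not from this guide): if two tables are hash-distributed on the same key and joined on that key, each DN can join its local data slices and no rows need to be shuffled for the join.

```sql
-- Illustrative sketch: co-located join on a shared distribution key.
-- Both hypothetical tables are hash-distributed on cust_id.
CREATE TABLE orders_s (order_id INT, cust_id INT, amount DECIMAL(10,2))
    DISTRIBUTE BY HASH (cust_id);
CREATE TABLE customers_s (cust_id INT, region VARCHAR(30))
    DISTRIBUTE BY HASH (cust_id);

-- The join condition matches the shared distribution key, so each DN
-- joins its own slices locally instead of shuffling rows to other DNs.
SELECT c.region, SUM(o.amount)
FROM   orders_s o
JOIN   customers_s c ON o.cust_id = c.cust_id
GROUP BY c.region;
```

Note that the final aggregation by region may still exchange partial results between nodes; it is the join itself that stays local.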
Selecting a Storage Model
[Recommendation] Selecting a storage model is the first step in defining a table. The storage model mainly depends on the customer's service type. For details, see Table 1.
Selecting a Distribution Mode
Distribution Mode | Description | Application Scenario
---|---|---
Hash | Table data is distributed across all DNs in the cluster by the hash value of the distribution key. | Fact tables containing a large amount of data.
Replication | The full table data is stored on every DN in the cluster. | Dimension tables and fact tables containing a small amount of data.
Range | Table data is mapped to DNs based on which range the values of the specified columns fall into. | Scenarios where users need to customize distribution rules.
List | Table data is mapped to DNs based on specific values of the specified columns. | Scenarios where users need to customize distribution rules.

When hash, range, or list distribution is specified, any primary key or unique index created on the table must contain the distribution key.
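For example (with an illustrative table), a primary key on a hash-distributed table must include the distribution key:

```sql
-- Illustrative: valid, because the primary key (a, b) contains the
-- distribution key a.
CREATE TABLE t_pk_ok (a INT, b INT, c INT, PRIMARY KEY (a, b))
    DISTRIBUTE BY HASH (a);

-- Invalid: the primary key (b) does not contain the distribution key a,
-- so a statement like the following is rejected.
-- CREATE TABLE t_pk_bad (a INT, b INT, PRIMARY KEY (b))
--     DISTRIBUTE BY HASH (a);
```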
- [Description] Evenly distribute table data on each DN to prevent data skew. If most data is stored on several DNs, the effective capacity of a cluster decreases. Select a proper distribution key to avoid data skew.
- [Rule] DISTRIBUTE BY must be specified and the table distribution policy must comply with the following principles.
GaussDB currently offers four table distribution policies: replication, hash, range, and list. Replication keeps a complete copy of the table on each node. Hash distributes table rows across nodes based on the hash value of the distribution key. Range and list map rows to target nodes according to whether the distribution key value falls within a given range or matches given values.
- If the number of data records in a system configuration table or data dictionary table is less than 20 million and the insert and update operations are of low frequency, the replication distribution policy is recommended.
CREATE TABLE t1 (contentId INT) DISTRIBUTE BY REPLICATION;
Exercise caution when using replication distribution: replicated tables can cause space expansion and DML performance deterioration.
- For tables that do not meet the preceding requirement, that is, tables with large data volumes and high update frequency, data must be sharded and the hash distribution policy must be used. The distribution key must be one or more columns in the primary key.
CREATE TABLE t1 (contentId INT) DISTRIBUTE BY HASH (contentId);
- Users can use the range and list distribution policies to customize distribution rules.
CREATE TABLE t1 (contentId INT) DISTRIBUTE BY RANGE (contentId)
(
    SLICE s1 VALUES LESS THAN (10) DATANODE dn1,
    SLICE s2 VALUES LESS THAN (20) DATANODE dn2,
    SLICE s3 VALUES LESS THAN (30) DATANODE dn3,
    SLICE s4 VALUES LESS THAN (MAXVALUE) DATANODE dn4
);
CREATE TABLE t1 (contentId INT) DISTRIBUTE BY LIST (contentId)
(
    SLICE s1 VALUES (10) DATANODE dn1,
    SLICE s2 VALUES (20) DATANODE dn2,
    SLICE s3 VALUES (30) DATANODE dn3,
    SLICE s4 VALUES (DEFAULT) DATANODE dn4
);
- Tables with fewer than 20 million records but frequent multi-row inserts and updates and frequent range queries would ideally not be sharded. However, unsharded tables of this kind are not supported in the current version (support is planned for the next version), so the hash distribution policy is used for them.
-- Define a replicated table: every row is stored on all DNs.
CREATE TABLE warehouse_d1
(
    W_WAREHOUSE_SK    INTEGER      NOT NULL,
    W_WAREHOUSE_ID    CHAR(16)     NOT NULL,
    W_WAREHOUSE_NAME  VARCHAR(20),
    W_WAREHOUSE_SQ_FT INTEGER,
    W_STREET_NUMBER   CHAR(10),
    W_STREET_NAME     VARCHAR(60),
    W_STREET_TYPE     CHAR(15),
    W_SUITE_NUMBER    CHAR(10),
    W_CITY            VARCHAR(60),
    W_COUNTY          VARCHAR(30),
    W_STATE           CHAR(2),
    W_ZIP             CHAR(10),
    W_COUNTRY         VARCHAR(20),
    W_GMT_OFFSET      DECIMAL(5,2)
) DISTRIBUTE BY REPLICATION;

-- Define a hash-distributed table.
CREATE TABLE warehouse_d2
(
    W_WAREHOUSE_SK    INTEGER      NOT NULL,
    W_WAREHOUSE_ID    CHAR(16)     NOT NULL,
    W_WAREHOUSE_NAME  VARCHAR(20),
    W_WAREHOUSE_SQ_FT INTEGER,
    W_STREET_NUMBER   CHAR(10),
    W_STREET_NAME     VARCHAR(60),
    W_STREET_TYPE     CHAR(15),
    W_SUITE_NUMBER    CHAR(10),
    W_CITY            VARCHAR(60),
    W_COUNTY          VARCHAR(30),
    W_STATE           CHAR(2),
    W_ZIP             CHAR(10),
    W_COUNTRY         VARCHAR(20),
    W_GMT_OFFSET      DECIMAL(5,2),
    CONSTRAINT W_CONSTR_KEY3 UNIQUE (W_WAREHOUSE_SK)
) DISTRIBUTE BY HASH (W_WAREHOUSE_SK);

-- Define a table using range distribution.
CREATE TABLE warehouse_d3
(
    W_WAREHOUSE_SK    INTEGER      NOT NULL,
    W_WAREHOUSE_ID    CHAR(16)     NOT NULL,
    W_WAREHOUSE_NAME  VARCHAR(20),
    W_WAREHOUSE_SQ_FT INTEGER,
    W_STREET_NUMBER   CHAR(10),
    W_STREET_NAME     VARCHAR(60),
    W_STREET_TYPE     CHAR(15),
    W_SUITE_NUMBER    CHAR(10),
    W_CITY            VARCHAR(60),
    W_COUNTY          VARCHAR(30),
    W_STATE           CHAR(2),
    W_ZIP             CHAR(10),
    W_COUNTRY         VARCHAR(20),
    W_GMT_OFFSET      DECIMAL(5,2)
) DISTRIBUTE BY RANGE (W_WAREHOUSE_SK)
(
    SLICE s1 VALUES LESS THAN (10) DATANODE dn1,
    SLICE s2 VALUES LESS THAN (20) DATANODE dn2,
    SLICE s3 VALUES LESS THAN (30) DATANODE dn3,
    SLICE s4 VALUES LESS THAN (MAXVALUE) DATANODE dn4
);

-- Define a table using list distribution.
CREATE TABLE warehouse_d4
(
    W_WAREHOUSE_SK    INTEGER      NOT NULL,
    W_WAREHOUSE_ID    CHAR(16)     NOT NULL,
    W_WAREHOUSE_NAME  VARCHAR(20),
    W_WAREHOUSE_SQ_FT INTEGER,
    W_STREET_NUMBER   CHAR(10),
    W_STREET_NAME     VARCHAR(60),
    W_STREET_TYPE     CHAR(15),
    W_SUITE_NUMBER    CHAR(10),
    W_CITY            VARCHAR(60),
    W_COUNTY          VARCHAR(30),
    W_STATE           CHAR(2),
    W_ZIP             CHAR(10),
    W_COUNTRY         VARCHAR(20),
    W_GMT_OFFSET      DECIMAL(5,2)
) DISTRIBUTE BY LIST (W_COUNTRY)
(
    SLICE s1 VALUES ('USA') DATANODE dn1,
    SLICE s2 VALUES ('CANADA') DATANODE dn2,
    SLICE s3 VALUES ('UK') DATANODE dn3,
    SLICE s4 VALUES (DEFAULT) DATANODE dn4
);
For details about the table distribution syntax, see CREATE TABLE.
Selecting a Distribution Key
A distribution key is important for a distributed table. An improper distribution key may cause data skew. As a result, the I/O load is heavy on several DNs, affecting the overall query performance. Therefore, after determining the distribution policy of a distributed table, you need to check the table data skew to ensure that data is evenly distributed. Comply with the following rules to select a distribution key:
- [Recommendation] Select a column containing discrete data as the distribution key, so that data can be evenly distributed on each DN. If the data in a single column is not discrete enough, consider using multiple columns as distribution keys. You can select the primary key of a table as the distribution key. For example, in an employee information table, select the certificate number column as the distribution key.
- [Recommendation] If the first rule is met, do not select a column having constant filter conditions as the distribution key. For example, in a query on the dwcjk table, if the zqdh column contains the constant filter condition zqdh='000001', avoid selecting the zqdh column as the distribution key.
- [Recommendation] If the first and second rules are met, select the join conditions in a query as distribution keys. If a join condition is used as a distribution key, the data involved in a join task is locally distributed on DNs, which greatly reduces the data flow cost among DNs.
- [Recommendation] It is recommended that the number of distribution key columns be less than or equal to 3. Too many columns will cause high computing overhead.
- [Rule] Properly design distribution keys to facilitate query development and ensure even data storage, avoiding data skew and read hotspots.
- Use columns with discrete values as distribution keys so that data can be evenly distributed to each DN.
- If the preceding requirement is met, avoid using columns with constant filter conditions as distribution keys. Otherwise, all query tasks will be pushed to a single, fixed DN.
- If the preceding two requirements are met, select the join condition as the distribution key. In this way, related data of the join task is distributed on the same DN, reducing the cost of data flow between DNs.
- According to the database specifications, the primary key of the hash distributed table must contain the distribution key. Therefore, you can select the primary key of the table as the distribution key.
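When no single column is discrete enough, the rules above allow combining columns into a multi-column distribution key. A sketch with hypothetical table and column names:

```sql
-- Illustrative: neither city nor register_date alone is discrete enough,
-- but their combination spreads rows more evenly across DNs.
CREATE TABLE user_info_s
(
    city          VARCHAR(30),
    register_date DATE,
    user_name     VARCHAR(60)
) DISTRIBUTE BY HASH (city, register_date);
```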
Table 3 Common distribution keys and their effects

Distribution Key Value | Distribution Effect
---|---
User ID. The application has many users. | Good
Status code. Only a few status codes exist. | Poor
Date when a project is created, rounded to the nearest time unit (for example, day, hour, or minute). | Poor
Device ID. Each device accesses data at relatively similar intervals. | Good
Device ID. Many devices are traced, but one device is used far more than the others. | Poor
- [Rule] The length of the column used by the distribution key cannot exceed 128 characters. If the length is too long, the computing overhead is high.
- [Rule] The value of a distribution key column cannot be updated after the row is inserted. To change it, delete the row and insert it again.
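After loading data, you can check whether the chosen key distributes rows evenly. One common approach in Postgres-XC-derived systems such as GaussDB is to group by the hidden xc_node_id column (availability of this column, and of helper functions such as table_skewness in some GaussDB editions, may vary by version; the table name below is illustrative):

```sql
-- Count the rows that each DN holds for a table.
-- Large differences between the counts indicate data skew.
SELECT xc_node_id, count(1)
FROM   t1
GROUP BY xc_node_id
ORDER BY count(1) DESC;
```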
Selecting a Partitioning Mode
Comply with the following rules to partition a table containing a large amount of data:
- [Description] Reduce the amount of data to be scanned. You can use the pruning mechanism of a partitioned table.
- [Recommendation] Create partitions on columns that indicate certain ranges, such as dates and regions.
- [Recommendation] A partition name should show the data characteristics of a partition. For example, its format can be Keyword+Range characteristics.
- [Recommendation] Set the upper limit of a partition to MAXVALUE to prevent data overflow.
The example of defining a partitioned table is as follows:
CREATE TABLE staffS_p1
(
    staff_ID       NUMBER(6) NOT NULL,
    FIRST_NAME     VARCHAR2(20),
    LAST_NAME      VARCHAR2(25),
    EMAIL          VARCHAR2(25),
    PHONE_NUMBER   VARCHAR2(20),
    HIRE_DATE      DATE,
    employment_ID  VARCHAR2(10),
    SALARY         NUMBER(8,2),
    COMMISSION_PCT NUMBER(4,2),
    MANAGER_ID     NUMBER(6),
    section_ID     NUMBER(4)
)
PARTITION BY RANGE (HIRE_DATE)
(
    PARTITION HIRE_19950501 VALUES LESS THAN ('1995-05-01 00:00:00'),
    PARTITION HIRE_19950502 VALUES LESS THAN ('1995-05-02 00:00:00'),
    PARTITION HIRE_maxvalue VALUES LESS THAN (MAXVALUE)
);
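A query that filters on the partition key lets the optimizer prune partitions. For the table above, a sketch (the exact EXPLAIN output shape varies by version):

```sql
-- Only partition HIRE_19950502 needs to be scanned for this predicate;
-- the other partitions are pruned. EXPLAIN shows the partitions selected.
EXPLAIN
SELECT staff_ID, FIRST_NAME, LAST_NAME
FROM   staffS_p1
WHERE  HIRE_DATE >= '1995-05-01 00:00:00'
  AND  HIRE_DATE <  '1995-05-02 00:00:00';
```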