Procedure
Table design involves distribution mode and key design, data type design, partitioning policies, constraint configuration, index design, and storage parameter optimization.
Distribution Mode and Key Design
- Adhere to the following principles when selecting a table distribution mode.
| Distribution Mode | Description | Applicable Scenario |
| --- | --- | --- |
| Hash | Table data is distributed across all DNs in the cluster by hash. | Tables with a large data size |
| Replication | Each DN in the cluster holds a complete set of table data. | Dimension tables and tables with a small data size |
| Range | Table data is distributed to related DNs by mapping the specified columns based on a specific range. | Custom distribution rules |
| List | Table data is distributed to related DNs by mapping the specified columns based on specific values. | Custom distribution rules |
- Distribution key selection
When working with a hash table, selecting the right distribution key is essential. An inappropriate distribution key can cause data skew, placing heavy I/O load on specific DNs and degrading overall query performance. Therefore, after determining the distribution policy of a distributed table, check the table for data skew to confirm that data is evenly distributed (a sample skew check follows the syntax example below). When selecting a distribution key, adhere to the following principles:
- Use a column with discrete data as the distribution key to evenly distribute data across all DNs. If a column lacks sufficient discreteness, consider using multiple columns as distribution keys. The primary key of a table can also serve as a distribution key, such as the ID number column in an employee information table.
- When the first principle is satisfied, avoid selecting a column with constant filter conditions as the distribution key.
- When both the first and second principles are satisfied, use the join conditions in queries as distribution keys. This will result in data from join tasks being distributed on the local DN, significantly reducing data flow costs among DNs.
Below is a simple example that shows the syntax for specifying a distribution mode and key.
- Log in to the database as the root user.
- Create a table and select a distribution key and mode.
-- Replication distribution
gaussdb=# CREATE TABLE tb_t1(c1 int, c2 int) DISTRIBUTE BY REPLICATION;
-- Hash distribution
gaussdb=# CREATE TABLE tb_t2(c1 int, c2 int) DISTRIBUTE BY HASH(c1);
-- Range distribution
gaussdb=# CREATE TABLE tb_t3(c1 int, c2 int)
DISTRIBUTE BY RANGE(c1)
(
    SLICE s1 VALUES LESS THAN (100),
    SLICE s2 VALUES LESS THAN (200),
    SLICE s3 VALUES LESS THAN (MAXVALUE)
);
gaussdb=# CREATE TABLE tb_t4(c1 int, c2 int)
DISTRIBUTE BY RANGE(c1)
(
    SLICE s1 START (1) END (100),
    SLICE s2 START (100) END (200),
    SLICE s3 START (200) END (MAXVALUE)
);
-- List distribution
gaussdb=# CREATE TABLE tb_t5(id INT, name VARCHAR(20), country VARCHAR(30))
DISTRIBUTE BY LIST(country)
(
    SLICE s1 VALUES ('China'),
    SLICE s2 VALUES ('USA'),
    SLICE s3 VALUES (DEFAULT)
);
-- Drop the created table objects.
gaussdb=# DROP TABLE tb_t1,tb_t2,tb_t3,tb_t4,tb_t5;
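A quick way to gauge whether a candidate distribution key is discrete enough is to count rows per key value with ordinary SQL. The sketch below is illustrative only: the table tb_check and its columns are hypothetical, and some GaussDB versions additionally provide dedicated skew-check views or a hidden xc_node_id column for per-DN row counts.
-- Hypothetical hash-distributed table used only for this check.
gaussdb=# CREATE TABLE tb_check(id int, val int) DISTRIBUTE BY HASH(id);
-- Count rows per candidate key value: if a handful of values hold most of the rows,
-- the key is not discrete enough and will cause data skew.
gaussdb=# SELECT id, count(*) AS cnt FROM tb_check GROUP BY id ORDER BY cnt DESC LIMIT 10;
-- Drop the check table.
gaussdb=# DROP TABLE tb_check;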
Data Type Design
To improve query efficiency, adhere to the following principles when designing data types:
- Select efficient data types in the following order of priority: Integer > Floating-point number > Numeric, provided that they all meet the required service precision.
- In tables that are logically related, columns with the same meaning should use the same data type.
- For string data, choose between fixed-length and variable-length character types based on the actual situation. Data types such as varchar and char require a maximum length to be specified; this length must be large enough to store all possible values, yet not so large that storage space is wasted.
When designing a specific column, select a data type that matches its data characteristics. For details about the data types supported by GaussDB, see "SQL Reference > Data Types" in Developer Guide.
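The following sketch illustrates these principles on a hypothetical order table: integer types are preferred for identifiers, numeric is used only where exact precision is required, columns with the same meaning share a type, and character columns get realistic lengths.
-- Hypothetical table illustrating data type selection.
gaussdb=# CREATE TABLE tb_order(
    order_id    bigint,          -- integer type: efficient for identifiers and counters
    customer_id bigint,          -- same meaning as the customer table's customer_id, so same type
    amount      numeric(12,2),   -- numeric only because exact monetary precision is required
    status      char(1),         -- fixed length: the value is always exactly one character
    remark      varchar(200)     -- variable length with a realistic upper bound
);
-- Drop the created table object.
gaussdb=# DROP TABLE tb_order;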
Partitioning Policies
- Overview
Partitioning is a database optimization technology that divides a large table into multiple partitions based on specific rules to enhance query and maintenance efficiency. The partitioned table functions as a logical table that does not store data directly. Instead, data is stored within these partitions and can be distributed across different storage devices. GaussDB currently supports range partitioning, hash partitioning, and list partitioning. Here are the advantages and disadvantages of using partitioned tables:
- Advantages:
- Improved query performance: Reducing the scanned data size significantly enhances query performance.
- Optimized storage: Distributing partitions across various storage media helps balance performance and costs.
- Improved maintainability: Maintenance operations such as data cleanup and index rebuild for partitioned tables can be carried out at the partition level, minimizing the impact on the overall system.
- Improved concurrency: Partitioned tables enable parallel processing of multiple partitions, resulting in improved concurrency. For instance, multiple queries can access different partitions simultaneously without causing interference.
- Disadvantages:
- Memory usage: A partitioned table typically consumes around (Number of partitions × 3/1,024) MB of memory. If there are too many partitions causing memory shortages, performance may decline significantly.
- Complexity of partitioning policies: Technical knowledge and experience are required to develop and implement appropriate partitioning policies. Selecting an inappropriate partitioning policy can lead to uneven data distribution, thereby impacting performance.
- Complexity of backup and restoration: While it is possible to back up and restore partitions individually, this also implies the need for more detailed backup policies and management efforts.
- Usage scenarios
- High query performance: When a table contains a significant amount of data and certain data features are frequently accessed in a particular scenario, you can reduce the scanned data size during queries to enhance query performance. This is particularly useful for tables that are regularly analyzed on a monthly, quarterly, or yearly basis.
- Balance between performance and costs: When a table contains a significant amount of data, it is advisable to store cold data (infrequently accessed data) on low-cost storage, while keeping hot data (frequently accessed data) on high-performance storage.
- Large table management: Tables containing a significant amount of data may need to be stored across multiple storage media.
- Precautions about design
- Selection of partition keys:
Selecting partition keys for partitioned tables is a critical design decision as it directly affects the performance, maintainability, and data management efficiency of the database.
- Query optimization: Specify frequently queried columns as partition keys. For instance, if a table is often queried by date, it is advisable to select the date column as the partition key.
- Data distribution: Consider the distribution of data when selecting partition keys. This helps prevent situations where some partitions store an excessive amount of data while others store only a small amount.
- Partition quantity and management: Limit the number of partitions. Creating excessive partitions can lead to increased management complexity and performance decline.
- Selection of partition types:
- Range partitioning: This type is ideal for partition keys with consecutive values, such as time.
- List partitioning: This type is suitable for partition keys with discrete values that fall into a limited number of categories, such as regions or status codes.
- Hash partitioning: This type is designed for evenly distributed data, such as user IDs.
Below is a simple example that shows the syntax for defining a partitioning policy.
- Log in to the database as the root user.
- Create partitioned tables.
-- Range partitioned table
gaussdb=# CREATE TABLE tb_t1(id INT, info VARCHAR(20))
PARTITION BY RANGE (id)
(
    PARTITION p1 START(1) END(600) EVERY(200),
    PARTITION p2 START(600) END(800),
    PARTITION pmax START(800) END(MAXVALUE)
);
gaussdb=# CREATE TABLE tb_t2(id INT, info VARCHAR(20))
PARTITION BY RANGE (id)
(
    PARTITION p1 VALUES LESS THAN (100),
    PARTITION p2 VALUES LESS THAN (200),
    PARTITION p3 VALUES LESS THAN (MAXVALUE)
);
-- List partitioned table
gaussdb=# CREATE TABLE tb_t3(name VARCHAR(50), area VARCHAR(50))
PARTITION BY LIST (area)
(
    PARTITION p1 VALUES ('bj'),
    PARTITION p2 VALUES ('sh'),
    PARTITION pdefault VALUES (DEFAULT)
);
-- Hash partitioned table
gaussdb=# CREATE TABLE tb_t4(c1 int) PARTITION BY HASH(c1) PARTITIONS 3;
gaussdb=# CREATE TABLE tb_t5(c1 int)
PARTITION BY HASH(c1)
(
    PARTITION pa,
    PARTITION pb,
    PARTITION pc
);
-- Drop the created table objects.
gaussdb=# DROP TABLE tb_t1,tb_t2,tb_t3,tb_t4,tb_t5;
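As a hedged illustration of partition pruning and partition-level maintenance, the sketch below uses a hypothetical range-partitioned table tb_log; the EXPLAIN output and the TRUNCATE PARTITION syntax assume a GaussDB version that supports partition-level maintenance commands.
-- Hypothetical range-partitioned table.
gaussdb=# CREATE TABLE tb_log(id INT, info VARCHAR(20))
PARTITION BY RANGE (id)
(
    PARTITION p1 VALUES LESS THAN (100),
    PARTITION p2 VALUES LESS THAN (200),
    PARTITION pmax VALUES LESS THAN (MAXVALUE)
);
-- A filter on the partition key lets the optimizer scan only the matching partitions.
gaussdb=# EXPLAIN SELECT * FROM tb_log WHERE id < 100;
-- Maintenance can be performed per partition, for example clearing expired data.
gaussdb=# ALTER TABLE tb_log TRUNCATE PARTITION p1;
-- Drop the created table object.
gaussdb=# DROP TABLE tb_log;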
Constraint Configuration
- When creating a constraint, make the constraint name indicate the constraint type and the table it applies to. For example, name a PRIMARY KEY constraint with a pk prefix followed by the table name and the columns that make up the key.
- Exercise caution with DEFAULT constraints. If column values can be supplied at the service layer, using a DEFAULT constraint is not recommended.
- If the NOT NULL constraint is applied to columns that are meant to always contain non-null values, the optimizer will conduct automatic optimization in certain scenarios.
Below is a simple example that shows the syntax for adding constraints.
- Log in to the database as the root user.
- Create a table and add constraints to it.
-- NOT NULL constraint
gaussdb=# CREATE TABLE tb_t1(id int NOT NULL, name varchar(50));
-- UNIQUE constraint
gaussdb=# CREATE TABLE tb_t2(id int UNIQUE, name varchar(50));
gaussdb=# CREATE TABLE tb_t3(id int, name varchar(50), CONSTRAINT unq_t3_id UNIQUE(id));
-- PRIMARY KEY constraint
gaussdb=# CREATE TABLE tb_t4(id int PRIMARY KEY, name varchar(50));
gaussdb=# CREATE TABLE tb_t5(
    id int,
    name varchar(50),
    CONSTRAINT pk_t5_id PRIMARY KEY(id)
);
-- CHECK constraint
gaussdb=# CREATE TABLE tb_t6(name varchar(50), age int CHECK(age > 0 AND age < 200));
gaussdb=# CREATE TABLE tb_t7(
    name varchar(50),
    age int,
    CONSTRAINT chk_t7_age CHECK (age > 0 AND age < 200)
);
-- Drop the created table objects.
gaussdb=# DROP TABLE tb_t1,tb_t2,tb_t3,tb_t4,tb_t5,tb_t6,tb_t7;
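Constraints can also be added to an existing table with ALTER TABLE. The sketch below is hypothetical (table tb_t8 and the constraint names are examples only) and follows the naming convention described above.
gaussdb=# CREATE TABLE tb_t8(id int, name varchar(50));
-- Name each constraint after its type, the table, and the key columns.
gaussdb=# ALTER TABLE tb_t8 ADD CONSTRAINT pk_t8_id PRIMARY KEY (id);
gaussdb=# ALTER TABLE tb_t8 ADD CONSTRAINT chk_t8_name CHECK (name IS NOT NULL);
-- Drop the created table object.
gaussdb=# DROP TABLE tb_t8;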
Index Design
Using indexes helps accelerate data access but can also prolong the time needed to insert, update, or delete data. Therefore, it is crucial to carefully evaluate whether to add indexes to a table and which specific columns to index. It is advisable to adhere to the following principles when setting up indexes:
- Creating indexes for frequently joined columns can enhance join speed.
- Creating indexes for frequently sorted columns can enhance sorting and query speed since indexes are already sorted.
- Creating indexes for columns frequently used in the WHERE clause can enhance the speed of condition judgment.
- A composite index consists of multiple columns. However, including more columns will result in a larger index size and higher maintenance costs.
- Do not apply indexes to frequently updated columns as they can increase the maintenance costs of data updates.
- All functions and operators used in an index definition must be "immutable", that is, their results must depend only on their input parameters and never on any outside influence (such as the contents of another table or the current time). This restriction ensures that the behavior of the index is well-defined. To use a user-defined function in an index or WHERE clause, remember to mark the function immutable when you create it.
- There are two types of indexes for partitioned tables: local index and global index. A local index is specific to a partition within a partitioned table, while a global index spans the entire partitioned table.
- It is crucial to regularly maintain indexes, and there are several scenarios in which to use REINDEX:
- An index has become corrupted, and no longer contains valid data.
- An index has become "bloated", that is, it contains many empty or nearly-empty pages.
- You have altered a storage parameter (such as fill factor) for an index, and wish to ensure that the change has taken full effect.
- An index build with the CONCURRENTLY option failed, leaving an "invalid" index.
- When naming an index, make sure to include the table name and the key columns involved. For example, idx_test_c1 indicates that the index is created on the c1 column of the test table.
Below is a simple example that shows the syntax for adding an index to a table.
- Log in to the database as the root user.
- Create a table and add an index to it.
gaussdb=# CREATE TABLE tb_t1(id int NOT NULL, name varchar(50));
-- Add an index to the table.
gaussdb=# CREATE INDEX idx_t1_id ON tb_t1(id);
-- Drop the created table object.
gaussdb=# DROP TABLE tb_t1;
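The following sketch additionally shows a composite index, local and global indexes on a partitioned table, and index rebuilding with REINDEX. The table tb_sales and all index names are hypothetical, and the LOCAL/GLOBAL keywords assume a GaussDB version that supports them for partitioned table indexes.
-- Hypothetical partitioned table.
gaussdb=# CREATE TABLE tb_sales(id int, region varchar(20), sale_date date)
PARTITION BY RANGE (sale_date)
(
    PARTITION p2023 VALUES LESS THAN ('2024-01-01'),
    PARTITION pmax VALUES LESS THAN (MAXVALUE)
);
-- Composite (multi-column) local index for columns that are frequently filtered together.
gaussdb=# CREATE INDEX idx_sales_region_date ON tb_sales(region, sale_date) LOCAL;
-- Global index spanning all partitions.
gaussdb=# CREATE INDEX idx_sales_id ON tb_sales(id) GLOBAL;
-- Rebuild the global index if it has become bloated or invalid.
gaussdb=# REINDEX INDEX idx_sales_id;
-- Drop the created table object.
gaussdb=# DROP TABLE tb_sales;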
Storage Parameter Optimization
- Fill factor
The fill factor for a table is a percentage between 10 and 100. If Ustore is in use, the default value is 92. If Astore is in use, the default value is 100 (complete packing). When you specify a smaller fillfactor, INSERT operations pack table pages only to the indicated percentage. The remaining space on each page is reserved for updating rows on that page. This gives UPDATE a chance to place the updated copy of a row on the same page as the original, which is more efficient than placing it on a different page. For tables that are never updated, it is best to use a fill factor of 100. However, for heavily updated tables, smaller fill factors are appropriate. Below is an example:
CREATE TABLE test(c1 int,c2 int) WITH (FILLFACTOR = 80);
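For an existing table, the fill factor can also be changed afterwards; the new value applies only to pages written from that point on. A minimal sketch, assuming the test table above:
ALTER TABLE test SET (FILLFACTOR = 70);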
- Storage engine
Specifies the storage engine type. Once set, this parameter cannot be modified. Below is an example:
CREATE TABLE test(c1 int,c2 int) WITH (STORAGE_TYPE = USTORE);
- USTORE: The table uses an in-place update storage engine. To prevent space expansion, be sure to enable the track_counts and track_activities parameters.
- ASTORE: The table uses an append-only storage engine.
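To check which storage parameters are currently set on a table, the reloptions column of the pg_class system catalog can be queried. A minimal sketch, assuming the test table above:
SELECT reloptions FROM pg_class WHERE relname = 'test';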