Updated on 2023-12-29 GMT+08:00

Creating and Managing Tables

Creating a Table

You can run the CREATE TABLE command to create a table. When creating a table, you can define the following information:

Example: Use CREATE TABLE to create a table web_returns_p1, use wr_item_sk as the distribution key, and sets the range distribution function through wr_returned_date_sk.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
CREATE TABLE web_returns_p1
(
    wr_returned_date_sk       integer,
    wr_returned_time_sk       integer,
    wr_item_sk                integer NOT NULL,
    wr_refunded_customer_sk   integer
)
WITH (orientation = column)
DISTRIBUTE BY HASH (wr_item_sk)
PARTITION BY RANGE(wr_returned_date_sk)
(
    PARTITION p2019 START(20191231) END(20221231) EVERY(10000),
    PARTITION p0 END(maxvalue)
);

Definition of Table Constraints

You can define constraints on columns and tables to restrict data in a table. However, there are the following restrictions:

  • The primary key constraint and unique constraint in the table must contain a distribution column.
  • Column-store tables support the PARTIAL CLUSTER KEY and table-level primary key and unique constraints, but do not support table-level foreign key constraints.
  • Only the NULL, NOT NULL, and DEFAULT constant values can be used as column-store table column constraints.
Table 1 Table constraints

Constraint

Description

Example

Check constraint

A CHECK constraint allows you to specify that values in a specific column must satisfy a Boolean (true) expression.

Create the products table. The price column must be positive.

1
2
3
4
5
6
CREATE TABLE products
(
   product_no integer, 
   name text, 
   price numeric CHECK (price > 0)
);

NOT NULL constraint

A NOT NULL constraint specifies that a column cannot have null values. A non-null constraint is always written as a column constraint.

Create the products table. The values of product_no and name cannot be null.

1
2
3
4
5
6
CREATE TABLE products 
( 
    product_no integer NOT NULL,
    name text NOT NULL,
    price numeric 
);

UNIQUE constraint

A UNIQUE constraint specifies that the values in a column or a group of columns are all unique. If DISTRIBUTE BY REPLICATION is not specified, the column table that contains only unique values must contain distribution columns.

Create the products table. The values of product_no must be unique.

1
2
3
4
5
6
CREATE TABLE products 
(
    product_no integer UNIQUE, 
    name text, 
    price numeric
)DISTRIBUTE BY HASH(product_no);

Primary key constraint

A primary key constraint is the combination of a UNIQUE constraint and a NOT NULL constraint. If DISTRIBUTE BY REPLICATION is not specified, the column set with a primary key constraint must contain distributed columns. If a table has a primary key, the column (or group of columns) of the primary key is selected as the distribution keys of the table by default.

Create the products table. The primary key constraint is product_no.

1
2
3
4
5
6
CREATE TABLE products 
( 
    product_no integer PRIMARY KEY, 
    name text, 
    price numeric
)DISTRIBUTE BY HASH(product_no);

Partial cluster key

Partial cluster key can minimize or maximize sparse indexes to quickly filter base tables. Partial cluster key can specify multiple columns, but you are advised to specify no more than two columns.

Create the products table with PCK set to product_no:

1
2
3
4
5
6
7
CREATE TABLE products 
(
    product_no integer, 
    name text, 
    price numeric,
    PARTIAL CLUSTER KEY(product_no)
) WITH (ORIENTATION = COLUMN);

Definition of Table Distribution

GaussDB(DWS) supports the following distribution modes: replication, hash, and roundrobin.

The roundrobin distribution mode is supported only by cluster version 8.1.2 or later.

Policy

Description

Scenario

Advantages/Disadvantages

Replication

Full data in a table is stored on each DN in the cluster.

Small tables and dimension tables

  • The advantage of replication is that each DN has full data of the table. During the join operation, data does not need to be redistributed, reducing network overheads and reducing plan segments (each plan segment starts a corresponding thread).
  • The disadvantage of replication is that each DN retains the complete data of the table, resulting in data redundancy. Generally, replication is only used for small dimension tables.

Hash

Table data is distributed on all DNs in the cluster.

Fact tables containing a large amount of data

  • The I/O resources of each node can be used during data read/write, greatly improving the read/write speed of a table.
  • Generally, a large table (containing over 1 million records) is defined as a hash table.

Polling (Round-robin)

Each row in the table is sent to each DN in turn. Data can be evenly distributed on each DN.

Fact tables that contain a large amount of data and cannot find a proper distribution column in hash mode

  • Round-robin can avoid data skew, improving the space utilization of the cluster.
  • Round-robin does not support local DN optimization like a hash table does, and the query performance of Round-robin is usually lower than that of a hash table.
  • If a proper distribution column can be found for a large table, use the hash distribution mode with better performance. Otherwise, define the table as a round-robin table.

Selecting a Distribution Key

If the hash distribution mode is used, a distribution key must be specified for the user table. When a record is inserted, the system hashes it based on the distribution key and then stores it on the corresponding DN.

Select a hash distribution key based on the following principles:

  1. The values of the distribution key should be discrete so that data can be evenly distributed on each DN. You can select the primary key of the table as the distribution key. For example, for a person information table, choose the ID number column as the distribution key.
  2. Do not select the column that has a constant filter. For example, if a constant constraint (for example, zqdh= '000001') exists on the zqdh column in some queries on the dwcjk table, you are not advised to use zqdh as the distribution key.
  3. With the above principles met, you can select join conditions as distribution keys, so that join tasks can be pushed down to DNs for execution, reducing the amount of data transferred between the DNs.

    For a hash table, an inappropriate distribution key may cause data skew or poor I/O performance on certain DNs. Therefore, you need to check the table to ensure that data is evenly distributed on each DN. You can run the following SQL statements to check for data skew:

    1
    2
    3
    4
    5
    select 
    xc_node_id, count(1) 
    from tablename 
    group by xc_node_id 
    order by xc_node_id desc;
    

    xc_node_id corresponds to a DN. Generally, over 5% difference between the amount of data on different DNs is regarded as data skew. If the difference is over 10%, choose another distribution key.

  4. You are not advised to add a column as a distribution key, especially add a new column and use the SEQUENCE value to fill the column. (Sequences may cause performance bottlenecks and unnecessary maintenance costs.)

View the data in the table.

  • Run the following command to query information about all tables in a database in the system catalog pg_tables:
    1
    SELECT * FROM pg_tables;
    
  • Run the \d+ command of the gsql tool to query table attributes:
    1
    \d+ customer_t1;
    
  • Run the following command to query the data volume of table customer_t1:
    1
    SELECT count(*) FROM customer_t1;
    
  • Run the following command to query all data in table customer_t1:
    1
    SELECT * FROM customer_t1;
    
  • Run the following command to query data in column c_customer_sk:
    1
    SELECT c_customer_sk FROM customer_t1;
    
  • Run the following command to filter repeated data in column c_customer_sk:
    1
    SELECT DISTINCT( c_customer_sk ) FROM customer_t1;
    
  • Run the following command to query all data whose column c_customer_sk is 3869:
    1
    SELECT * FROM customer_t1 WHERE c_customer_sk = 3869;
    
  • Run the following command to sort data based on column c_customer_sk.
    1
    SELECT * FROM customer_t1 ORDER BY c_customer_sk;
    

Deleting Data in a Table

You can delete outdated data from a table by row.

SQL statements can only access and delete an independent row by declaring conditions that match the row. If a table has a primary key column, you can use it to specify a row. You can delete several rows that match the specified condition or delete all the rows from a table.

  • For example, to delete all the rows whose c_customer_sk column is 3869 from table customer_t1, run the following statement:
    1
    DELETE FROM customer_t1 WHERE c_customer_sk = 3869;
    
  • To delete all rows from the table, run either of the following statements:
    1
    DELETE FROM customer_t1;
    
    1
    TRUNCATE TABLE customer_t1;
    

If you need to delete an entire table, you are advised to use the TRUNCATE statement rather than DELETE.

  • Delete the created table.
    1
    DROP TABLE customer_t1;