Creating and Managing Tables
Creating a Table
You can run the CREATE TABLE command to create a table. When creating a table, you can define the following information:
- Columns of the table and their data types.
- Table or column constraints that restrict a column or the data contained in a table. For details, see Definition of Table Constraints.
- Distribution policy of the table, which determines how GaussDB(DWS) divides data between DNs. For details, see Definition of Table Distribution.
- Table storage format. For details, see Selecting a Table Storage Mode.
- Partition table information. For details, see Defining Table Partitions.
Example: Use CREATE TABLE to create a table named web_returns_p1, use wr_item_sk as the distribution key, and set range partitioning on wr_returned_date_sk.
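The statement below is a minimal sketch of such a table, not the original sample. Only the table name, wr_item_sk, and wr_returned_date_sk come from the text above; the remaining column, the partition names, and the partition boundaries are hypothetical and assume that dates are stored as integers in YYYYMMDD form.

```sql
-- Sketch: hash-distributed table with range partitions.
CREATE TABLE web_returns_p1
(
    wr_returned_date_sk  integer,            -- partition key (from the text)
    wr_item_sk           integer NOT NULL,   -- distribution key (from the text)
    wr_return_quantity   integer             -- hypothetical extra column
)
DISTRIBUTE BY HASH (wr_item_sk)
PARTITION BY RANGE (wr_returned_date_sk)
(
    PARTITION p2019 VALUES LESS THAN (20200101),  -- hypothetical boundary
    PARTITION p2020 VALUES LESS THAN (20210101),  -- hypothetical boundary
    PARTITION pmax  VALUES LESS THAN (MAXVALUE)   -- catch-all partition
);
```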
Definition of Table Constraints
You can define constraints on columns and tables to restrict data in a table. However, there are the following restrictions:
- The primary key constraint and unique constraint in the table must contain a distribution column.
- Column-store tables support the PARTIAL CLUSTER KEY and table-level primary key and unique constraints, but do not support table-level foreign key constraints.
- Only the NULL, NOT NULL, and DEFAULT constant values can be used as column-store table column constraints.
- CHECK constraint
A CHECK constraint specifies that values in a column must satisfy a Boolean expression. For example, the product price must be positive.
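The positive-price rule can be sketched as follows. The table name products_chk and its columns are hypothetical, and the table uses the default storage format because, as noted above, column-store tables accept only NULL, NOT NULL, and DEFAULT as column constraints.

```sql
-- Sketch: reject rows whose price is zero or negative.
CREATE TABLE products_chk
(
    product_no  integer,
    name        text,
    price       numeric CHECK (price > 0)
);
```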
- NOT NULL constraint
A NOT NULL constraint specifies that a column cannot have null values. A non-null constraint is always written as a column constraint. For example:
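A minimal sketch of NOT NULL written as a column constraint; the table name products_nn and its columns are hypothetical.

```sql
-- Sketch: product_no and name must always be supplied; price may be null.
CREATE TABLE products_nn
(
    product_no  integer NOT NULL,
    name        text    NOT NULL,
    price       numeric
);
```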
- UNIQUE constraint
A UNIQUE constraint specifies that the values in a column or a group of columns are all unique. If DISTRIBUTE BY REPLICATION is not specified, the column or columns covered by the UNIQUE constraint must include the distribution column.
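As a sketch of the rule above, the unique column below is also the hash distribution column. The table name products_uq and its columns are hypothetical.

```sql
-- Sketch: the UNIQUE column doubles as the distribution column,
-- which satisfies the restriction for non-replicated tables.
CREATE TABLE products_uq
(
    product_no  integer UNIQUE,
    name        text,
    price       numeric
)
DISTRIBUTE BY HASH (product_no);
```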
- Primary key
A primary key constraint is the combination of a UNIQUE constraint and a NOT NULL constraint. If DISTRIBUTE BY REPLICATION is not specified, the columns covered by the primary key constraint must include the distribution column. If a table has a primary key, the primary key column (or group of columns) is selected as the distribution key of the table by default.
For example:
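A minimal sketch of a primary key constraint; the table name products_pk and its columns are hypothetical. No DISTRIBUTE BY clause is given, so, per the text above, the primary key column is expected to become the distribution key.

```sql
-- Sketch: product_no is both unique and non-null.
CREATE TABLE products_pk
(
    product_no  integer PRIMARY KEY,
    name        text,
    price       numeric
);
```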
- Partial cluster key
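As a hedged sketch, a partial cluster key is declared as a table-level constraint on a column-store table (see the restrictions listed earlier in this section). The table name sales_pck and its columns are hypothetical, and the exact syntax should be checked against the CREATE TABLE reference for your cluster version.

```sql
-- Sketch: column-store table with a partial cluster key on sale_date_sk.
-- Assumes the table-constraint form PARTIAL CLUSTER KEY (column_list).
CREATE TABLE sales_pck
(
    sale_date_sk  integer NOT NULL,
    item_sk       integer NOT NULL,
    amount        numeric,
    PARTIAL CLUSTER KEY (sale_date_sk)
)
WITH (ORIENTATION = COLUMN)
DISTRIBUTE BY HASH (item_sk);
```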
Definition of Table Distribution
- GaussDB(DWS) supports the following distribution modes: replication, hash, and roundrobin.
The roundrobin distribution mode is supported only by cluster version 8.1.2 or later.
Policy: Replication
Description: Full data in a table is stored on each DN in the cluster.
Scenario: Small tables and dimension tables
Advantages/Disadvantages:
- Advantage: each DN has the full data of the table, so join operations need no data redistribution. This reduces network overhead and the number of plan segments (each plan segment starts a corresponding thread).
- Disadvantage: each DN retains the complete data of the table, resulting in data redundancy. Generally, replication is used only for small dimension tables.

Policy: Hash
Description: Table data is distributed on all DNs in the cluster.
Scenario: Fact tables containing a large amount of data
Advantages/Disadvantages:
- The I/O resources of each node can be used during data read/write, greatly improving the read/write speed of a table.
- Generally, a large table (containing over 1 million records) is defined as a hash table.

Policy: Round-robin
Description: Each row in the table is sent to the DNs in turn, so data is evenly distributed across the DNs.
Scenario: Fact tables that contain a large amount of data and for which no proper hash distribution column can be found
Advantages/Disadvantages:
- Round-robin avoids data skew, improving the space utilization of the cluster.
- Round-robin does not support local DN optimization as a hash table does, so its query performance is usually lower than that of a hash table.
- If a proper distribution column can be found for a large table, use the hash distribution mode for better performance. Otherwise, define the table as a round-robin table.
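The three policies can be sketched with the DISTRIBUTE BY clause as follows. All table and column names are hypothetical, and the ROUNDROBIN form assumes cluster version 8.1.2 or later, as noted above.

```sql
-- Replication: a small dimension table stored in full on every DN.
CREATE TABLE dim_region
(
    region_id    integer,
    region_name  text
)
DISTRIBUTE BY REPLICATION;

-- Hash: a large fact table spread across DNs by the hash of order_id.
CREATE TABLE fact_orders
(
    order_id   bigint,
    region_id  integer,
    amount     numeric
)
DISTRIBUTE BY HASH (order_id);

-- Round-robin: a large table with no suitable hash column.
CREATE TABLE fact_events
(
    event_time  timestamp,
    payload     text
)
DISTRIBUTE BY ROUNDROBIN;
```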
- Selecting a Distribution Key
If the hash distribution mode is used, a distribution key must be specified for the user table. When a record is inserted, the system hashes it based on the distribution key and then stores it on the corresponding DN.
Select a hash distribution key based on the following principles:
- The values of the distribution key should be discrete so that data can be evenly distributed on each DN. You can select the primary key of the table as the distribution key. For example, for a person information table, choose the ID number column as the distribution key.
- Do not select a column that queries filter by a constant. For example, if some queries on the dwcjk table apply a constant condition to the zqdh column (for example, zqdh = '000001'), you are not advised to use zqdh as the distribution key.
- With the above principles met, you can select join conditions as distribution keys, so that join tasks can be pushed down to DNs for execution, reducing the amount of data transferred between the DNs.
For a hash table, an inappropriate distribution key may cause data skew or poor I/O performance on certain DNs. Therefore, you need to check the table to ensure that data is evenly distributed on each DN. You can run the following SQL statements to check for data skew:
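A minimal sketch of such a check counts the rows stored on each DN. The table name customer_t1 is used only as a stand-in for the table being checked.

```sql
-- Sketch: row count per DN; large differences between rows indicate skew.
SELECT xc_node_id, count(1) AS row_count
FROM customer_t1
GROUP BY xc_node_id
ORDER BY xc_node_id DESC;
```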
xc_node_id corresponds to a DN. Generally, over 5% difference between the amount of data on different DNs is regarded as data skew. If the difference is over 10%, choose another distribution key.
- You are not advised to add a column solely to serve as the distribution key, especially a new column that is filled with SEQUENCE values. (Sequences may cause performance bottlenecks and unnecessary maintenance costs.)
Viewing Data in a Table
- Run the following command to query information about all tables in a database in the system catalog pg_tables:
- Run the \d+ command of the gsql tool to query table attributes:
- Run the following command to query the data volume of table customer_t1:
- Run the following command to query all data in table customer_t1:
- Run the following command to query data in column c_customer_sk:
- Run the following command to filter repeated data in column c_customer_sk:
- Run the following command to query all data whose column c_customer_sk is 3869:
- Run the following command to sort data based on column c_customer_sk.
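The operations in the list above can be sketched as follows, in the same order. Only the names customer_t1, c_customer_sk, and pg_tables come from the text; the statements are illustrative rather than the original samples.

```sql
-- Query information about all tables in the system catalog pg_tables.
SELECT * FROM pg_tables;

-- Query table attributes (gsql meta-command, not an SQL statement).
\d+ customer_t1

-- Query the data volume (row count) of table customer_t1.
SELECT count(*) FROM customer_t1;

-- Query all data in table customer_t1.
SELECT * FROM customer_t1;

-- Query only column c_customer_sk.
SELECT c_customer_sk FROM customer_t1;

-- Filter repeated values in column c_customer_sk.
SELECT DISTINCT c_customer_sk FROM customer_t1;

-- Query all rows whose c_customer_sk is 3869.
SELECT * FROM customer_t1 WHERE c_customer_sk = 3869;

-- Sort rows by column c_customer_sk.
SELECT * FROM customer_t1 ORDER BY c_customer_sk;
```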
Deleting Data in a Table
You can delete outdated data from a table by row.
An SQL statement can address an individual row only by declaring conditions that match that row. If a table has a primary key column, you can use it to specify a row. You can delete the rows that match a specified condition, or delete all rows from a table.
- For example, to delete all the rows whose c_customer_sk column is 3869 from table customer_t1, run the following statement:
- To delete all rows from the table, run either of the following statements:
If you need to delete all data in a table, you are advised to use the TRUNCATE statement rather than DELETE.
- Delete the created table.
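The delete operations described in this section can be sketched as follows, using the table and column names from the text. The statements are illustrative rather than the original samples.

```sql
-- Delete all rows whose c_customer_sk is 3869.
DELETE FROM customer_t1 WHERE c_customer_sk = 3869;

-- Delete all rows from the table: either statement works,
-- but TRUNCATE is the recommended choice for clearing a whole table.
DELETE FROM customer_t1;
TRUNCATE TABLE customer_t1;

-- Delete the table itself.
DROP TABLE customer_t1;
```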