Table Design
GaussDB uses a distributed architecture. Data is distributed on DNs. Generally, well-designed table must comply with the following rules:
- Evenly distribute data on each DN to prevent data skew. If most data is stored on several DNs, the effective capacity of a cluster decreases. Select a proper distribution key to avoid data skew.
- Evenly scan each DN when querying tables. Otherwise, DNs most frequently scanned will become the performance bottleneck. For example, when you use equivalent filter conditions on a fact table, the nodes are not evenly scanned.
- Reduce the amount of data to be scanned. You can use the pruning mechanism of a partitioned table.
- Minimize random I/Os. Through clustering, you can sequentially store hot data, converting random I/O to sequential I/O to reduce the cost of I/O scanning.
- Avoid data shuffle. To shuffle data is to physically transfer it from one node to another. This unnecessarily occupies many network resources. To reduce network pressure, locally process data, and to improve cluster performance and concurrency, you can minimize data shuffling by using proper association and grouping conditions.
Selecting a Distribution Mode
Distribution Mode |
Description |
Application Scenario |
---|---|---|
Hash |
Table data is distributed on all DNs in a cluster by hash. |
Fact tables containing a large amount of data |
Replication |
Full data in a table is stored on every DN in the cluster. |
Dimension tables and fact tables containing a small amount of data |
Range |
Table data is mapped to specified columns based on the range and distributed to the corresponding DNs. |
Users need to customize distribution rules. |
List |
Table data is mapped to specified columns based on specific values and distributed to corresponding DNs. |
Users need to customize distribution rules. |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 |
-- Define a table with each row stored in all DNs. CREATE TABLE warehouse_d1 ( W_WAREHOUSE_SK INTEGER NOT NULL, W_WAREHOUSE_ID CHAR(16) NOT NULL, W_WAREHOUSE_NAME VARCHAR(20) , W_WAREHOUSE_SQ_FT INTEGER , W_STREET_NUMBER CHAR(10) , W_STREET_NAME VARCHAR(60) , W_STREET_TYPE CHAR(15) , W_SUITE_NUMBER CHAR(10) , W_CITY VARCHAR(60) , W_COUNTY VARCHAR(30) , W_STATE CHAR(2) , W_ZIP CHAR(10) , W_COUNTRY VARCHAR(20) , W_GMT_OFFSET DECIMAL(5,2) )DISTRIBUTE BY REPLICATION; -- Define a table using HASH distribution. CREATE TABLE warehouse_d2 ( W_WAREHOUSE_SK INTEGER NOT NULL, W_WAREHOUSE_ID CHAR(16) NOT NULL, W_WAREHOUSE_NAME VARCHAR(20) , W_WAREHOUSE_SQ_FT INTEGER , W_STREET_NUMBER CHAR(10) , W_STREET_NAME VARCHAR(60) , W_STREET_TYPE CHAR(15) , W_SUITE_NUMBER CHAR(10) , W_CITY VARCHAR(60) , W_COUNTY VARCHAR(30) , W_STATE CHAR(2) , W_ZIP CHAR(10) , W_COUNTRY VARCHAR(20) , W_GMT_OFFSET DECIMAL(5,2), CONSTRAINT W_CONSTR_KEY3 UNIQUE(W_WAREHOUSE_SK) )DISTRIBUTE BY HASH(W_WAREHOUSE_SK); -- Define a table using RANGE distribution. CREATE TABLE warehouse_d3 ( W_WAREHOUSE_SK INTEGER NOT NULL, W_WAREHOUSE_ID CHAR(16) NOT NULL, W_WAREHOUSE_NAME VARCHAR(20) , W_WAREHOUSE_SQ_FT INTEGER , W_STREET_NUMBER CHAR(10) , W_STREET_NAME VARCHAR(60) , W_STREET_TYPE CHAR(15) , W_SUITE_NUMBER CHAR(10) , W_CITY VARCHAR(60) , W_COUNTY VARCHAR(30) , W_STATE CHAR(2) , W_ZIP CHAR(10) , W_COUNTRY VARCHAR(20) , W_GMT_OFFSET DECIMAL(5,2) )DISTRIBUTE BY RANGE(W_WAREHOUSE_ID) ( SLICE s1 VALUES LESS THAN (10) DATANODE dn1, SLICE s2 VALUES LESS THAN (20) DATANODE dn2, SLICE s3 VALUES LESS THAN (30) DATANODE dn3, SLICE s4 VALUES LESS THAN (MAXVALUE) DATANODE dn4 ); -- Define a table using LIST distribution. CREATE TABLE warehouse_d4 ( W_WAREHOUSE_SK INTEGER NOT NULL, W_WAREHOUSE_ID CHAR(16) NOT NULL, W_WAREHOUSE_NAME VARCHAR(20) , W_WAREHOUSE_SQ_FT INTEGER , W_STREET_NUMBER CHAR(10) , W_STREET_NAME VARCHAR(60) , W_STREET_TYPE CHAR(15) , W_SUITE_NUMBER CHAR(10) , W_CITY VARCHAR(60) , W_COUNTY VARCHAR(30) , W_STATE CHAR(2) , W_ZIP CHAR(10) , W_COUNTRY VARCHAR(20) , W_GMT_OFFSET DECIMAL(5,2) )DISTRIBUTE BY LIST(W_COUNTRY) ( SLICE s1 VALUES ('USA') DATANODE dn1, SLICE s2 VALUES ('CANADA') DATANODE dn2, SLICE s3 VALUES ('UK') DATANODE dn3, SLICE s4 VALUES (DEFAULT) DATANODE dn4 ); |
For details about the table distribution syntax, see CREATE TABLE.
Selecting a Distribution Key
A distribution key is important for a distribution table. An improper distribution key may cause data skew. As a result, the I/O load is heavy on several DNs, affecting the overall query performance. Therefore, after determining the distribution policy of a distribution table, you need to check the table data skew to ensure that data is evenly distributed. Comply with the following rules to select a distribution key:
- Select a column containing discrete data as the distribution key, so that data can be evenly distributed on each DN. If the data in a single column is not discrete enough, consider using multiple columns as distribution keys. You can select the primary key of a table as the distribution key. For example, in an employee information table, select the certificate number column as the distribution key.
- If the first rule is met, do not select a column having constant filter conditions as the distribution key. For example, in a query on the dwcjk table, if the zqdh column contains the constant filter condition zqdh='000001', avoid selecting the zqdh column as the distribution key.
- If the first and second rules are met, select the join conditions in a query as distribution keys. If a join condition is used as a distribution key, the data involved in a join task is locally distributed on DNs, which greatly reduces the data flow cost among DNs.
Selecting a Partitioning Mode
Comply with the following rules to partition a table containing a large amount of data:
- Create partitions on columns that indicate certain ranges, such as dates and regions.
- A partition name should show the data characteristics of a partition. For example, its format can be Keyword+Range characteristics.
- Set the upper limit of a partition to MAXVALUE to prevent data overflow.
Partitioning Mode |
Description |
---|---|
Range |
Table data is partitioned by range. |
List |
Table data is partitioned by a specified column based on a specific value. |
Hash |
Table data is partitioned by hash. |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
-- Create a range partitioned table. CREATE TABLE staffS_p1 ( staff_ID NUMBER(6) not null, FIRST_NAME VARCHAR2(20), LAST_NAME VARCHAR2(25), EMAIL VARCHAR2(25), PHONE_NUMBER VARCHAR2(20), HIRE_DATE DATE, employment_ID VARCHAR2(10), SALARY NUMBER(8,2), COMMISSION_PCT NUMBER(4,2), MANAGER_ID NUMBER(6), section_ID NUMBER(4) ) PARTITION BY RANGE (HIRE_DATE) ( PARTITION HIRE_19950501 VALUES LESS THAN ('1995-05-01 00:00:00'), PARTITION HIRE_19950502 VALUES LESS THAN ('1995-05-02 00:00:00'), PARTITION HIRE_maxvalue VALUES LESS THAN (MAXVALUE) ); -- Create a list partitioned table. CREATE TABLE test_list (col1 int, col2 int) partition by list(col1) ( partition p1 values (2000), partition p2 values (3000), partition p3 values (4000), partition p4 values (5000) ); -- Create a hash partitioned table. CREATE TABLE test_hash (col1 int, col2 int) partition by hash(col1) ( partition p1, partition p2 ); |
For details about the table partition syntax, see CREATE TABLE PARTITION.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot