TABLE Object Design (Prioritized)
Rule 2.9: Selecting the Optimal Distribution Method and Columns During Table Creation
Impact of rule violation:
-
Incorrect distribution method and column selection can cause storage skew, deteriorate access performance, and even overload storage and computing resources.
Solution:
- When creating a table, it is important to use the DISTRIBUTE BY clause to explicitly specify the distribution method and distribution columns. The table below provides principles for selecting the distribution columns.
Distribution Method |
Description |
Scenario |
---|---|---|
Hash |
Table data is distributed to each DN based on the mapping between hash values generated by distribution columns and DNs.
|
Large tables and fact tables |
RoundRobin |
Table data is distributed to DNs in polling mode.
|
Large tables, fact tables, and tables without proper distribution columns |
Replication |
Full data in a table is copied to each DN in the cluster.
|
Small tables and dimension tables |
Rule 2.10 Selecting an Optimal Storage Type During Table Creation
Impact of rule violation:
-
Row-store tables are not properly used. As a result, the query performance is poor and resources are overloaded.
- Improper use of column-store tables causes CU expansion, poor performance, and resource overload.
Solution:
- When creating a table, use orientation to explicitly specify the storage type. The following table describes the rules for selecting a storage type.
Storage Type |
Applicable Scenario |
Inapplicable Scenario |
---|---|---|
Row storage |
|
DML query: statistical analysis query (with mass data involved in GROUP and JOIN processes)
CAUTION:
When creating a row-store table (orientation is set to row), do not specify the compress attribute or use a row-store compressed table. |
Column storage |
|
|
Rule 2.11 Selecting an Optimal Partitioning Policy During Table Creation
Impact of rule violation:
Without partitioning, query performance and data governance efficiency will deteriorate. The larger the data volume, the greater the deterioration. The advantages of partitioning include:
- High query performance: The system queries only the concerned partitions rather than the whole table, improving the query efficiency.
- Improved data governance efficiency: In the data lifecycle management scenario, performing TRUNCATE or DROP PARTITION on historical partitions is much more efficient and effective than using DELETE.
Solution:
- Design partitions for tables that contain fields of the time type.
Partitioning Policy |
Description |
Scenario |
---|---|---|
Range partitioning |
Data is stored in different partitions based on the range of partition key values. The partition key ranges are consecutive but not overlapped. |
|
List partitioning |
Partitioning is performed based on a unique list of partition key values. |
|
Suggestion 2.12: Designing Table Columns for Fast and Accurate Queries
Impact of rule violation:
- The system may have limited storage space and low query efficiency.
Solution:
- Design the table columns well for fast queries.
- If possible, use integers instead of floating points or characters.
- When using variable-length character type, specify the maximum length based on data features.
- Design the table columns well for accurate queries.
- Use the time type instead of the character type to store time data.
- Use the minimum numeric type that meets the requirements. Avoid using bigint if int or smallint is sufficient to save space.
- Correctly use the constraints.
- Add NOT NULL constraints to columns that never have NULL values. The optimizer automatically optimizes the columns in certain scenarios.
- Do not use the DEFAULT constraint for fields that can be supplemented at the service layer. Otherwise, unexpected results may be generated during data loading.
- Avoid unnecessary data type conversion.
- In tables that are logically related, columns having the same meaning should use the same data type.
- Different types of comparison operations cause data type conversion, which may cause index and partition pruning failures and affect query performance.
Suggestion 2.13: Avoiding the Usage of Auto-increment Columns or Data Types
Impact of rule violation:
-
When auto-increment sequences or data types are heavily used, the GTM may become overloaded and slow down sequence generation.
Solution:
- Set a UUID to obtain a unique ID.
- If the auto-increment sequence must be used and there is no strict requirement for increasing order, you can set the cache, for example, 1000, to reduce the pressure on GTM.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.