Updated on 2024-12-18 GMT+08:00

TABLE Object Design (Prioritized)

Rule 2.9: Selecting the Optimal Distribution Method and Columns During Table Creation

Impact of rule violation:

  • Incorrect distribution method and column selection can cause storage skew, deteriorate access performance, and even overload storage and computing resources.

Solution:

  • When creating a table, it is important to use the DISTRIBUTE BY clause to explicitly specify the distribution method and distribution columns. The table below provides principles for selecting the distribution columns.
Table 1 Distribution column selection

Distribution Method

Description

Scenario

Hash

Table data is distributed to each DN based on the mapping between hash values generated by distribution columns and DNs.

  • Advantage: Each DN contains only part of data, which is space-saving.
  • Disadvantage: The even distribution of data depends heavily on the selection of distribution columns. If the join condition does not include the distribution columns of each node, data communication between nodes will be required.

Large tables and fact tables

RoundRobin

Table data is distributed to DNs in polling mode.

  • Advantage: Each DN only contains a portion of the data, taking up a small amount of space. Data is evenly distributed in polling mode and does not rely on distribution columns, eliminating data skews.
  • Disadvantage: Using distribution column conditions cannot eliminate or reduce inter-node communication. In this scenario, the performance is inferior to that of HASH.

Large tables, fact tables, and tables without proper distribution columns

Replication

Full data in a table is copied to each DN in the cluster.

  • Advantage: Each DN holds the complete data of the table. The JOIN operation avoids data communication between nodes, reducing network overhead and the overhead of starting and stopping the STREAM thread.
  • Disadvantage: Each DN retains complete table data, which is redundant and occupies more storage space.

Small tables and dimension tables

Rule 2.10 Selecting an Optimal Storage Type During Table Creation

Impact of rule violation:

  • Row-store tables are not properly used. As a result, the query performance is poor and resources are overloaded.

  • Improper use of column-store tables causes CU expansion, poor performance, and resource overload.

Solution:

  • When creating a table, use orientation to explicitly specify the storage type. The following table describes the rules for selecting a storage type.
Table 2 Storage type selection

Storage Type

Applicable Scenario

Inapplicable Scenario

Row storage

  • DML addition, deletion, and modification: scenarios with many UPDATE and DELETE operations
  • DML query: point query (simple index–based query that returns only a few records)

DML query: statistical analysis query (with mass data involved in GROUP and JOIN processes)

CAUTION:

When creating a row-store table (orientation is set to row), do not specify the compress attribute or use a row-store compressed table.

Column storage

  • DML addition, deletion, and modification: INSERT batch import scenario (The number of data records imported to a single partition at a time is approximately 60,000 times the number of DNs or greater.)
  • DML query: statistical analysis query (with mass data involved in GROUP and JOIN processes)
  • DML addition, deletion, and modification: scenarios with many UPDATE/DELETE operations or a small number of INSERT operations
  • DML query: high-concurrency point query

Rule 2.11 Selecting an Optimal Partitioning Policy During Table Creation

Impact of rule violation:

Without partitioning, query performance and data governance efficiency will deteriorate. The larger the data volume, the greater the deterioration. The advantages of partitioning include:

  • High query performance: The system queries only the concerned partitions rather than the whole table, improving the query efficiency.
  • Improved data governance efficiency: In the data lifecycle management scenario, performing TRUNCATE or DROP PARTITION on historical partitions is much more efficient and effective than using DELETE.

Solution:

  • Design partitions for tables that contain fields of the time type.
Table 3 Partitioning policy selection

Partitioning Policy

Description

Scenario

Range partitioning

Data is stored in different partitions based on the range of partition key values. The partition key ranges are consecutive but not overlapped.

  1. The date or time field is used as the partition key.
  2. Most queries contain partition keys as filter criteria.
  3. Periodically delete data based on the partition key.

List partitioning

Partitioning is performed based on a unique list of partition key values.

  1. A specific number of enumerated values are used as partition key values.
  2. Most queries contain partition keys as filter criteria.

Suggestion 2.12: Designing Table Columns for Fast and Accurate Queries

Impact of rule violation:

  • The system may have limited storage space and low query efficiency.

Solution:

  1. Design the table columns well for fast queries.
    • If possible, use integers instead of floating points or characters.
    • When using variable-length character type, specify the maximum length based on data features.
  2. Design the table columns well for accurate queries.
    • Use the time type instead of the character type to store time data.
    • Use the minimum numeric type that meets the requirements. Avoid using bigint if int or smallint is sufficient to save space.
  3. Correctly use the constraints.
    • Add NOT NULL constraints to columns that never have NULL values. The optimizer automatically optimizes the columns in certain scenarios.
    • Do not use the DEFAULT constraint for fields that can be supplemented at the service layer. Otherwise, unexpected results may be generated during data loading.
  4. Avoid unnecessary data type conversion.
    • In tables that are logically related, columns having the same meaning should use the same data type.
    • Different types of comparison operations cause data type conversion, which may cause index and partition pruning failures and affect query performance.

Suggestion 2.13: Avoiding the Usage of Auto-increment Columns or Data Types

Impact of rule violation:

  • When auto-increment sequences or data types are heavily used, the GTM may become overloaded and slow down sequence generation.

Solution:

  • Set a UUID to obtain a unique ID.
  • If the auto-increment sequence must be used and there is no strict requirement for increasing order, you can set the cache, for example, 1000, to reduce the pressure on GTM.