Updated on 2024-05-07 GMT+08:00

Selecting a Distribution Mode

In replication mode, the full data of a table is copied to each DN in the cluster. Because every DN holds a complete copy, join operations do not require data redistribution, which reduces network overhead and the number of plan segments (each of which runs in its own thread). The trade-off is a large amount of redundant data, so this mode is generally used only for small dimension tables.
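The trade-off described above can be sketched in a few lines of Python. This is only an illustration, not how the database is implemented: DNs are modeled as plain in-memory structures, and the table names, rows, and the `local_join` helper are hypothetical.

```python
# Hypothetical illustration only: each DN is modeled as an in-memory structure.
NUM_DNS = 3

# A small dimension table (region_id -> region name).
dim_region = {1: "north", 2: "south", 3: "east"}

# Replication mode: every DN stores a complete copy of the dimension table.
replicated = [dict(dim_region) for _ in range(NUM_DNS)]

# A larger table whose rows (order_id, region_id) are already spread across the DNs.
fact_slices = [
    [(100, 1), (101, 3)],  # rows stored on DN 0
    [(102, 2)],            # rows stored on DN 1
    [(103, 1), (104, 2)],  # rows stored on DN 2
]

def local_join(dn):
    """Join the fact rows on one DN with that DN's own copy of the dimension table."""
    return [(order_id, replicated[dn][region_id])
            for order_id, region_id in fact_slices[dn]]

# Each DN joins independently; no rows move between DNs.
joined = [row for dn in range(NUM_DNS) for row in local_join(dn)]

# Cost of replication: every DN beyond the first holds a redundant copy.
redundant_rows = len(dim_region) * (NUM_DNS - 1)
```

Every lookup in `local_join` is served from the copy on the same DN, which is why no redistribution is needed, while `redundant_rows` grows with both the table size and the number of DNs.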

In hash mode, a hash value is computed from one or more columns, and the mapping between hash values and DNs determines where each tuple is stored. Because reads and writes on a hash table can use the I/O resources of every node, the read/write speed of the table improves. Generally, a table containing a large amount of data is defined as a hash table.
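As a rough sketch of the idea, the following Python snippet places tuples on DNs by hashing a distribution key. The hash function (`zlib.crc32`), the modulo mapping, the cluster size, and the `dn_for_key` helper are all assumptions chosen for illustration; the actual hash function and hash-to-DN mapping used by the database are not described here.

```python
import zlib
from collections import Counter

NUM_DNS = 4  # hypothetical cluster size

def dn_for_key(*key_columns, num_dns=NUM_DNS):
    """Return the index of the DN that would store a tuple with these key values.

    Assumption: CRC32 modulo the DN count stands in for the real
    hash function and hash-to-DN mapping.
    """
    encoded = "|".join(str(value) for value in key_columns).encode("utf-8")
    return zlib.crc32(encoded) % num_dns

# Place 1000 sample rows (order_id, payload) using order_id as the distribution key.
rows = [(order_id, order_id % 7) for order_id in range(1000)]
placement = Counter(dn_for_key(order_id) for order_id, _ in rows)
```

The same key always maps to the same DN, so a tuple can be located without scanning every node, and `placement` shows how many sample rows each DN would receive.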

Range distribution and list distribution are user-defined distribution policies. A row is stored on a target DN when its distribution key value falls within the range assigned to that DN (range distribution) or matches one of the values listed for that DN (list distribution). These two modes allow flexible data management but require users to have some data abstraction skills.
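A minimal sketch of how such user-defined rules might resolve a key to a DN is shown below. The boundaries, value lists, DN names, and both helper functions are hypothetical and only illustrate the lookup logic, not the database's own syntax or behavior.

```python
import bisect

# Range distribution (hypothetical rules): each slice has an exclusive upper bound.
RANGE_UPPER_BOUNDS = [100, 200, 300]
RANGE_DNS = ["dn1", "dn2", "dn3"]

def dn_for_range(value):
    """Return the DN whose range contains the distribution key value."""
    idx = bisect.bisect_right(RANGE_UPPER_BOUNDS, value)
    if idx == len(RANGE_UPPER_BOUNDS):
        raise ValueError(f"no range rule covers key {value!r}")
    return RANGE_DNS[idx]

# List distribution (hypothetical rules): each DN owns an explicit set of values.
LIST_RULES = {
    "dn1": {"CN", "JP"},
    "dn2": {"DE", "FR"},
}

def dn_for_list(value):
    """Return the DN whose value list contains the distribution key value."""
    for dn, values in LIST_RULES.items():
        if value in values:
            return dn
    raise ValueError(f"no list rule covers key {value!r}")
```

In this sketch, a key that no rule covers raises an error, which reflects why these modes require the user to design rules that account for every expected key value.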

| Policy      | Description                                                                                                  | Application Scenario                           |
|-------------|--------------------------------------------------------------------------------------------------------------|------------------------------------------------|
| Hash        | Table data is distributed across all DNs in the cluster.                                                     | Fact tables containing a large amount of data  |
| Replication | Full data in a table is stored on every DN in the cluster.                                                   | Small tables and dimension tables              |
| Range       | Table data is mapped by value range on the specified columns and distributed to the corresponding DNs.      | Users need to customize distribution rules.    |
| List        | Table data is mapped by specific values of the specified columns and distributed to the corresponding DNs.  | Users need to customize distribution rules.    |

As shown in Figure 1, T1 is a replication table and T2 is a hash table.

Figure 1 Replication tables and hash tables
  • When you insert, modify, or delete data in a replication table, if you wrap components that cannot be pushed down in a function declared as shippable or immutable, the data in the replication table may become inconsistent across DNs.
  • If statements with unstable results (for example, statements that use window functions, rownum, LIMIT clauses, or user-defined functions) are used to insert or modify data in a replication table, the data may differ from node to node.
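
The reason unstable expressions are risky for replication tables can be illustrated with a small Python model. It assumes, purely for illustration, that each DN evaluates the written value independently; the per-DN random generators, the `insert_on_each_dn` helper, and the sample rows are hypothetical stand-ins for any expression whose result is not stable.

```python
import random

NUM_DNS = 3  # hypothetical cluster size

# Each DN has its own random state, standing in for any unstable expression.
dn_rngs = [random.Random(seed) for seed in range(NUM_DNS)]

def insert_on_each_dn(make_row):
    """Model every DN computing the row to store in its own copy of the table."""
    return [make_row(dn) for dn in range(NUM_DNS)]

def replicas_consistent(copies):
    """Return True if every DN ended up with the same row."""
    return all(copy == copies[0] for copy in copies)

# A stable expression yields the same row on every DN.
stable = insert_on_each_dn(lambda dn: ("order-1", 42))

# An unstable expression yields a different row on each DN.
unstable = insert_on_each_dn(lambda dn: ("order-1", dn_rngs[dn].random()))
```

Here `replicas_consistent(stable)` holds while `replicas_consistent(unstable)` does not, which mirrors the inconsistency the notes above warn about.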