Selecting a Distribution Mode
Replication copies the full data of a table to every DN in the cluster. It suits tables with small record sets. Because each DN stores the full table, no data redistribution is needed during a join operation. This reduces network costs and the number of plan segments (each of which runs in its own thread), but it produces a large amount of redundant data. Generally, replication is used only for small dimension tables.
In a hash table, a hash value is generated from one or more columns. The storage location of a tuple is determined by the mapping between DNs and hash values. Because the data is spread across nodes, reads and writes can use the I/O resources of every node, which greatly improves the read/write speed of the table. Generally, a table containing a large amount of data is defined as a hash table.
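The mapping from a column value to a DN can be sketched in a few lines of Python. This is only an illustration: the cluster size `DN_COUNT`, the helper names `dn_for` and `distribute`, and the use of `zlib.crc32` with a modulo are assumptions made for the example, not the hash function or the mapping that the database actually uses.

```python
import zlib

DN_COUNT = 4  # hypothetical cluster size, chosen for the example


def dn_for(value, dn_count=DN_COUNT):
    """Map a distribution-column value to a DN index.

    crc32 modulo the DN count is an illustrative stand-in for the
    real hash function and hash-to-DN mapping.
    """
    digest = zlib.crc32(str(value).encode("utf-8"))
    return digest % dn_count


def distribute(rows, column, dn_count=DN_COUNT):
    """Place each row on exactly one DN based on the hash of `column`."""
    nodes = {dn: [] for dn in range(dn_count)}
    for row in rows:
        nodes[dn_for(row[column], dn_count)].append(row)
    return nodes


# A made-up fact table with 1000 rows, distributed on order_id.
orders = [{"order_id": i, "customer_id": i % 7} for i in range(1000)]
placement = distribute(orders, "order_id")

for dn, rows in placement.items():
    print(f"DN{dn}: {len(rows)} rows")
```

Each row lands on exactly one DN, and the same column value always maps to the same DN, so a scan or load of the table can be served by all nodes in parallel.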
| Policy | Description | Application Scenario |
|---|---|---|
| Hash | Table data is distributed on all DNs in the cluster in hash mode. | Fact tables containing a large amount of data |
| Replication | Full data in a table is stored on each DN in the cluster. | Small tables and dimension tables |
As shown in Figure 1, T1 is a replication table and T2 is a hash table.
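The effect of combining the two modes can be sketched as a small Python simulation. It assumes a hypothetical three-DN cluster, made-up contents for T1 and T2, and an illustrative `crc32`-based placement function; none of these come from the product itself. The point it shows is that every DN can join its local slice of T2 against its own full copy of T1 without moving rows between nodes.

```python
import zlib

DN_COUNT = 3  # hypothetical cluster size


def dn_for(key):
    """Illustrative placement function, not the database's real hash."""
    return zlib.crc32(str(key).encode("utf-8")) % DN_COUNT


# T1: small dimension table (region_id -> region name), made-up data.
t1 = {1: "north", 2: "south", 3: "east"}
# T2: larger fact table, made-up data.
t2 = [{"sale_id": i, "region_id": i % 3 + 1} for i in range(12)]

# Replication: every DN holds a full copy of T1.
t1_on_dn = {dn: dict(t1) for dn in range(DN_COUNT)}

# Hash: every T2 row is stored on exactly one DN.
t2_on_dn = {dn: [] for dn in range(DN_COUNT)}
for row in t2:
    t2_on_dn[dn_for(row["sale_id"])].append(row)


def local_join(dn):
    """Join the DN's T2 rows with its local T1 copy; no other DN is read."""
    return [(r["sale_id"], t1_on_dn[dn][r["region_id"]]) for r in t2_on_dn[dn]]


joined = [pair for dn in range(DN_COUNT) for pair in local_join(dn)]
stored_t1_rows = sum(len(copy) for copy in t1_on_dn.values())

print(f"joined rows: {len(joined)}")
print(f"T1 rows stored across the cluster: {stored_t1_rows}")
```

The join result is complete even though each DN looked only at local data. The cost is visible in the last line: T1 is stored once per DN, which is why replication is reserved for small tables.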
