Selecting a Distribution Mode
In replication mode, the full data of a table is copied to each DN in the cluster. This mode is used for tables containing a small volume of data. Because every DN holds the complete table, data does not need to be redistributed during join operations, which reduces network overhead and the number of plan segments (each plan segment starts a corresponding thread). The disadvantage is data redundancy, since each DN retains the complete table. Generally, this mode is used only for small dimension tables.
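The effect on joins can be illustrated with a small simulation. This is only a sketch under stated assumptions: the three-DN cluster, the `dn_for` placement function (CRC32 modulo the DN count), and the sample rows are hypothetical and do not reflect the database's real hash function or planner. It counts how many fact rows would have to leave their DN to meet the matching dimension row.

```python
import zlib

NUM_DNS = 3  # hypothetical cluster size

def dn_for(key) -> int:
    """Stand-in placement rule (assumption): CRC32 of the key modulo the DN count."""
    return zlib.crc32(str(key).encode()) % NUM_DNS

# Fact rows as (order_id, dim_id); the fact table is hash-distributed on order_id.
fact = [(order_id, order_id % 4) for order_id in range(12)]

def rows_to_move(dim_replicated: bool) -> int:
    """Count fact rows that must be sent to another DN to join on dim_id."""
    if dim_replicated:
        # Every DN holds the whole dimension table, so each fact row joins locally.
        return 0
    # Otherwise the matching dimension row lives only on dn_for(dim_id).
    return sum(1 for order_id, dim_id in fact
               if dn_for(order_id) != dn_for(dim_id))

replicated_moves = rows_to_move(dim_replicated=True)
hashed_moves = rows_to_move(dim_replicated=False)
print("rows moved, replicated dimension table:", replicated_moves)
print("rows moved, hash-distributed dimension table:", hashed_moves)
```

A real optimizer may choose other strategies, such as broadcasting the smaller table, so the second count is only an illustration of the redistribution cost that replication avoids.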
In hash mode, a hash value is generated from one or more columns. The storage location of a tuple is determined by the mapping between hash values and DNs. With a hash table, the I/O resources of every node can be used for data reads and writes, which improves the read/write speed of the table. Generally, a table containing a large amount of data is defined as a hash table.
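A minimal sketch of hash-based placement follows. The hash function (CRC32), the four-DN cluster, and the sample rows are assumptions chosen for illustration; the database uses its own hash function and its own mapping between hash values and DNs.

```python
import zlib
from collections import Counter

def dn_for(key_values: tuple, num_dns: int) -> int:
    """Map the distribution key (one or more column values) to a DN index.

    CRC32 modulo the DN count is a stand-in, not the database's real hash.
    """
    key = "|".join(str(v) for v in key_values).encode()
    return zlib.crc32(key) % num_dns

NUM_DNS = 4  # hypothetical cluster size
rows = [(i, f"user{i}") for i in range(1000)]  # (id, name), distributed on id

# The same key always lands on the same DN, so a lookup knows where to go.
placement = Counter(dn_for((row[0],), NUM_DNS) for row in rows)
for dn in sorted(placement):
    print(f"DN {dn}: {placement[dn]} rows")
```

Because rows are spread over all DNs, reads and writes on a large table can use the I/O resources of every node rather than a single one.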
Range distribution and list distribution are user-defined distribution policies. A row is stored on the target DN whose defined range (range distribution) or set of specific values (list distribution) contains the row's distribution key value. These two modes allow flexible data management, but they require users to have some data abstraction capability. For details, see Table 1.
Policy | Description | Application Scenario |
---|---|---|
Hash | Table data is distributed on all DNs in the cluster. | Fact tables containing a large amount of data |
Replication | Full data in a table is stored on every DN in the cluster. | Small tables and dimension tables |
Range | Table data is mapped by value ranges of the specified columns and distributed to the corresponding DNs. | Users need to customize distribution rules. |
List | Table data is mapped by specific values of the specified columns and distributed to the corresponding DNs. | Users need to customize distribution rules. |
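The user-defined rules behind the range and list policies can be sketched as simple lookups. The boundaries, value lists, and DN names below are hypothetical examples, not defaults of the database.

```python
from bisect import bisect_right

# Hypothetical range rule: exclusive upper bounds and the DN that owns each slice.
RANGE_BOUNDS = [100, 200, 300]
RANGE_DNS = ["dn1", "dn2", "dn3"]

def dn_by_range(value: int) -> str:
    """Return the DN whose range contains the distribution key value."""
    idx = bisect_right(RANGE_BOUNDS, value)
    if idx >= len(RANGE_DNS):
        raise ValueError(f"{value} is outside every defined range")
    return RANGE_DNS[idx]

# Hypothetical list rule: each listed key value is assigned to one DN.
LIST_RULES = {"north": "dn1", "south": "dn2", "east": "dn3", "west": "dn3"}

def dn_by_list(value: str) -> str:
    """Return the DN assigned to this specific distribution key value."""
    if value not in LIST_RULES:
        raise ValueError(f"{value!r} is not covered by any list rule")
    return LIST_RULES[value]

print(dn_by_range(150), dn_by_list("south"))
```

Both lookups fail for a key that no rule covers, which is why these policies require the rules to be designed around the actual data.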
As shown in Figure 1, T1 is a replication table and T2 is a hash table.
- When you insert, modify, or delete data in a replication table, wrapping components that cannot be pushed down in a function declared as shippable or immutable may leave the data on different DNs inconsistent.
- If statements with unstable results, such as window functions, user-defined functions, ROWNUM, and LIMIT clauses, are used to insert or modify data in a replication table, the data on different nodes may differ.
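The two cautions above can be pictured with a toy model in which every DN evaluates the written value independently for its own copy of the table. This is a simplified assumption about execution, used only to show why unstable expressions are risky; the counter stands in for any function whose result changes between calls.

```python
import itertools

NUM_DNS = 3  # hypothetical cluster size

def replicated_insert(value_fn):
    """Model a pushed-down write: each DN calls value_fn for its own copy."""
    return [value_fn() for _ in range(NUM_DNS)]

# A stable expression yields the same value on every DN.
stable_copies = replicated_insert(lambda: 42)

# An unstable expression (here a counter) yields a different value per DN.
counter = itertools.count()
unstable_copies = replicated_insert(lambda: next(counter))

print("stable copies consistent:", len(set(stable_copies)) == 1)
print("unstable copies consistent:", len(set(unstable_copies)) == 1)
```

In this model the copies diverge as soon as the expression is not guaranteed to return the same result on every DN, which is the situation the cautions describe.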