Step 5: Selecting a Distribution Key
If your table is distributed using hash, select a proper distribution key to prevent data skew or poor I/O performance on certain DNs.
You are advised to select a distribution key for each hash table based on the following rules:
- The values of the distribution column should be discrete so that data can be evenly distributed on each DN. For example, select the primary key of a hash table as a distribution column or the ID card number as the distribution column in an employee information table.
- Do not select the column where a constant filter exists. For example, if a constant constraint (for example, zqdh= '000001') exists on the zqdh column in some queries on the dwcjk table, you are not advised to use zqdh as the distribution column.
- Select the join conditions in a query as distribution keys. In this way, Join tasks can be pushed down to DNs for execution, and the communication data between DNs is reduced.
Based on the excellent practice rules, select the primary key of each table as the hash table distribution key.
| Table Name | Number of Records | Distribution Mode | Distribution Key |
|---|---|---|---|
| Store_Sales | 287997024 | Hash | ss_item_sk |
| Date_Dim | 73049 | Replication | - |
| Store | 402 | Replication | - |
| Item | 204000 | Replication | - |
| Time_Dim | 86400 | Replication | - |
| Promotion | 1000 | Replication | - |
| Customer_Demographics | 1920800 | Hash | cd_demo_sk |
| Customer_Address | 1000000 | Hash | ca_address_sk |
| Household_Demographics | 7200 | Replication | - |
| Customer | 1981703 | Hash | c_customer_sk |
| Income_Band | 20 | Replication | - |
Last Article: Step 4: Selecting Distribution Modes
Next Article: Step 6: Creating Another Table and Loading Data
Did this article solve your problem?
Thank you for your score!Your feedback would help us improve the website.