Case: Selecting an Appropriate Distribution Column
Symptom
Tables are defined as follows:
1 2 | CREATE TABLE t1 (a int, b int);
CREATE TABLE t2 (a int, b int);
|
The following query is executed:
1 | SELECT * FROM t1, t2 WHERE t1.a = t2.b;
|
Optimization Analysis
If a is the distribution column of t1 and t2:
1 2 | CREATE TABLE t1 (a int, b int) DISTRIBUTE BY HASH (a);
CREATE TABLE t2 (a int, b int) DISTRIBUTE BY HASH (a);
|
Then Streaming exists in the execution plan and the data volume is heavy among DNs, as shown in Figure 1.
If a is the distribution column of t1 and b is the distribution column of t2:
1 2 | CREATE TABLE t1 (a int, b int) DISTRIBUTE BY HASH (a);
CREATE TABLE t2 (a int, b int) DISTRIBUTE BY HASH (b);
|
Then Streaming does not exist in the execution plan, and the data volume among DNs is decreasing and the query performance is increasing, as shown in Figure 2.
Last Article: Optimization Cases
Next Article: Case: Creating an Appropriate Index


Did this article solve your problem?
Thank you for your score!Your feedback would help us improve the website.