Case: Selecting an Appropriate Distribution Column
Symptom
Tables are defined as follows:
1 2 |
CREATE TABLE t1 (a int, b int); CREATE TABLE t2 (a int, b int); |
The following query is executed:
1
|
SELECT * FROM t1, t2 WHERE t1.a = t2.b; |
Optimization Analysis
If a is the distribution column of t1 and t2:
1 2 |
CREATE TABLE t1 (a int, b int) DISTRIBUTE BY HASH (a); CREATE TABLE t2 (a int, b int) DISTRIBUTE BY HASH (a); |
Then Streaming exists in the execution plan and the data volume is heavy among DNs, as shown in Figure 1.
If a is the distribution column of t1 and b is the distribution column of t2:
1 2 |
CREATE TABLE t1 (a int, b int) DISTRIBUTE BY HASH (a); CREATE TABLE t2 (a int, b int) DISTRIBUTE BY HASH (b); |
Then Streaming does not exist in the execution plan, and the data volume among DNs is decreasing and the query performance is increasing, as shown in Figure 2.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.