Help Center > > Developer Guide> Query Performance Optimization> Optimization Cases> Case: Selecting an Appropriate Distribution Column

Case: Selecting an Appropriate Distribution Column

Updated at: Jul 15, 2020 GMT+08:00

Symptom

Tables are defined as follows:

1
2
CREATE TABLE t1 (a int, b int);
CREATE TABLE t2 (a int, b int);

The following query is executed:

1
SELECT * FROM t1, t2 WHERE t1.a = t2.b;

Optimization Analysis

If a is the distribution column of t1 and t2:

1
2
CREATE TABLE t1 (a int, b int) DISTRIBUTE BY HASH (a);
CREATE TABLE t2 (a int, b int) DISTRIBUTE BY HASH (a);

Then Streaming exists in the execution plan and the data volume is heavy among DNs, as shown in Figure 1.

Figure 1 Selecting an appropriate distribution column (1)

If a is the distribution column of t1 and b is the distribution column of t2:

1
2
CREATE TABLE t1 (a int, b int) DISTRIBUTE BY HASH (a);
CREATE TABLE t2 (a int, b int) DISTRIBUTE BY HASH (b);

Then Streaming does not exist in the execution plan, and the data volume among DNs is decreasing and the query performance is increasing, as shown in Figure 2.

Figure 2 Selecting an appropriate distribution column (2)

Did you find this page helpful?

Submit successfully!

Thank you for your feedback. Your feedback helps make our documentation better.

Failed to submit the feedback. Please try again later.

Which of the following issues have you encountered?







Please complete at least one feedback item.

Content most length 200 character

Content is empty.

OK Cancel