多表连接场景下DISTINCT优化

对于多表连接+DISTINCT场景，MySQL 8.0需要扫描表连接后的结果，当表连接数量多或基表数据量大时，需要扫描的数据量很大，导致执行效率很低。

为了提升DISTINCT，尤其多表连接下DISTINCT的查询效率，GaussDB(for MySQL)在执行优化器中加入了剪枝功能，可以去除不必要的扫描分支，提升查询性能。

适用场景

Nested Loop Inner Join + Distinct
Nested Loop Outer Join + Distinct

使用须知

内核版本大于等于2.0.51.240300时可使用该功能。

开启多表连接DISTINCT优化

表1 参数说明
参数名称	级别	描述
rds_nlj_distinct_optimize	Global，Session	DISTINCT优化特性开关，默认值为OFF。 ON：开启DISTINCT优化特性。 OFF：关闭DISTINCT优化特性。

除了使用上述开关来控制优化特性生效或者不生效，还可以使用HINT来实现，语法如下。

开启DISTINCT优化特性
 /*+ SET_VAR(rds_nlj_distinct_optimize=ON) */
关闭 DISTINCT 优化特性
 /*+ SET_VAR(rds_nlj_distinct_optimize=OFF) */

使用示例

使用如下任意方式开启DISTINCT优化特性。

通过SET命令设置此开关值。

mysql> SET rds_nlj_distinct_optimize=ON;
Query OK, 0 rows affected (0.00 sec)

mysql> SET rds_nlj_distinct_optimize=OFF;
Query OK, 0 rows affected (0.00 sec)

通过HINT方式在SQL语句中设置开关值。

mysql> EXPLAIN ANALYZE SELECT/*+ SET_VAR(rds_nlj_distinct_optimize=ON) */ DISTINCT tt1.a FROM t1 AS tt1 JOIN t1 AS tt2 JOIN t1 AS tt3 ON tt2.a + 3 = tt3.a;

mysql> EXPLAIN ANALYZE SELECT/*+ SET_VAR(rds_nlj_distinct_optimize=OFF) */ DISTINCT tt1.a FROM t1 AS tt1 JOIN t1 AS tt2 JOIN t1 AS tt3 ON tt2.a + 3 = tt3.a;

确认多表连接场景下DISTINCT优化效果。

通过执行Explain Analyze/Explain Format=tree语句可以确认优化是否生效，执行计划出现'with distinct optimization'关键字时，说明优化生效。

具体步骤如下：

准备数据。

CREATE TABLE t1(a INT, KEY(a));
INSERT INTO t1 VALUES(1),(2),(5),(6),(7),(8),(9),(11);
ANALYZE TABLE t1;

关闭特性，执行以下SQL语句，优化器选择默认的执行计划。

mysql> SET rds_nlj_distinct_optimize=OFF;
Query OK, 0 rows affected (0.00 sec)

mysql> EXPLAIN FORMAT=TREE SELECT DISTINCT tt1.a FROM t1 AS tt1 LEFT JOIN t1 AS tt2 ON TRUE LEFT JOIN t1 AS tt3 ON tt2.a + 3 = tt3.a\G
*************************** 1. row ***************************
EXPLAIN: -> Table scan on <temporary>
    -> Temporary table with deduplication  (cost=29.18 rows=64)
        -> Nested loop left join  (cost=29.18 rows=64)
            -> Left hash join (no condition)  (cost=6.78 rows=64)
                -> Index scan on tt1 using a  (cost=1.05 rows=8)
                -> Hash
                    -> Index scan on tt2 using a  (cost=0.13 rows=8)
            -> Filter: ((tt2.a + 3) = tt3.a)  (cost=0.25 rows=1)
                -> Index lookup on tt3 using a (a=(tt2.a + 3))  (cost=0.25 rows=1)

开启特性，执行以下SQL语句，执行计划中可以看到有"with distinct optimization"关键字，说明此优化生效。

mysql> SET rds_nlj_distinct_optimize=ON;
Query OK, 0 rows affected (0.00 sec)

mysql> EXPLAIN FORMAT=TREE SELECT DISTINCT tt1.a FROM t1 AS tt1 LEFT JOIN t1 AS tt2 ON TRUE LEFT JOIN t1 AS tt3 ON tt2.a + 3 = tt3.a\G
*************************** 1. row ***************************
EXPLAIN: -> Table scan on <temporary>
    -> Temporary table with deduplication  (cost=29.18 rows=64)
        -> Nested loop left join with distinct optimization  (cost=29.18 rows=64)
            -> Left hash join (no condition)  (cost=6.78 rows=64)
                -> Index scan on tt1 using a  (cost=1.05 rows=8)
                -> Hash
                    -> Index scan on tt2 using a  (cost=0.13 rows=8)
            -> Filter: ((tt2.a + 3) = tt3.a)  (cost=0.25 rows=1)
                -> Index lookup on tt3 using a (a=(tt2.a + 3))  (cost=0.25 rows=1)

性能测试

GaussDB(for MySQL)执行耗时2.7秒完成，只需要扫描约61万行数据。相比MySQL 8.0 社区版本执行耗时约186秒，扫描数据量4400万，执行效率大大提升。

如下示例中，对7个表连接后的结果做DISTINCT，使用MySQL 8.0.30社区版本，执行耗时186秒，扫描了约4400万行数据。

GaussDB(for MySQL)执行耗时2.7秒，扫描约61万行数据。

查询语句：

select distinct ed.code,et.*
from ele_template et
left join ele_template_tenant ett on ett.template_id = et.id
left join ele_relation tm on tm.tom_id = et.id and tm.jerry_type = 'chapter'
left join ele_relation mv on mv.tom_id = tm.jerry_id and mv.jerry_type = 'variable'
left join ele_relation cv on cv.jerry_id = mv.jerry_id and cv.tom_type = 'column'
left join ele_doc_column edc on edc.id = cv.tom_id
left join ele_doc ed on ed.id = edc.doc_id
where ett.uctenantid = 'mmo0l3f8'
and ed.code = 'contract'
and et.billtype = 'contract'
order by ifnull(et.utime,et.ctime)
desc limit 0,10;

执行计划：

+----+-------------+-------+------------+--------+-------------------------+-----------------+---------+----------------------+------+----------+----------------------------------------------+
| id | select_type | table | partitions | type   | possible_keys           | key             | key_len | ref                  | rows | filtered | Extra                                        |
+----+-------------+-------+------------+--------+-------------------------+-----------------+---------+----------------------+------+----------+----------------------------------------------+
|  1 | SIMPLE      | ed    | NULL       | ref    | PRIMARY,idx_code        | idx_code        | 203     | const                |    1 |   100.00 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | ett   | NULL       | ref    | PRIMARY,idx_uctenanatid | idx_uctenanatid | 203     | const                |  352 |   100.00 | Using index                                  |
|  1 | SIMPLE      | et    | NULL       | eq_ref | PRIMARY,idx_billtype    | PRIMARY         | 8       | test.ett.template_id |    1 |    94.57 | Using where                                  |
|  1 | SIMPLE      | tm    | NULL       | ref    | idx_tom_id,idx_jerry_id | idx_tom_id      | 9       | test.ett.template_id |   59 |    10.00 | Using index condition; Using where; Distinct |
|  1 | SIMPLE      | mv    | NULL       | ref    | idx_tom_id,idx_jerry_id | idx_tom_id      | 9       | test.tm.jerry_id     |   59 |    10.00 | Using where; Distinct                        |
|  1 | SIMPLE      | cv    | NULL       | ref    | idx_tom_id,idx_jerry_id | idx_jerry_id    | 9       | test.mv.jerry_id     |   47 |    10.00 | Using where; Distinct                        |
|  1 | SIMPLE      | edc   | NULL       | eq_ref | PRIMARY,idx_doc_id      | PRIMARY         | 8       | test.cv.tom_id       |    1 |    50.00 | Using where; Distinct                        |
+----+-------------+-------+------------+--------+-------------------------+-----------------+---------+----------------------+------+----------+----------------------------------------------+

图1 执行耗时对比