Case: Using Global Secondary Indexes to Accelerate Query

Index-only Scanning of Global Secondary Indexes

When non-distribution keys in the base table are queried, DN pushdown query or index scanning cannot be performed, deteriorating performance problems.

Scenario 1: common query

      
       
         
         
           gaussdb=# CREATE TABLE t (a INT, b INT, c INT) DISTRIBUTE BY HASH(a);
CREATE TABLE
gaussdb=# INSERT INTO t VALUES(generate_series(1,1000), generate_series(1,1000), generate_series(1,1000));
INSERT 0 1000
gaussdb=# CREATE INDEX i ON t(b);
CREATE INDEX
gaussdb=# SET MAX_DATANODE_FOR_PLAN = 1;
SET
gaussdb=# SET ENABLE_FAST_QUERY_SHIPPING = TRUE;
SET
gaussdb=# SET ENABLE_LIGHT_PROXY = TRUE;
SET
gaussdb=# SET ENABLE_STREAM_OPERATOR = FALSE;
SET
gaussdb=# -- Point query. The query condition is not a distribution key. Push down all DNs and use ordinary indexes for query on DNs.
gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF)  SELECT b FROM t WHERE b = 1;
                     QUERY PLAN                     
----------------------------------------------------
 Data Node Scan
   Output: t.b
   Node/s: All datanodes
   Remote query: SELECT b FROM public.t WHERE b = 1

 Remote SQL: SELECT b FROM public.t WHERE b = 1
 Datanode Name: datanode1
   [Bypass]
   Index Only Scan using i on public.t
     Output: b
     Index Cond: (t.b = 1)

(12 rows)

gaussdb=#  -- Sequentially scan all DNs in a range query.
gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF)  SELECT b FROM t WHERE b <= 10;
                      QUERY PLAN                      
------------------------------------------------------
 Data Node Scan
   Output: t.b
   Node/s: All datanodes
   Remote query: SELECT b FROM public.t WHERE b <= 10

 Remote SQL: SELECT b FROM public.t WHERE b <= 10
 Datanode Name: datanode1
   Seq Scan on public.t
     Output: b
     Filter: (t.b <= 10)

(11 rows)

          

        

      
     

You can create a global secondary index to push down the preceding query to DNs and use index scanning to accelerate the query. (To use the GSI to optimize query, enable ENABLE_GSISCAN parameter and ENABLE_FAST_QUERY_SHIPPING parameter.)

gaussdb=# DROP INDEX i;
DROP INDEX
gaussdb=# -- Create a global secondary index.
gaussdb=# CREATE GLOBAL INDEX gsi_t ON t(b);
NOTICE:  The 'DISTRIBUTE BY' clause is not specified. Using 'b' as the distribution column by default.
HINT:  Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column.
CREATE GLOBAL INDEX
gaussdb=#  -- Point query. The query condition falls on the distribution key of the global secondary index. A single DN is pushed down and the index is used to scan the query.
gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT b FROM t WHERE b = 1;
                         QUERY PLAN                          
-------------------------------------------------------------
 Data Node Scan
   Output: gsi_t.b
   Node/s: datanode1
   Remote query: SELECT b FROM ONLY public.gsi_t WHERE b = 1

 Remote SQL: SELECT b FROM ONLY public.gsi_t WHERE b = 1
 Datanode Name: datanode1
   [Bypass]
   Index Only Scan using gsi_t on public.gsi_t
     Output: b
     Index Cond: (gsi_t.b = 1)

(12 rows)

gaussdb=#  -- Range query. All DNs are pushed down and the index is used to scan the query.
gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT b FROM t WHERE b <= 10;
                          QUERY PLAN                           
---------------------------------------------------------------
 Data Node Scan
   Output: gsi_t.b
   Node/s: All datanodes
   Remote query: SELECT b FROM ONLY public.gsi_t WHERE b <= 10

 Remote SQL: SELECT b FROM ONLY public.gsi_t WHERE b <= 10
 Datanode Name: datanode1
   [Bypass]
   Index Only Scan using gsi_t on public.gsi_t
     Output: b
     Index Cond: (gsi_t.b <= 10)

(12 rows)

Scenario 2: Multi-table join query.

gaussdb=# DCREATE TABLE t1(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT) DISTRIBUTE BY HASH(c1);
CREATE TABLE
gaussdb=# DINSERT INTO t1 VALUES (GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10));
INSERT 0 10
gaussdb=# DCREATE TABLE t2(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT) DISTRIBUTE BY HASH(c1);
CREATE TABLE
gaussdb=# DINSERT INTO t2 VALUES (3, GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10));
INSERT 0 10
gaussdb=#  -- Create an ordinary index.
gaussdb=# CREATE INDEX i_t1 ON t1(c2);
CREATE INDEX
gaussdb=#  -- Join two tables. The STREAM plan is used for query and no DN is pushed down.
gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT t1.c2 FROM t1, t2 WHERE t1.c2 = t2.c1;
                           QUERY PLAN                            
-----------------------------------------------------------------
 Streaming (type: GATHER)
   Output: t1.c2
   Node/s: All datanodes
   ->  Hash Join
         Output: t1.c2
         Hash Cond: (t2.c1 = t1.c2)
         ->  Seq Scan on public.t2
               Output: t2.c1, t2.c2, t2.c3, t2.c4, t2.c5
               Distribute Key: t2.c1
         ->  Hash
               Output: t1.c2
               ->  Streaming(type: REDISTRIBUTE)
                     Output: t1.c2
                     Distribute Key: t1.c2
                     Spawn on: All datanodes
                     Consumer Nodes: All datanodes
                     ->  Index Only Scan using i_t1 on public.t1
                           Output: t1.c2
                           Distribute Key: t1.c1
(19 rows)

You can create a global secondary index on the associated column to push down the query to a DN.

gaussdb=# : Create a GSI.
gaussdb=# CREATE GLOBAL INDEX gsi_t1 ON t1(c2);
NOTICE:  The 'DISTRIBUTE BY' clause is not specified. Using 'c2' as the distribution column by default.
HINT:  Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column.
CREATE GLOBAL INDEX
gaussdb=#  -- Join two tables and use the GSI to push down all DNs.
gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT t1.c2 FROM t1, t2 WHERE t1.c2 = t2.c1;
                                         QUERY PLAN                                          
---------------------------------------------------------------------------------------------
 Data Node Scan
   Output: gsi_t1.c2
   Node/s: All datanodes
   Remote query: SELECT gsi_t1.c2 FROM ONLY public.gsi_t1, public.t2 WHERE gsi_t1.c2 = t2.c1

 Remote SQL: SELECT gsi_t1.c2 FROM ONLY public.gsi_t1, public.t2 WHERE gsi_t1.c2 = t2.c1
 Datanode Name: datanode1
   Hash Join
     Output: gsi_t1.c2
     Hash Cond: (t2.c1 = gsi_t1.c2)
     ->  Seq Scan on public.t2
           Output: t2.c1, t2.c2, t2.c3, t2.c4, t2.c5
     ->  Hash
           Output: gsi_t1.c2
           ->  Index Only Scan using gsi_t1 on public.gsi_t1
                 Output: gsi_t1.c2

(17 rows)

Scenario 3: Subquery.

gaussdb=# CREATE TABLE t3(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT) DISTRIBUTE BY HASH(c1);
CREATE TABLE
gaussdb=# INSERT INTO t3 VALUES (GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10));
INSERT 0 10
gaussdb=# CREATE TABLE t4(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT) DISTRIBUTE BY HASH(c1);
CREATE TABLE
gaussdb=# INSERT INTO t4 VALUES (3, GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10));
INSERT 0 10
gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT t4.c2 FROM t4, (SELECT c2 FROM t3) aa WHERE aa.c2 = t4.c1;
                 QUERY PLAN                  
---------------------------------------------
 Streaming (type: GATHER)
   Output: t4.c2
   Node/s: All datanodes
   ->  Hash Join
         Output: t4.c2
         Hash Cond: (t3.c2 = t4.c1)
         ->  Streaming(type: REDISTRIBUTE)
               Output: t3.c2
               Distribute Key: t3.c2
               Spawn on: All datanodes
               Consumer Nodes: All datanodes
               ->  Seq Scan on public.t3
                     Output: t3.c2
                     Distribute Key: t3.c1
         ->  Hash
               Output: t4.c2, t4.c1
               ->  Seq Scan on public.t4
                     Output: t4.c2, t4.c1
                     Distribute Key: t4.c1
(19 rows)

You can create a global secondary index to push down the query to a DN. In this case, GSI force hint may be increased in the query to use the global secondary index.

gaussdb=# CREATE GLOBAL INDEX gsi_t3 ON t3(c2) CONTAINING(c3, c4) DISTRIBUTE BY HASH(c2);
CREATE GLOBAL INDEX
gaussdb=#  -- Use GSI force hint to push down all DNs.
gaussdb=# -- GSI force hint is not applicable and pushdown cannot be performed.
gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT t4.c2 FROM t4, (SELECT c2 FROM t3) aa WHERE aa.c2 = t4.c1;
                         QUERY PLAN                          
-------------------------------------------------------------
 Streaming (type: GATHER)
   Output: t4.c2
   Node/s: All datanodes
   ->  Hash Join
         Output: t4.c2
         Hash Cond: (t4.c1 = t3.c2)
         ->  Seq Scan on public.t4
               Output: t4.c1, t4.c2, t4.c3, t4.c4, t4.c5
               Distribute Key: t4.c1
         ->  Hash
               Output: t3.c2
               ->  Index Only Scan using gsi_t3 on public.t3
                     Output: t3.c2
                     Distribute Key: t3.c2
(14 rows)

gaussdb=#  -- Use GSI force hint to push down all DNs.
gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT /*+gsi()*/ t4.c2 FROM t4, (SELECT c2 FROM t3) aa WHERE aa.c2 = t4.c1;
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Data Node Scan
   Output: t4.c2
   Node/s: All datanodes
   Remote query: SELECT/*+ gsi()*/ t4.c2 FROM public.t4, (SELECT gsi_t3.c2 FROM ONLY public.gsi_t3) aa WHERE aa.c2 = t4.c1

 Remote SQL: SELECT/*+ gsi()*/ t4.c2 FROM public.t4, (SELECT gsi_t3.c2 FROM ONLY public.gsi_t3) aa WHERE aa.c2 = t4.c1
 Datanode Name: datanode1
   Hash Join
     Output: t4.c2
     Hash Cond: (t4.c1 = gsi_t3.c2)
     ->  Seq Scan on public.t4
           Output: t4.c1, t4.c2, t4.c3, t4.c4, t4.c5
     ->  Hash
           Output: gsi_t3.c2
           ->  Index Only Scan using gsi_t3 on public.gsi_t3
                 Output: gsi_t3.c2

(17 rows)

Global Secondary Index Scanning for Table Access by Index Row ID

STREAM distributed plan: Create a base table of the global secondary index. During query, the global secondary index needs to be returned to the table. If the predicate hits a large number of rows (for example, range query), the STREAM distributed plan is preferred. Example:

gaussdb=#  -- Enable the STEAM mode.
gaussdb=#  SET ENABLE_STREAM_OPERATOR = ON;
SET
gaussdb=#  SET EXPLAIN_PERF_MODE=PRETTY;
SET
gaussdb=#  CREATE TABLE test(c1 INT, c2 INT, c3 INT);
NOTICE:  The 'DISTRIBUTE BY' clause is not specified. Using 'c1' as the distribution column by default.
HINT:  Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column.
CREATE TABLE
gaussdb=# -- Create a global secondary index on column c2 for the test table.
gaussdb=#  CREATE GLOBAL INDEX gsi_test ON test(c2);
NOTICE:  The 'DISTRIBUTE BY' clause is not specified. Using 'c2' as the distribution column by default.
HINT:  Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column.
CREATE GLOBAL INDEX
gaussdb=#  -- Use the global secondary index for STREAM distributed plan.
gaussdb=#  EXPLAIN(VERBOSE ON, COSTS OFF) SELECT /*+ gsitable(test gsi_test)*/ * FROM test WHERE c2 > 1;
WARNING:  Statistics in some tables or columns(public.test.c1, public.test.c2) are not collected.
HINT:  Do analyze for them in order to generate optimized plan.
 id |                        operation                         
----+----------------------------------------------------------
  1 | ->  Streaming (type: GATHER)
  2 |    ->  GSI Tid Scan on public.test
  3 |       ->  Streaming(type: REDISTRIBUTE GSI)
  4 |          ->  GSI Only Scan using gsi_test on public.test
(4 rows)
   Predicate Information (identified by plan id)   
---------------------------------------------------
   4 --GSI Only Scan using gsi_test on public.test
         Index Cond: (test.c2 > 1)
(2 rows)
  Targetlist Information (identified by plan id)   
---------------------------------------------------
   1 --Streaming (type: GATHER)
         Output: c1, c2, c3
         Node/s: All datanodes
   2 --GSI Tid Scan on public.test
         Output: c1, c2, c3
         Distribute Key: c1
   3 --Streaming(type: REDISTRIBUTE GSI)
         Output: xc_node_hash, ctid
         Spawn on: All datanodes
         Consumer Nodes: All datanodes
   4 --GSI Only Scan using gsi_test on public.test
         Output: xc_node_hash, ctid
         Distribute Key: c2
(13 rows)

Non-STREAM distributed plan: Create a base table of the global secondary index. During query, the global secondary index needs to be returned to the table. If the predicate hits a few rows (for example, point query), the non-STREAM distributed plan is preferred. Example:

gaussdb=#  -- Disable the STEAM mode.
gaussdb=#  SET ENABLE_STREAM_OPERATOR = OFF;
SET
gaussdb=#  CREATE TABLE test(c1 INT, c2 INT, c3 INT);
NOTICE:  The 'DISTRIBUTE BY' clause is not specified. Using 'c1' as the distribution column by default.
HINT:  Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column.
CREATE TABLE
gaussdb=# -- Create a global secondary index on column c2 for the test table.
gaussdb=#  CREATE GLOBAL INDEX gsi_test ON test(c2);
NOTICE:  The 'DISTRIBUTE BY' clause is not specified. Using 'c2' as the distribution column by default.
HINT:  Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column.
CREATE GLOBAL INDEX
gaussdb=#  -- Use the global secondary index for non-STREAM distributed plan.
gaussdb=#  EXPLAIN(VERBOSE ON, COSTS OFF) SELECT /*+ gsitable(test gsi_test)*/ * FROM test WHERE c2 = 1;
                                    QUERY PLAN                                     
-----------------------------------------------------------------------------------
 Select on public.test
   Output: test.c1, test.c2, test.c3
   Node/s: All datanodes
   Remote query: SELECT c1, c2, c3 FROM ONLY public.test WHERE ctid = $2
   ->  Data Node Scan on gsi_test "REMOTE_TABLE_QUERY"
         Output: gsi_test.xc_node_hash, gsi_test.ctid
         Node/s: datanode1
         Remote query: SELECT xc_node_hash, ctid FROM public.gsi_test WHERE c2 = 1
(8 rows)

Parent topic: Optimization Cases

Previous topic: Case: Adjusting the GUC Parameter costbased_rewrite_rule for Cost-based Query Rewriting

Next topic: SQL Reference