Case: Using Global Secondary Indexes to Accelerate Query
Index-only Scanning of Global Secondary Indexes
When non-distribution keys in the base table are queried, DN pushdown query or index scanning cannot be performed, deteriorating performance problems.
- Scenario 1: common query
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
gaussdb=# CREATE TABLE t (a INT, b INT, c INT) DISTRIBUTE BY HASH(a); CREATE TABLE gaussdb=# INSERT INTO t VALUES(generate_series(1,1000), generate_series(1,1000), generate_series(1,1000)); INSERT 0 1000 gaussdb=# CREATE INDEX i ON t(b); CREATE INDEX gaussdb=# SET MAX_DATANODE_FOR_PLAN = 1; SET gaussdb=# SET ENABLE_FAST_QUERY_SHIPPING = TRUE; SET gaussdb=# SET ENABLE_LIGHT_PROXY = TRUE; SET gaussdb=# SET ENABLE_STREAM_OPERATOR = FALSE; SET gaussdb=# -- Point query. The query condition is not a distribution key. Push down all DNs and use ordinary indexes for query on DNs. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT b FROM t WHERE b = 1; QUERY PLAN ---------------------------------------------------- Data Node Scan Output: t.b Node/s: All datanodes Remote query: SELECT b FROM public.t WHERE b = 1 Remote SQL: SELECT b FROM public.t WHERE b = 1 Datanode Name: datanode1 [Bypass] Index Only Scan using i on public.t Output: b Index Cond: (t.b = 1) (12 rows) gaussdb=# -- Sequentially scan all DNs in a range query. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT b FROM t WHERE b <= 10; QUERY PLAN ------------------------------------------------------ Data Node Scan Output: t.b Node/s: All datanodes Remote query: SELECT b FROM public.t WHERE b <= 10 Remote SQL: SELECT b FROM public.t WHERE b <= 10 Datanode Name: datanode1 Seq Scan on public.t Output: b Filter: (t.b <= 10) (11 rows)
You can create a global secondary index to push down the preceding query to DNs and use index scanning to accelerate the query. (To use the GSI to optimize query, enable ENABLE_GSISCAN parameter and ENABLE_FAST_QUERY_SHIPPING parameter.)gaussdb=# DROP INDEX i; DROP INDEX gaussdb=# -- Create a global secondary index. gaussdb=# CREATE GLOBAL INDEX gsi_t ON t(b); NOTICE: The 'DISTRIBUTE BY' clause is not specified. Using 'b' as the distribution column by default. HINT: Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column. CREATE GLOBAL INDEX gaussdb=# -- Point query. The query condition falls on the distribution key of the global secondary index. A single DN is pushed down and the index is used to scan the query. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT b FROM t WHERE b = 1; QUERY PLAN ------------------------------------------------------------- Data Node Scan Output: gsi_t.b Node/s: datanode1 Remote query: SELECT b FROM ONLY public.gsi_t WHERE b = 1 Remote SQL: SELECT b FROM ONLY public.gsi_t WHERE b = 1 Datanode Name: datanode1 [Bypass] Index Only Scan using gsi_t on public.gsi_t Output: b Index Cond: (gsi_t.b = 1) (12 rows) gaussdb=# -- Range query. All DNs are pushed down and the index is used to scan the query. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT b FROM t WHERE b <= 10; QUERY PLAN --------------------------------------------------------------- Data Node Scan Output: gsi_t.b Node/s: All datanodes Remote query: SELECT b FROM ONLY public.gsi_t WHERE b <= 10 Remote SQL: SELECT b FROM ONLY public.gsi_t WHERE b <= 10 Datanode Name: datanode1 [Bypass] Index Only Scan using gsi_t on public.gsi_t Output: b Index Cond: (gsi_t.b <= 10) (12 rows)
- Scenario 2: Multi-table join query.
gaussdb=# DCREATE TABLE t1(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT) DISTRIBUTE BY HASH(c1); CREATE TABLE gaussdb=# DINSERT INTO t1 VALUES (GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10)); INSERT 0 10 gaussdb=# DCREATE TABLE t2(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT) DISTRIBUTE BY HASH(c1); CREATE TABLE gaussdb=# DINSERT INTO t2 VALUES (3, GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10)); INSERT 0 10 gaussdb=# -- Create an ordinary index. gaussdb=# CREATE INDEX i_t1 ON t1(c2); CREATE INDEX gaussdb=# -- Join two tables. The STREAM plan is used for query and no DN is pushed down. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT t1.c2 FROM t1, t2 WHERE t1.c2 = t2.c1; QUERY PLAN ----------------------------------------------------------------- Streaming (type: GATHER) Output: t1.c2 Node/s: All datanodes -> Hash Join Output: t1.c2 Hash Cond: (t2.c1 = t1.c2) -> Seq Scan on public.t2 Output: t2.c1, t2.c2, t2.c3, t2.c4, t2.c5 Distribute Key: t2.c1 -> Hash Output: t1.c2 -> Streaming(type: REDISTRIBUTE) Output: t1.c2 Distribute Key: t1.c2 Spawn on: All datanodes Consumer Nodes: All datanodes -> Index Only Scan using i_t1 on public.t1 Output: t1.c2 Distribute Key: t1.c1 (19 rows)
You can create a global secondary index on the associated column to push down the query to a DN.
gaussdb=# : Create a GSI. gaussdb=# CREATE GLOBAL INDEX gsi_t1 ON t1(c2); NOTICE: The 'DISTRIBUTE BY' clause is not specified. Using 'c2' as the distribution column by default. HINT: Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column. CREATE GLOBAL INDEX gaussdb=# -- Join two tables and use the GSI to push down all DNs. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT t1.c2 FROM t1, t2 WHERE t1.c2 = t2.c1; QUERY PLAN --------------------------------------------------------------------------------------------- Data Node Scan Output: gsi_t1.c2 Node/s: All datanodes Remote query: SELECT gsi_t1.c2 FROM ONLY public.gsi_t1, public.t2 WHERE gsi_t1.c2 = t2.c1 Remote SQL: SELECT gsi_t1.c2 FROM ONLY public.gsi_t1, public.t2 WHERE gsi_t1.c2 = t2.c1 Datanode Name: datanode1 Hash Join Output: gsi_t1.c2 Hash Cond: (t2.c1 = gsi_t1.c2) -> Seq Scan on public.t2 Output: t2.c1, t2.c2, t2.c3, t2.c4, t2.c5 -> Hash Output: gsi_t1.c2 -> Index Only Scan using gsi_t1 on public.gsi_t1 Output: gsi_t1.c2 (17 rows)
- Scenario 3: Subquery.
gaussdb=# CREATE TABLE t3(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT) DISTRIBUTE BY HASH(c1); CREATE TABLE gaussdb=# INSERT INTO t3 VALUES (GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10)); INSERT 0 10 gaussdb=# CREATE TABLE t4(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT) DISTRIBUTE BY HASH(c1); CREATE TABLE gaussdb=# INSERT INTO t4 VALUES (3, GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10)); INSERT 0 10 gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT t4.c2 FROM t4, (SELECT c2 FROM t3) aa WHERE aa.c2 = t4.c1; QUERY PLAN --------------------------------------------- Streaming (type: GATHER) Output: t4.c2 Node/s: All datanodes -> Hash Join Output: t4.c2 Hash Cond: (t3.c2 = t4.c1) -> Streaming(type: REDISTRIBUTE) Output: t3.c2 Distribute Key: t3.c2 Spawn on: All datanodes Consumer Nodes: All datanodes -> Seq Scan on public.t3 Output: t3.c2 Distribute Key: t3.c1 -> Hash Output: t4.c2, t4.c1 -> Seq Scan on public.t4 Output: t4.c2, t4.c1 Distribute Key: t4.c1 (19 rows)
You can create a global secondary index to push down the query to a DN. In this case, GSI force hint may be increased in the query to use the global secondary index.
gaussdb=# CREATE GLOBAL INDEX gsi_t3 ON t3(c2) CONTAINING(c3, c4) DISTRIBUTE BY HASH(c2); CREATE GLOBAL INDEX gaussdb=# -- Use GSI force hint to push down all DNs. gaussdb=# -- GSI force hint is not applicable and pushdown cannot be performed. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT t4.c2 FROM t4, (SELECT c2 FROM t3) aa WHERE aa.c2 = t4.c1; QUERY PLAN ------------------------------------------------------------- Streaming (type: GATHER) Output: t4.c2 Node/s: All datanodes -> Hash Join Output: t4.c2 Hash Cond: (t4.c1 = t3.c2) -> Seq Scan on public.t4 Output: t4.c1, t4.c2, t4.c3, t4.c4, t4.c5 Distribute Key: t4.c1 -> Hash Output: t3.c2 -> Index Only Scan using gsi_t3 on public.t3 Output: t3.c2 Distribute Key: t3.c2 (14 rows) gaussdb=# -- Use GSI force hint to push down all DNs. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT /*+gsi()*/ t4.c2 FROM t4, (SELECT c2 FROM t3) aa WHERE aa.c2 = t4.c1; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Data Node Scan Output: t4.c2 Node/s: All datanodes Remote query: SELECT/*+ gsi()*/ t4.c2 FROM public.t4, (SELECT gsi_t3.c2 FROM ONLY public.gsi_t3) aa WHERE aa.c2 = t4.c1 Remote SQL: SELECT/*+ gsi()*/ t4.c2 FROM public.t4, (SELECT gsi_t3.c2 FROM ONLY public.gsi_t3) aa WHERE aa.c2 = t4.c1 Datanode Name: datanode1 Hash Join Output: t4.c2 Hash Cond: (t4.c1 = gsi_t3.c2) -> Seq Scan on public.t4 Output: t4.c1, t4.c2, t4.c3, t4.c4, t4.c5 -> Hash Output: gsi_t3.c2 -> Index Only Scan using gsi_t3 on public.gsi_t3 Output: gsi_t3.c2 (17 rows)
Global Secondary Index Scanning for Table Access by Index Row ID
- STREAM distributed plan: Create a base table of the global secondary index. During query, the global secondary index needs to be returned to the table. If the predicate hits a large number of rows (for example, range query), the STREAM distributed plan is preferred. Example:
gaussdb=# -- Enable the STEAM mode. gaussdb=# SET ENABLE_STREAM_OPERATOR = ON; SET gaussdb=# SET EXPLAIN_PERF_MODE=PRETTY; SET gaussdb=# CREATE TABLE test(c1 INT, c2 INT, c3 INT); NOTICE: The 'DISTRIBUTE BY' clause is not specified. Using 'c1' as the distribution column by default. HINT: Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column. CREATE TABLE gaussdb=# -- Create a global secondary index on column c2 for the test table. gaussdb=# CREATE GLOBAL INDEX gsi_test ON test(c2); NOTICE: The 'DISTRIBUTE BY' clause is not specified. Using 'c2' as the distribution column by default. HINT: Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column. CREATE GLOBAL INDEX gaussdb=# -- Use the global secondary index for STREAM distributed plan. gaussdb=# EXPLAIN(VERBOSE ON, COSTS OFF) SELECT /*+ gsitable(test gsi_test)*/ * FROM test WHERE c2 > 1; WARNING: Statistics in some tables or columns(public.test.c1, public.test.c2) are not collected. HINT: Do analyze for them in order to generate optimized plan. id | operation ----+---------------------------------------------------------- 1 | -> Streaming (type: GATHER) 2 | -> GSI Tid Scan on public.test 3 | -> Streaming(type: REDISTRIBUTE GSI) 4 | -> GSI Only Scan using gsi_test on public.test (4 rows) Predicate Information (identified by plan id) --------------------------------------------------- 4 --GSI Only Scan using gsi_test on public.test Index Cond: (test.c2 > 1) (2 rows) Targetlist Information (identified by plan id) --------------------------------------------------- 1 --Streaming (type: GATHER) Output: c1, c2, c3 Node/s: All datanodes 2 --GSI Tid Scan on public.test Output: c1, c2, c3 Distribute Key: c1 3 --Streaming(type: REDISTRIBUTE GSI) Output: xc_node_hash, ctid Spawn on: All datanodes Consumer Nodes: All datanodes 4 --GSI Only Scan using gsi_test on public.test Output: xc_node_hash, ctid Distribute Key: c2 (13 rows)
- Non-STREAM distributed plan: Create a base table of the global secondary index. During query, the global secondary index needs to be returned to the table. If the predicate hits a few rows (for example, point query), the non-STREAM distributed plan is preferred. Example:
gaussdb=# -- Disable the STEAM mode. gaussdb=# SET ENABLE_STREAM_OPERATOR = OFF; SET gaussdb=# CREATE TABLE test(c1 INT, c2 INT, c3 INT); NOTICE: The 'DISTRIBUTE BY' clause is not specified. Using 'c1' as the distribution column by default. HINT: Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column. CREATE TABLE gaussdb=# -- Create a global secondary index on column c2 for the test table. gaussdb=# CREATE GLOBAL INDEX gsi_test ON test(c2); NOTICE: The 'DISTRIBUTE BY' clause is not specified. Using 'c2' as the distribution column by default. HINT: Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column. CREATE GLOBAL INDEX gaussdb=# -- Use the global secondary index for non-STREAM distributed plan. gaussdb=# EXPLAIN(VERBOSE ON, COSTS OFF) SELECT /*+ gsitable(test gsi_test)*/ * FROM test WHERE c2 = 1; QUERY PLAN ----------------------------------------------------------------------------------- Select on public.test Output: test.c1, test.c2, test.c3 Node/s: All datanodes Remote query: SELECT c1, c2, c3 FROM ONLY public.test WHERE ctid = $2 -> Data Node Scan on gsi_test "REMOTE_TABLE_QUERY" Output: gsi_test.xc_node_hash, gsi_test.ctid Node/s: datanode1 Remote query: SELECT xc_node_hash, ctid FROM public.gsi_test WHERE c2 = 1 (8 rows)
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.