Case: Using Global Secondary Indexes to Accelerate Query
Index-only Scanning of Global Secondary Indexes
When non-distribution keys in the base table are queried, DN pushdown query or index scanning cannot be performed, deteriorating performance problems.
- Scenario 1: common query
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
gaussdb=# CREATE TABLE t (a INT, b INT, c INT) DISTRIBUTE BY HASH(a); CREATE TABLE gaussdb=# INSERT INTO t VALUES(generate_series(1,1000), generate_series(1,1000), generate_series(1,1000)); INSERT 0 1000 gaussdb=# CREATE INDEX i ON t(b); CREATE INDEX gaussdb=# SET MAX_DATANODE_FOR_PLAN = 1; SET gaussdb=# SET ENABLE_FAST_QUERY_SHIPPING = TRUE; SET gaussdb=# SET ENABLE_LIGHT_PROXY = TRUE; SET gaussdb=# SET ENABLE_STREAM_OPERATOR = FALSE; SET gaussdb=# -- Point query. The query condition is not a distribution key. Push down all DNs and use ordinary indexes for query on DNs. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT b FROM t WHERE b = 1; QUERY PLAN ---------------------------------------------------- Data Node Scan Output: t.b Node/s: All datanodes Remote query: SELECT b FROM public.t WHERE b = 1 Remote SQL: SELECT b FROM public.t WHERE b = 1 Datanode Name: datanode1 [Bypass] Index Only Scan using i on public.t Output: b Index Cond: (t.b = 1) (12 rows) gaussdb=# -- Sequentially scan all DNs in a range query. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT b FROM t WHERE b <= 10; QUERY PLAN ------------------------------------------------------ Data Node Scan Output: t.b Node/s: All datanodes Remote query: SELECT b FROM public.t WHERE b <= 10 Remote SQL: SELECT b FROM public.t WHERE b <= 10 Datanode Name: datanode1 Seq Scan on public.t Output: b Filter: (t.b <= 10) (11 rows)
You can create a global secondary index to push down the preceding query to DNs and use index scanning to accelerate the query. (To use the GSI to optimize query, enable ENABLE_GSISCAN parameter and ENABLE_FAST_QUERY_SHIPPING parameter.)gaussdb=# DROP INDEX i; DROP INDEX gaussdb=# -- Create a global secondary index. gaussdb=# CREATE GLOBAL INDEX gsi_t ON t(b); NOTICE: The 'DISTRIBUTE BY' clause is not specified. Using 'b' as the distribution column by default. HINT: Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column. CREATE GLOBAL INDEX gaussdb=# -- Point query. The query condition falls on the distribution key of the global secondary index. A single DN is pushed down and the index is used to scan the query. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT b FROM t WHERE b = 1; QUERY PLAN ------------------------------------------------------------- Data Node Scan Output: gsi_t.b Node/s: datanode1 Remote query: SELECT b FROM ONLY public.gsi_t WHERE b = 1 Remote SQL: SELECT b FROM ONLY public.gsi_t WHERE b = 1 Datanode Name: datanode1 [Bypass] Index Only Scan using gsi_t on public.gsi_t Output: b Index Cond: (gsi_t.b = 1) (12 rows) gaussdb=# -- Range query. All DNs are pushed down and the index is used to scan the query. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT b FROM t WHERE b <= 10; QUERY PLAN --------------------------------------------------------------- Data Node Scan Output: gsi_t.b Node/s: All datanodes Remote query: SELECT b FROM ONLY public.gsi_t WHERE b <= 10 Remote SQL: SELECT b FROM ONLY public.gsi_t WHERE b <= 10 Datanode Name: datanode1 [Bypass] Index Only Scan using gsi_t on public.gsi_t Output: b Index Cond: (gsi_t.b <= 10) (12 rows)
- Scenario 2: Multi-table join query.
gaussdb=# DCREATE TABLE t1(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT) DISTRIBUTE BY HASH(c1); CREATE TABLE gaussdb=# DINSERT INTO t1 VALUES (GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10)); INSERT 0 10 gaussdb=# DCREATE TABLE t2(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT) DISTRIBUTE BY HASH(c1); CREATE TABLE gaussdb=# DINSERT INTO t2 VALUES (3, GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10)); INSERT 0 10 gaussdb=# -- Create an ordinary index. gaussdb=# CREATE INDEX i_t1 ON t1(c2); CREATE INDEX gaussdb=# -- Join two tables. The STREAM plan is used for query and no DN is pushed down. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT t1.c2 FROM t1, t2 WHERE t1.c2 = t2.c1; QUERY PLAN ----------------------------------------------------------------- Streaming (type: GATHER) Output: t1.c2 Node/s: All datanodes -> Hash Join Output: t1.c2 Hash Cond: (t2.c1 = t1.c2) -> Seq Scan on public.t2 Output: t2.c1, t2.c2, t2.c3, t2.c4, t2.c5 Distribute Key: t2.c1 -> Hash Output: t1.c2 -> Streaming(type: REDISTRIBUTE) Output: t1.c2 Distribute Key: t1.c2 Spawn on: All datanodes Consumer Nodes: All datanodes -> Index Only Scan using i_t1 on public.t1 Output: t1.c2 Distribute Key: t1.c1 (19 rows)
You can create a global secondary index on the associated column to push down the query to a DN.
gaussdb=# : Create a GSI. gaussdb=# CREATE GLOBAL INDEX gsi_t1 ON t1(c2); NOTICE: The 'DISTRIBUTE BY' clause is not specified. Using 'c2' as the distribution column by default. HINT: Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column. CREATE GLOBAL INDEX gaussdb=# -- Join two tables and use the GSI to push down all DNs. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT t1.c2 FROM t1, t2 WHERE t1.c2 = t2.c1; QUERY PLAN --------------------------------------------------------------------------------------------- Data Node Scan Output: gsi_t1.c2 Node/s: All datanodes Remote query: SELECT gsi_t1.c2 FROM ONLY public.gsi_t1, public.t2 WHERE gsi_t1.c2 = t2.c1 Remote SQL: SELECT gsi_t1.c2 FROM ONLY public.gsi_t1, public.t2 WHERE gsi_t1.c2 = t2.c1 Datanode Name: datanode1 Hash Join Output: gsi_t1.c2 Hash Cond: (t2.c1 = gsi_t1.c2) -> Seq Scan on public.t2 Output: t2.c1, t2.c2, t2.c3, t2.c4, t2.c5 -> Hash Output: gsi_t1.c2 -> Index Only Scan using gsi_t1 on public.gsi_t1 Output: gsi_t1.c2 (17 rows)
- Scenario 3: Subquery.
gaussdb=# CREATE TABLE t3(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT) DISTRIBUTE BY HASH(c1); CREATE TABLE gaussdb=# INSERT INTO t3 VALUES (GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10)); INSERT 0 10 gaussdb=# CREATE TABLE t4(c1 INT, c2 INT, c3 INT, c4 INT, c5 INT) DISTRIBUTE BY HASH(c1); CREATE TABLE gaussdb=# INSERT INTO t4 VALUES (3, GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10), GENERATE_SERIES(1,10)); INSERT 0 10 gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT t4.c2 FROM t4, (SELECT c2 FROM t3) aa WHERE aa.c2 = t4.c1; QUERY PLAN --------------------------------------------- Streaming (type: GATHER) Output: t4.c2 Node/s: All datanodes -> Hash Join Output: t4.c2 Hash Cond: (t3.c2 = t4.c1) -> Streaming(type: REDISTRIBUTE) Output: t3.c2 Distribute Key: t3.c2 Spawn on: All datanodes Consumer Nodes: All datanodes -> Seq Scan on public.t3 Output: t3.c2 Distribute Key: t3.c1 -> Hash Output: t4.c2, t4.c1 -> Seq Scan on public.t4 Output: t4.c2, t4.c1 Distribute Key: t4.c1 (19 rows)
You can create a global secondary index to push down the query to a DN. In this case, GSI force hint may be increased in the query to use the global secondary index.
gaussdb=# CREATE GLOBAL INDEX gsi_t3 ON t3(c2) CONTAINING(c3, c4) DISTRIBUTE BY HASH(c2); CREATE GLOBAL INDEX gaussdb=# -- Use GSI force hint to push down all DNs. gaussdb=# -- GSI force hint is not applicable and pushdown cannot be performed. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT t4.c2 FROM t4, (SELECT c2 FROM t3) aa WHERE aa.c2 = t4.c1; QUERY PLAN ------------------------------------------------------------- Streaming (type: GATHER) Output: t4.c2 Node/s: All datanodes -> Hash Join Output: t4.c2 Hash Cond: (t4.c1 = t3.c2) -> Seq Scan on public.t4 Output: t4.c1, t4.c2, t4.c3, t4.c4, t4.c5 Distribute Key: t4.c1 -> Hash Output: t3.c2 -> Index Only Scan using gsi_t3 on public.t3 Output: t3.c2 Distribute Key: t3.c2 (14 rows) gaussdb=# -- Use GSI force hint to push down all DNs. gaussdb=# EXPLAIN (VERBOSE ON, COSTS OFF) SELECT /*+gsi()*/ t4.c2 FROM t4, (SELECT c2 FROM t3) aa WHERE aa.c2 = t4.c1; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Data Node Scan Output: t4.c2 Node/s: All datanodes Remote query: SELECT/*+ gsi()*/ t4.c2 FROM public.t4, (SELECT gsi_t3.c2 FROM ONLY public.gsi_t3) aa WHERE aa.c2 = t4.c1 Remote SQL: SELECT/*+ gsi()*/ t4.c2 FROM public.t4, (SELECT gsi_t3.c2 FROM ONLY public.gsi_t3) aa WHERE aa.c2 = t4.c1 Datanode Name: datanode1 Hash Join Output: t4.c2 Hash Cond: (t4.c1 = gsi_t3.c2) -> Seq Scan on public.t4 Output: t4.c1, t4.c2, t4.c3, t4.c4, t4.c5 -> Hash Output: gsi_t3.c2 -> Index Only Scan using gsi_t3 on public.gsi_t3 Output: gsi_t3.c2 (17 rows)
Global Secondary Index Scanning for Table Access by Index Row ID
- STREAM distributed plan: Create a base table of the global secondary index. During query, the global secondary index needs to be returned to the table. If the predicate hits a large number of rows (for example, range query), the STREAM distributed plan is preferred. Example:
gaussdb=# -- Enable the STEAM mode. gaussdb=# SET ENABLE_STREAM_OPERATOR = ON; SET gaussdb=# SET EXPLAIN_PERF_MODE=PRETTY; SET gaussdb=# CREATE TABLE test(c1 INT, c2 INT, c3 INT); NOTICE: The 'DISTRIBUTE BY' clause is not specified. Using 'c1' as the distribution column by default. HINT: Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column. CREATE TABLE gaussdb=# -- Create a global secondary index on column c2 for the test table. gaussdb=# CREATE GLOBAL INDEX gsi_test ON test(c2); NOTICE: The 'DISTRIBUTE BY' clause is not specified. Using 'c2' as the distribution column by default. HINT: Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column. CREATE GLOBAL INDEX gaussdb=# -- Use the global secondary index for STREAM distributed plan. gaussdb=# EXPLAIN(VERBOSE ON, COSTS OFF) SELECT /*+ gsitable(test gsi_test)*/ * FROM test WHERE c2 > 1; WARNING: Statistics in some tables or columns(public.test.c1, public.test.c2) are not collected. HINT: Do analyze for them in order to generate optimized plan. id | operation ----+---------------------------------------------------------- 1 | -> Streaming (type: GATHER) 2 | -> GSI Tid Scan on public.test 3 | -> Streaming(type: REDISTRIBUTE GSI) 4 | -> GSI Only Scan using gsi_test on public.test (4 rows) Predicate Information (identified by plan id) --------------------------------------------------- 4 --GSI Only Scan using gsi_test on public.test Index Cond: (test.c2 > 1) (2 rows) Targetlist Information (identified by plan id) --------------------------------------------------- 1 --Streaming (type: GATHER) Output: c1, c2, c3 Node/s: All datanodes 2 --GSI Tid Scan on public.test Output: c1, c2, c3 Distribute Key: c1 3 --Streaming(type: REDISTRIBUTE GSI) Output: xc_node_hash, ctid Spawn on: All datanodes Consumer Nodes: All datanodes 4 --GSI Only Scan using gsi_test on public.test Output: xc_node_hash, ctid Distribute Key: c2 (13 rows)
- Non-STREAM distributed plan: Create a base table of the global secondary index. During query, the global secondary index needs to be returned to the table. If the predicate hits a few rows (for example, point query), the non-STREAM distributed plan is preferred. Example:
gaussdb=# -- Disable the STEAM mode. gaussdb=# SET ENABLE_STREAM_OPERATOR = OFF; SET gaussdb=# CREATE TABLE test(c1 INT, c2 INT, c3 INT); NOTICE: The 'DISTRIBUTE BY' clause is not specified. Using 'c1' as the distribution column by default. HINT: Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column. CREATE TABLE gaussdb=# -- Create a global secondary index on column c2 for the test table. gaussdb=# CREATE GLOBAL INDEX gsi_test ON test(c2); NOTICE: The 'DISTRIBUTE BY' clause is not specified. Using 'c2' as the distribution column by default. HINT: Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column. CREATE GLOBAL INDEX gaussdb=# -- Use the global secondary index for non-STREAM distributed plan. gaussdb=# EXPLAIN(VERBOSE ON, COSTS OFF) SELECT /*+ gsitable(test gsi_test)*/ * FROM test WHERE c2 = 1; QUERY PLAN ----------------------------------------------------------------------------------- Select on public.test Output: test.c1, test.c2, test.c3 Node/s: All datanodes Remote query: SELECT c1, c2, c3 FROM ONLY public.test WHERE ctid = $2 -> Data Node Scan on gsi_test "REMOTE_TABLE_QUERY" Output: gsi_test.xc_node_hash, gsi_test.ctid Node/s: datanode1 Remote query: SELECT xc_node_hash, ctid FROM public.gsi_test WHERE c2 = 1 (8 rows)
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot