Best Practices for Using HStore Tables

Storage Mechanism for Column-Store Tables

In DWS, column-store tables use a compression unit (CU) as the smallest storage unit. By default, each CU stores 60,000 rows per column and operates in append write mode. This means that update and deletion operations do not directly modify the original CU. Once a CU is created, its data cannot be altered, leading to the generation of a new complete CU whenever data is inserted, regardless of the quantity.

This approach can lead to several issues:

Deletion: Old data is flagged as obsolete in the dictionary but not released, leading to potential space wastage.
Update operation: New records are written to a new CU after old data is marked as deleted.
Space problem: Frequent updates and deletions may result in tablespace bloat, reducing effective storage utilization.

Advantages of HStore Tables

The HStore table uses an additional delta table to effectively balance storage and updates. Here are the key points:

Batch data processing:
- Insert data in batches and write directly to the CU.
- Maintain the compression efficiency similar to traditional column-based storage.
Incremental data processing:
- Serialize and compress updated column data and perform small-batch insertions.
- Periodically execute the MERGE operation in the background to merge data into the primary table CU.
Storage efficiency:
- Minimize the disk space usage.
- Retain the high compression ratio in column-store format.
Performance:
- Support high-concurrency update operations.
- Offer excellent query response speed.
Scenarios:
- Real-time data import and queries
- Traditional TP transaction processing
- High-concurrency loads involving both updates and queries

To enhance performance, DWS 9.1.0 retains the old HStore table for forward compatibility. The optimized tables are known as HStore Opt tables. HStore tables can be replaced by HStore_opt tables for better performance, except in scenarios requiring high performance without micro-batch updates.

Usage Suggestions

Parameter setting

Set the parameters as follows to ensure good query performance and storage efficiency for the HStore table:

       
            autovacuum_max_workers_hstore=3  # Number of MERGE threads dedicated to HStore
autovacuum_max_workers=6         # Total number of autovacuum threads in the system
autovacuum=true                  # Enable automatic cleanup.
enable_col_index_vacuum=on       # Enable column index cleanup.

Suggestions on importing data to the database (The HStore_opt table is recommended.)
1. Update operations:
  Use the UPSERT mode instead of the UPDATE mode.
2. Deletion operations:
  - Ensure that the execution plan is scanned by index.
  - The JDBC batch mode is the most efficient.
3. Batch data import:
  - If the amount of data to be imported at once exceeds 1 million records and the data is unique, consider using MERGE INTO.
  - UPSERT is recommended for common scenarios.
Point query suggestion (The HStore_opt table is recommended.)
1. Create a level-2 partition on columns with evenly distributed distinct values and frequent equivalent filter criteria. Avoid level-2 partitions on columns with skewed or few distinct values.
2. When dealing with fixed filter criteria columns (excluding level-2 partitions), use the cbtree index (up to 5 columns).
3. When dealing with variable filter criteria columns (excluding level-2 partitions), use the GIN index (up to 5 columns).
4. For all string columns involving equivalent filtering, bitmap indexes can be specified during table creation. The number of columns is not limited, but cannot be modified later.
5. Specify columns that can be filtered by time range as the partition columns.
6. If point queries return over 100,000 records per DN, index scanning may outperform non-index scanning. Use the GUC parameter enable_seqscan to compare performance.
Index-related
1. Indexes consume additional storage space.
2. Create indexes to enhance performance when needed.
3. Indexes are used when UPSERT operations need to be performed.
4. Indexes are used for point queries with unique or near-unique requirements.

MERGE-related
1. Data import speed control:
  - The import speed cannot exceed the MERGE processing capability.
  - Control the concurrency of importing data to the database to prevent the delta table from being expanded.
2. Spatial multiplexing:
  - Spatial multiplexing impacts delta tablespace reuse through oldestXmin.
  - Prolonged transactions may cause delays and expansions.

Performance Comparison

Concurrent Update

Once a batch of data is inserted into a column-store table, two sessions are initiated. In session 1, a piece of data is deleted, and the transaction is not terminated.

        
              CREATE TABLE col(a int , b int)with(orientation=column);
 CREATE TABLE

 INSERT INTO col select * from data;
 INSERT 0 100

 BEGIN;
 BEGIN

 DELETE col where a = 1;
 DELETE 1

When session 2 attempts to delete more data, it becomes evident that session 2 can only proceed after session 1 is committed. This scenario imitates the CU lock issue in column storage.

        
              BEGIN;
 BEGIN
 DELETE col where a = 2;

Repeat the previous experiment using the HStore table. Session 2 can be executed successfully without any lock wait.

        
              BEGIN;
 BEGIN
 DELETE hs where a = 2;
 DELETE 1

Compression efficiency

Create a data table with 3 million data records.

        
         
           
           
              CREATE TABLE data( a int, b bigint, c varchar(10), d varchar(10));
 ​
 CREATE TABLE
 INSERT INTO data values(generate_series(1,100),1,'asdfasdf','gergqer');
 INSERT 0 100
 INSERT INTO data select * from data;
 INSERT 0 100
 INSERT INTO data select * from data;
 INSERT 0 200
 ​
 ---Insert data cyclically until the data volume reaches 3 million.
 ​
 INSERT INTO data select * from data;
 INSERT 0 1638400
 select count(*) from data;
   count
 ---------
  3276800
 (1 row)

            

          

        
       

Import data to a row-store table in batches and check whether the size is 223 MB.

       
             CREATE TABLE row (like data including all);
 CREATE TABLE
 INSERT INTO row select * from data;
 INSERT 0 3276800
 select pg_size_pretty(pg_relation_size('row'));
  pg_size_pretty
 ----------------
  223 MB
 (1 row)

Import data to the HStore Opt table in batches and check whether the size is 3.5 MB.

       
             CREATE TABLE hs(a int, b bigint, c varchar(10),d varchar(10))with(orientation= column, enable_hstore_opt=on);
 CREATE TABLE
 INSERT INTO hs select * from data;
 INSERT 0 3276800
 select pg_size_pretty(pg_relation_size('hs'));
  pg_size_pretty
 ----------------
  3568 KB
 (1 row)

HStore tables have a good compression effect because of their simple table structure and duplicate data. They are usually compressed three to five times more than row-store tables.

Batch query performance

It takes approximately four seconds to query the fourth column of the row-store table using the created table.

        
              explain analyze select d from data;
 explain analyze                                                               QUERY PLAN
 -----------------------------------------------------------------------------------------------------------------------------------------
   id |          operation           |        A-time        | A-rows  | E-rows  | Peak Memory  | E-memory | A-width | E-width | E-costs
  ----+------------------------------+----------------------+---------+---------+--------------+----------+---------+---------+----------
    1 | ->  Streaming (type: GATHER) | 4337.881             | 3276800 | 3276800 | 32KB         |          |         |       8 | 61891.00
    2 |    ->  Seq Scan on data      | [1571.995, 1571.995] | 3276800 | 3276800 | [32KB, 32KB] | 1MB      |         |       8 | 61266.00

It takes about 300 milliseconds to query the fourth column in the HStore Opt table.

        
              explain analyze select d from hs;
                                                                     QUERY PLAN
 ---------------------------------------------------------------------------------------------------------------------------------------------------
   id |               operation                |       A-time       | A-rows  | E-rows  |  Peak Memory   | E-memory | A-width | E-width | E-costs
  ----+----------------------------------------+--------------------+---------+---------+----------------+----------+---------+---------+----------
    1 | ->  Row Adapter                        | 335.280            | 3276800 | 3276800 | 24KB           |          |         |       8 | 15561.80
    2 |    ->  Vector Streaming (type: GATHER) | 111.492            | 3276800 | 3276800 | 96KB           |          |         |       8 | 15561.80
    3 |       ->  CStore Scan on hs            | [111.116, 111.116] | 3276800 | 3276800 | [254KB, 254KB] | 1MB      |         |       8 | 14936.80

Only the batch query scenario has been tested, and in this scenario, the stored tables and HStore tables perform better than row-store tables in terms of query performance.

Rectifying HStore Issues

When using a column-store delta table, merging may not occur promptly due to the absence of a scheduled merge mechanism for disk bandwidth and the column-store delta table. This can lead to delta table expansion and a decline in query performance and concurrent update efficiency.

In comparison to column-store delta tables, HStore is designed for concurrent data import, high-performance queries, and features an asynchronous merge mechanism. It can effectively replace column-store delta tables.

Here is a guide for transitioning common column-store and row-store tables to HStore:

Check the delta table list, replacing nspname with the target namespace.

       
            select n.nspname||'.'||c.relname as tbname,reloptions::text as op from pg_class c,pg_namespace n where c.relnamespace = n.oid and c.relkind = 'r' and c.oid > 16384 and n.nspname ='public' and (op like '%enable_delta=on%' or op like '%enable_delta=true%') and op not like '%enable_hstore_opt%';

Execute steps 2 and 3, then sequentially run the table creation statement from step 2 and the derivative statement from step 3.

Generate the table creation statement for enable_hstore_opt.

       
            select 'create table if not exists '|| tbname ||'_opt (like '|| tbname ||' INCLUDING all EXCLUDING reloptions) with(orientation=column,enable_hstore_opt=on);' from(
select n.nspname||'.'||c.relname as tbname,reloptions::text as op from pg_class c,pg_namespace n where c.relnamespace = n.oid and c.relkind = 'r' and c.oid > 16384 and n.nspname ='public' and (op like '%enable_delta=on%' or op like '%enable_delta=true%') and op not like '%enable_hstore_opt%');

Generate data. Replace the table name and delete the statement of the old table.

       
            select 'start transaction;
lock table '|| tbname ||' in EXCLUSIVE mode;
insert into '|| tbname ||'_opt select * from '|| tbname ||';
alter table '|| tbname ||' rename to '|| tbname ||'_bk;
alter table '|| tbname ||'_opt rename to '|| tbname ||';
commit;
drop table '|| tbname ||'_bk;'
from(select n.nspname||'.'||c.relname as tbname,reloptions::text as op from pg_class c,pg_namespace n where c.relnamespace = n.oid and c.relkind = 'r' and c.oid > 16384 and n.nspname ='public' and (op like '%enable_delta=on%' or op like '%enable_delta=true%')  and op not like '%enable_hstore_opt%');