Using Hybrid Data Warehouses and HStore Tables
Hybrid Data Warehouse Scenarios
As the intelligent data era evolves, the enterprise data ecosystem exhibits three significant characteristics: massive data growth, diversified data types (structured, semi-structured, and unstructured), and increasingly complex scenarios. Hybrid data warehouses emerged to cope with these challenges. Building on large-scale data query and analysis capabilities, they deliver high concurrency, high performance, low latency, and low cost in transaction processing. HStore tables play a key role in the digital transformation of the Internet, IoT, and traditional industries. Typical application scenarios are as follows:
- Intelligent user behavior analysis: Collects web page browsing logs in real time to construct full-lifecycle user profiles and supports multi-dimensional behavior path analysis. Thanks to the OLAP capabilities of hybrid data warehouses, core indicators such as user retention rate and conversion funnel can be calculated in seconds, facilitating refined operational decision-making.
- Real-time risk control center: In Internet finance and e-commerce transaction scenarios, a risk feature calculation engine processes real-time data in milliseconds. By associating multi-source user behavior data, it dynamically identifies abnormal patterns and intercepts fraudulent transactions within hundreds of milliseconds, ensuring service security.
- Industrial IoT and intelligent O&M: In traditional industries such as electric power and manufacturing, hybrid data warehouses can integrate massive device sensor data (including time series data flows such as vibration and temperature) and semi-structured data such as device maintenance logs to build a predictive maintenance model. Real-time trend analysis is used to dynamically monitor device health status and predict faults, transforming traditional passive O&M into intelligent preventive O&M.
Hybrid data warehouses support two efficient data import methods: the direct method and the buffer method.
Import Method | Import Format | How to Import Data | Characteristics | Scenario
---|---|---|---|---
Direct | SQL | Parses Change Data Capture (CDC) data into INSERT, DELETE, and UPDATE operations and transfers them to DWS. | |
Buffer | Micro-batch data | Converts a large number of small transactions into micro-batch data in buffer mode. Batch import achieves high performance, and data can be synchronized in a short period of time. | |
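With the direct method, a CDC tool simply replays each captured change as an SQL statement on DWS. A minimal illustration follows; the ods.orders table and its columns are hypothetical:

```sql
-- Each captured change event becomes one SQL statement executed directly on DWS.
INSERT INTO ods.orders (order_id, status, updated_at) VALUES (1001, 'NEW', now());
UPDATE ods.orders SET status = 'PAID', updated_at = now() WHERE order_id = 1001;
DELETE FROM ods.orders WHERE order_id = 1001;
```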
Storage Mechanism for Ordinary Column-Store Tables
In DWS, column-store tables use the compression unit (CU) as the smallest storage unit. By default, each CU stores 60,000 rows per column and operates in append-only write mode: UPDATE and DELETE operations do not directly modify the original CU, and once a CU is written, its data cannot be altered. As a result, every insertion generates a new complete CU, regardless of how few rows it contains.
- DELETE operations: Old data is flagged as obsolete in the dictionary but not released, leading to potential space wastage.
- UPDATE operations: New records are written to a new CU after the old data is marked as deleted.
- Space problem: Frequent UPDATE and DELETE operations may result in tablespace bloat, reducing effective storage utilization.
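A minimal sketch of this behavior, using a hypothetical demo_col table (VACUUM FULL is the standard way to reclaim the flagged space):

```sql
CREATE TABLE demo_col (id int, v int) WITH (orientation=column);
INSERT INTO demo_col VALUES (1, 1);      -- every INSERT appends a new CU, however small
UPDATE demo_col SET v = 2 WHERE id = 1;  -- the old row is marked deleted; the new row lands in a new CU
DELETE FROM demo_col WHERE id = 1;       -- space is flagged as obsolete but not released
VACUUM FULL demo_col;                    -- rewrites the table and reclaims the wasted space
```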
Advantages of HStore Tables
An HStore table uses an additional delta table to effectively balance storage and updates. Its advantages span five dimensions: batch data processing, incremental data processing, storage efficiency, performance, and applicable scenarios.
DWS 9.1.0 optimizes HStore tables for better performance; the optimized tables are known as HStore Opt tables, while the original HStore tables are retained for forward compatibility. HStore Opt tables can replace HStore tables in all scenarios except those that require high performance but involve no micro-batch updates.
Suggestions on Using HStore Tables
- Parameter setting
Set the parameters according to Table 5 to improve query performance and storage efficiency for HStore tables:
- Suggestions on importing data to the database (HStore Opt tables are recommended.)
- UPDATE and DELETE operations:
  - Ensure that the execution plan uses index scans.
  - The JDBC batch mode is the most efficient.
- Batch data import (see the MERGE INTO and UPSERT sketch after this list):
  - If more than 1 million records per DN are imported at a time and the data is unique, consider using MERGE INTO.
  - Use UPSERT for common scenarios.
- Suggestions on point queries (HStore Opt tables are recommended.)
- Create a level-2 partition on columns with evenly distributed distinct values and frequent equivalent filter criteria. Avoid level-2 partitions on columns with skewed or few distinct values.
- When dealing with fixed filter criteria columns (excluding level-2 partitions), use the CBTREE index (up to 5 columns).
- When dealing with variable filter criteria columns (excluding level-2 partitions), use the GIN index (up to 5 columns).
- For all string columns involving equivalent filtering, use the BITMAP index during table creation. The number of columns is not limited, but cannot be modified later.
- Specify columns that can be filtered by time range as the partition columns.
- If a point query returns over 100,000 records per DN, index scanning may not outperform sequential scanning. Use the GUC parameter enable_seqscan to compare the two (see the sketch after this list).
- Index-related precautions
- Indexes consume additional storage, so create them only when they clearly improve performance.
  - Create indexes on the columns used by UPSERT operations.
  - Create indexes for point queries that have unique or near-unique filter requirements.
- MERGE-related precautions
- Data import speed control:
  - The data import speed cannot exceed the MERGE processing capability.
  - Controlling the concurrency of data import prevents the delta table from expanding.
- Tablespace reuse:
  - Delta tablespace reuse is affected by oldestXmin.
  - Long-running transactions may delay tablespace reuse and cause expansion.
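The following sketch illustrates the import and point-query suggestions above. The table and column names (target_demo, stage_demo, id, val) are hypothetical, and the INSERT ... ON CONFLICT form of UPSERT assumes a cluster version that supports this syntax:

```sql
-- Bulk import with deduplication (over 1 million unique records per DN): MERGE INTO.
MERGE INTO public.target_demo t
USING public.stage_demo s
  ON (t.id = s.id)
WHEN MATCHED THEN
  UPDATE SET t.val = s.val
WHEN NOT MATCHED THEN
  INSERT VALUES (s.id, s.val);

-- Common import scenarios: UPSERT (requires a unique index on the conflict column).
INSERT INTO public.target_demo
SELECT * FROM public.stage_demo
ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;

-- Point queries returning many rows per DN: compare index and sequential scans.
SET enable_seqscan = off;   -- favor index scan
EXPLAIN ANALYZE SELECT val FROM public.target_demo WHERE id = 42;
SET enable_seqscan = on;    -- allow sequential scan
EXPLAIN ANALYZE SELECT val FROM public.target_demo WHERE id = 42;
```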
Hybrid Data Warehouse Flavors
To use hybrid data warehouse capabilities, choose the storage-compute coupled architecture when you create a cluster on the console and ensure the vCPU to memory ratio is 1:4 when setting up cloud disk flavors. For details about the hybrid data warehouse flavors and corresponding service scenarios, see Table 3.
Flavor | CPU Architecture | vCPU | Memory (GB) | Storage Capacity Per Node | Step (GB) | Number of DNs | Scenario
---|---|---|---|---|---|---|---
dwsx2.h.xlarge.4.c7 | x86 | 4 | 16 | 20 GB–2,000 GB | 20 | 1 | Suitable for DWS starters. These flavors can be used for testing, learning environments, or small-scale analytics systems.
dwsk2.h.xlarge.4.kc1 | Arm | 4 | 16 | 20 GB–2,000 GB | 20 | 1 |
dwsk2.h.xlarge.kc2 | Arm | 4 | 16 | 20 GB–2,000 GB | 20 | 1 |
dwsx2.h.xlarge.4.c7n | x86 | 4 | 16 | 20 GB–2,000 GB | 20 | 1 |
dwsx2.h.2xlarge.4.c6 | x86 | 8 | 32 | 100 GB–4,000 GB | 100 | 1 | Suitable for internal data warehousing and report analysis in small- and medium-sized enterprises (SMEs).
dwsx2.h.2xlarge.4.c7 | x86 | 8 | 32 | 100 GB–4,000 GB | 100 | 1 |
dwsk2.h.2xlarge.4.kc1 | Arm | 8 | 32 | 100 GB–4,000 GB | 100 | 1 |
dwsk2.h.2xlarge.kc2 | Arm | 8 | 32 | 100 GB–4,000 GB | 100 | 1 |
dwsx2.h.2xlarge.4.c7n | x86 | 8 | 32 | 100 GB–4,000 GB | 100 | 1 |
dwsx2.h.4xlarge.4.c7 | x86 | 16 | 64 | 100 GB–8,000 GB | 100 | 1 | Recommended for the production environment. These flavors are applicable to OLAP systems that have to deal with large data volumes, BI reports, and data visualizations on large screens for most companies.
dwsk2.h.4xlarge.4.kc1 | Arm | 16 | 64 | 100 GB–8,000 GB | 100 | 1 |
dwsk2.h.4xlarge.kc2 | Arm | 16 | 64 | 100 GB–8,000 GB | 100 | 1 |
dwsx2.h.4xlarge.4.c7n | x86 | 16 | 64 | 100 GB–8,000 GB | 100 | 1 |
dwsx2.h.8xlarge.4.c7 | x86 | 32 | 128 | 100 GB–16,000 GB | 100 | 2 |
dwsk2.h.8xlarge.4.kc1 | Arm | 32 | 128 | 100 GB–16,000 GB | 100 | 2 |
dwsk2.h.8xlarge.kc2 | Arm | 32 | 128 | 100 GB–16,000 GB | 100 | 2 |
dwsx2.h.8xlarge.4.c7n | x86 | 32 | 128 | 100 GB–16,000 GB | 100 | 2 |
dwsk2.h.12xlarge.4.kc1 | Arm | 48 | 192 | 100 GB–24,000 GB | 100 | 4 | These flavors deliver excellent performance and are applicable to high-throughput data warehouse processing and high-concurrency online queries.
dwsk2.h.12xlarge.kc2 | Arm | 48 | 192 | 100 GB–24,000 GB | 100 | 4 |
dwsx2.h.16xlarge.4.c7 | x86 | 64 | 256 | 100 GB–32,000 GB | 100 | 4 |
dwsx2.h.16xlarge.4.c7n | x86 | 64 | 256 | 100 GB–32,000 GB | 100 | 4 |
dwsk2.h.16xlarge | Arm | 64 | 256 | 100 GB–32,000 GB | 100 | 4 |
dwsk2.h.24xlarge | Arm | 96 | 384 | 100 GB–48,000 GB | 100 | 4 |
dwsk2.h.32xlarge | Arm | 128 | 512 | 100 GB–64,000 GB | 100 | 4 |

A row with an empty Scenario cell applies to the same scenario as the nearest row above it.
Optimal GUC Parameters for Hybrid Data Warehouses
After creating a hybrid data warehouse, configure the GUC parameters as recommended in Table 4.
Parameter | Description | Type | Value Range | Recommended Value
---|---|---|---|---
enable_codegen | Specifies whether to enable code optimization. Currently, LLVM optimization is used. | USERSET | Boolean: on or off | off
enable_numa_bind | Specifies whether to enable NUMA binding. This parameter is available only for clusters of version 9.1.0.100 or later. | SIGHUP | Boolean: on or off | Set the value to on for DNs and off for CNs.
abnormal_check_general_task | Specifies the interval at which the CM Agent periodically clears idle CN connections. | CM parameter | A non-negative integer, in seconds. The default value is 60. | 3600
- Change the value of enable_codegen to off to reduce the memory overhead incurred when dynamic execution code is generated for short queries.
- On the DWS console, choose Dedicated Clusters > Clusters.
- In the cluster list, find the target cluster and click the cluster name to go to the cluster details page.
- Click the Parameter Modifications tab, search for enable_codegen, change the value to off, and click Save.
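Because enable_codegen is a USERSET parameter, you can also check and override it in an SQL session, for example:

```sql
SHOW enable_codegen;       -- check the current value
SET enable_codegen = off;  -- session-level override
```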
- Set enable_numa_bind to on for DNs and off for CNs. NUMA process binding reduces the overhead of cross-NUMA access.
Contact technical support to change the value of enable_numa_bind.
- Change the value of abnormal_check_general_task from the default 60 to 3600 to reduce the overhead of repeatedly re-establishing connections.
The default clearance interval of 60 seconds can significantly impact millisecond-level service performance: re-creating a single thread costs approximately 300 ms, so you are advised to increase the interval in scenarios sensitive to millisecond-level latency. Note that with a longer interval, idle connections are cleared more slowly, which can result in higher memory usage.
Contact technical support to change the value of abnormal_check_general_task.
Optimal Parameter Settings for Creating HStore Opt Tables in Hybrid Data Warehouses
When using hybrid data warehouses, you are advised to use HStore Opt tables. Before using such tables, set the following parameters by referring to Table 5.
Mandatory | Parameter | Description | Recommended Value | Take Effect Upon Restart
---|---|---|---|---
Yes | autovacuum | Specifies whether to enable autovacuum. The default value is on. Contact technical support to change the value of this parameter. | on | No
Yes | autovacuum_max_workers | Specifies the maximum number of concurrent autovacuum threads. The value 0 indicates that autovacuum is disabled. | 6 | No
Yes | autovacuum_max_workers_hstore | Specifies the number of autovacuum worker threads for HStore tables. The value of autovacuum_max_workers must be greater than that of autovacuum_max_workers_hstore. You are advised to set autovacuum_max_workers to 6 and autovacuum_max_workers_hstore to 3. | 3 | No
Yes | enable_col_index_vacuum | Specifies whether to enable index clearing to prevent index expansion and performance deterioration after data is updated and stored to the database. This parameter is supported only by clusters of 8.2.1.100 or later. Contact technical support to change the value of this parameter. | on | No
No | autovacuum_naptime | Specifies the minimum delay between autovacuum runs on any given database. | 20s | No
No | colvacuum_threshold_scale_factor | Specifies the minimum percentage of dead tuples that triggers vacuum rewriting in column-store tables. A file is rewritten only when the ratio of dead tuples to (all_tuple - null_tuple) in the file is greater than the value of this parameter. Contact technical support to change the value of this parameter. | 70 | No
No | col_min_file_size | Specifies the minimum size of a file required to trigger a cleanup process. For example, if this parameter is set to 128 MB, a file can be cleared once it exceeds 128 MB. By default, a file can be cleared only when it exceeds 1 GB. You are advised to use this parameter in scenarios where updates or rollbacks are frequent. | 1 GB | No
No | autovacuum_compaction_rows_limit | Controls the background merging of small CUs and clearance of 0 CUs. The value 0 indicates that only 0 CUs are cleared and small CUs are not processed. | 2500 | No
No | autovacuum_compaction_time_limit | Specifies the interval, in minutes, at which background 0 CU clearance is triggered. Contact technical support to change the value of this parameter. | 1 | No
No | autovacuum_merge_cu_limit | Specifies the number of CUs to be automatically merged in a transaction. The value 0 indicates that all CUs to be merged are processed in one transaction. Contact technical support to change the value of this parameter. | 10 | No
- Configure the GUC parameters:
- On the DWS console, choose Dedicated Clusters > Clusters.
- In the cluster list, find the target cluster and click the cluster name to go to the cluster details page.
- Modify mandatory parameters: Click the Parameter Modifications tab, search for autovacuum_max_workers and autovacuum_max_workers_hstore, set them to the recommended values by referring to Table 5, and click Save.
You are advised to use the default values of autovacuum and enable_col_index_vacuum, and do not need to set them separately.
- Modify optional parameters: Search for autovacuum_naptime and cost_model_version, set them to the recommended values by referring to Table 5, and click Save.
For other optional parameters, contact technical support.
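After saving the changes, you can verify the effective values from any SQL session, for example:

```sql
SHOW autovacuum;
SHOW autovacuum_max_workers;
SHOW autovacuum_max_workers_hstore;
SHOW autovacuum_naptime;
```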
- Use the SQL editor to connect to the DWS cluster and create an HStore Opt table. The following is an example of creating such a table:
Select a proper distribution key and partition key based on the data characteristics. For details, see DWS Development Design Proposal.
```sql
CREATE TABLE public.hstore_opt_table_demo (
    t_code     character varying(20),
    t_gisid    character varying(800),
    t_datatime timestamp(6) without time zone,
    t_gmid     character varying(64)
)
WITH (orientation=column, enable_hstore_opt=on)  -- This configuration is used by default when a table is created.
DISTRIBUTE BY hash (t_gmid)                      -- Distribution key, which can be a primary key or an associated column.
PARTITION BY range (t_datatime)                  -- Partition key
(
    PARTITION p2024_1 START ('2024-01-01') END ('2024-06-01') EVERY (interval '1 month'),
    PARTITION p2024_7 START ('2024-06-01') END ('2024-12-31') EVERY (interval '1 month')
);
```
- Adjust partitions when abnormal data needs to be processed. For example, add a MAXVALUE partition to catch out-of-range data:
```sql
ALTER TABLE public.hstore_opt_table_demo ADD PARTITION pmax VALUES LESS THAN (maxvalue);
```
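To confirm the resulting partition layout, you can query the pg_partition catalog; a sketch assuming the standard GaussDB(DWS) catalog columns:

```sql
-- parttype = 'p' marks the individual partitions of the table (assumed per catalog conventions).
SELECT relname, boundaries
FROM pg_partition
WHERE parentid = 'public.hstore_opt_table_demo'::regclass
  AND parttype = 'p';
```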
- For details about other usage suggestions and performance comparison of HStore Opt tables, see Performance Comparison Between HStore Tables and Ordinary Row- and Column-Store Tables.
Performance Comparison Between HStore Tables and Ordinary Row- and Column-Store Tables
- Concurrent Update
After a batch of data is inserted into an ordinary column-store table, two sessions are started. In session 1, one row is deleted and the transaction is left uncommitted.
```
CREATE TABLE col(a int, b int) WITH (orientation=column);
CREATE TABLE
INSERT INTO col SELECT * FROM data;
INSERT 0 100
BEGIN;
BEGIN
DELETE col WHERE a = 1;
DELETE 1
```
When session 2 attempts to delete another row, it becomes evident that session 2 can proceed only after session 1 commits. This scenario demonstrates the CU lock issue in column storage.

```
BEGIN;
BEGIN
DELETE col WHERE a = 2;
```
Repeat the previous operations using an HStore table (a setup sketch for the hs table follows this example). Session 2 executes successfully without any lock wait.

```
BEGIN;
BEGIN
DELETE hs WHERE a = 2;
DELETE 1
```
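The hs table used in this comparison is assumed to be an HStore table with the same schema and data as col; a minimal setup sketch:

```sql
-- enable_hstore=on creates an HStore table instead of an ordinary column-store table.
CREATE TABLE hs (a int, b int) WITH (orientation=column, enable_hstore=on);
INSERT INTO hs SELECT * FROM col;
```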
- Compression efficiency
Create a data table with 3 million data records.
```
CREATE TABLE data(a int, b bigint, c varchar(10), d varchar(10));
CREATE TABLE
INSERT INTO data VALUES (generate_series(1,100), 1, 'asdfasdf', 'gergqer');
INSERT 0 100
INSERT INTO data SELECT * FROM data;
INSERT 0 100
INSERT INTO data SELECT * FROM data;
INSERT 0 200
-- Insert data cyclically until the data volume reaches 3 million.
INSERT INTO data SELECT * FROM data;
INSERT 0 1638400
SELECT count(*) FROM data;
  count
---------
 3276800
(1 row)
```
Import the data to a row-store table in batches and check the table size, which is 223 MB.
```
CREATE TABLE row (LIKE data INCLUDING ALL);
CREATE TABLE
INSERT INTO row SELECT * FROM data;
INSERT 0 3276800
SELECT pg_size_pretty(pg_relation_size('row'));
 pg_size_pretty
----------------
 223 MB
(1 row)
```
Import the same data to an HStore Opt table in batches and check the table size, which is about 3.5 MB.
```
CREATE TABLE hs(a int, b bigint, c varchar(10), d varchar(10)) WITH (orientation=column, enable_hstore_opt=on);
CREATE TABLE
INSERT INTO hs SELECT * FROM data;
INSERT 0 3276800
SELECT pg_size_pretty(pg_relation_size('hs'));
 pg_size_pretty
----------------
 3568 KB
(1 row)
```
HStore tables compress well here because the table structure is simple and the data is highly repetitive. Their compression ratio is usually three to five times that of row-store tables.
- Batch query performance
It takes approximately four seconds to query the fourth column in the row-store table.
```
EXPLAIN ANALYZE SELECT d FROM data;
 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 id |          operation           |        A-time        | A-rows  | E-rows  | Peak Memory  | E-memory | A-width | E-width | E-costs
 ----+------------------------------+----------------------+---------+---------+--------------+----------+---------+---------+----------
   1 | -> Streaming (type: GATHER)  | 4337.881             | 3276800 | 3276800 | 32KB         |          |         | 8       | 61891.00
   2 | -> Seq Scan on data          | [1571.995, 1571.995] | 3276800 | 3276800 | [32KB, 32KB] | 1MB      |         | 8       | 61266.00
```
It takes about 300 milliseconds to query the fourth column in the HStore Opt table.

```
EXPLAIN ANALYZE SELECT d FROM hs;
 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 id |               operation               |       A-time       | A-rows  | E-rows  |  Peak Memory   | E-memory | A-width | E-width | E-costs
 ----+---------------------------------------+--------------------+---------+---------+----------------+----------+---------+---------+----------
   1 | -> Row Adapter                        | 335.280            | 3276800 | 3276800 | 24KB           |          |         | 8       | 15561.80
   2 | -> Vector Streaming (type: GATHER)    | 111.492            | 3276800 | 3276800 | 96KB           |          |         | 8       | 15561.80
   3 | -> CStore Scan on hs                  | [111.116, 111.116] | 3276800 | 3276800 | [254KB, 254KB] | 1MB      |         | 8       | 14936.80
```
Column-store tables and HStore tables outperform row-store tables in batch queries.
Changing Ordinary Row- and Column-Store Tables to HStore Tables
When an ordinary column-store delta table is used, MERGE operations may not be performed promptly, because the delta table lacks a scheduled merge mechanism and merging competes for disk bandwidth. This can lead to delta table expansion and a decline in query performance and concurrent update efficiency.
In comparison, HStore tables feature concurrent data import, high-performance queries, and an asynchronous merge mechanism, so they can replace column-store delta tables.
Here is a guide for changing ordinary column-store and row-store tables to HStore tables:
- Check the delta table list and replace nspname with the target namespace.
```sql
SELECT n.nspname || '.' || c.relname AS tbname, reloptions::text AS op
FROM pg_class c, pg_namespace n
WHERE c.relnamespace = n.oid
  AND c.relkind = 'r'
  AND c.oid > 16384
  AND n.nspname = 'public'
  AND (reloptions::text LIKE '%enable_delta=on%' OR reloptions::text LIKE '%enable_delta=true%')
  AND reloptions::text NOT LIKE '%enable_hstore_opt%';
```
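The query returns one row per delta table; hypothetical output for a single table public.t1:

```
  tbname   |                  op
-----------+--------------------------------------
 public.t1 | {orientation=column,enable_delta=on}
(1 row)
```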
Steps 2 and 3 only generate statements. After running them, execute the generated table creation statements from step 2 and then the generated data import statements from step 3, in sequence.
- Generate the table creation statement containing the enable_hstore_opt parameter.
```sql
SELECT 'create table if not exists ' || tbname || '_opt (like ' || tbname || ' INCLUDING all EXCLUDING reloptions) with(orientation=column, enable_hstore_opt=on);'
FROM (SELECT n.nspname || '.' || c.relname AS tbname, reloptions::text AS op
      FROM pg_class c, pg_namespace n
      WHERE c.relnamespace = n.oid
        AND c.relkind = 'r'
        AND c.oid > 16384
        AND n.nspname = 'public'
        AND (reloptions::text LIKE '%enable_delta=on%' OR reloptions::text LIKE '%enable_delta=true%')
        AND reloptions::text NOT LIKE '%enable_hstore_opt%') AS t;
```
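For the hypothetical delta table public.t1, this step would generate the following DDL:

```sql
create table if not exists public.t1_opt (like public.t1 INCLUDING all EXCLUDING reloptions) with(orientation=column, enable_hstore_opt=on);
```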
- Generate the data import statements. The generated script imports the data into the new table, renames the tables, and drops the old table.
```sql
-- Note: RENAME TO takes an unqualified name, so relname (without the schema) is used for the new names.
SELECT 'start transaction;
lock table ' || tbname || ' in EXCLUSIVE mode;
insert into ' || tbname || '_opt select * from ' || tbname || ';
alter table ' || tbname || ' rename to ' || relname || '_bk;
alter table ' || tbname || '_opt rename to ' || relname || ';
commit;
drop table ' || tbname || '_bk;'
FROM (SELECT n.nspname || '.' || c.relname AS tbname, c.relname AS relname, reloptions::text AS op
      FROM pg_class c, pg_namespace n
      WHERE c.relnamespace = n.oid
        AND c.relkind = 'r'
        AND c.oid > 16384
        AND n.nspname = 'public'
        AND (reloptions::text LIKE '%enable_delta=on%' OR reloptions::text LIKE '%enable_delta=true%')
        AND reloptions::text NOT LIKE '%enable_hstore_opt%') AS t;
```
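For the same hypothetical table public.t1, the generated migration script looks as follows; run it only after the matching CREATE TABLE statement from step 2:

```sql
start transaction;
lock table public.t1 in EXCLUSIVE mode;
insert into public.t1_opt select * from public.t1;
alter table public.t1 rename to t1_bk;
alter table public.t1_opt rename to t1;
commit;
drop table public.t1_bk;
```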