Using Hybrid Data Warehouses and HStore Tables
Hybrid Data Warehouse Scenarios
As the intelligent data era evolves, the enterprise data ecosystem exhibits three significant characteristics: massive data growth, diversified data types (structured, semi-structured, and unstructured), and increasingly complex scenarios. Hybrid data warehouses emerged to cope with these challenges. Building on large-scale data query and analysis capabilities, they deliver high concurrency, high performance, low latency, and low cost in transaction processing. HStore tables play a key role in the digital transformation of the Internet, IoT, and traditional industries. Typical application scenarios are as follows:
- Intelligent user behavior analysis: Collects web page browsing logs in real time to construct full-lifecycle user profiles and supports multi-dimensional behavior path analysis. Thanks to the OLAP capabilities of hybrid data warehouses, core indicators such as user retention rate and conversion funnel can be calculated in seconds, facilitating refined operational decision-making.
- Real-time risk control center: In Internet finance and e-commerce transaction scenarios, a risk feature calculation engine processes real-time data in milliseconds. By associating multi-source user behavior data, it dynamically identifies abnormal patterns and intercepts fraudulent transactions within hundreds of milliseconds, ensuring service security.
- Industrial IoT and intelligent O&M: In traditional industries such as electric power and manufacturing, hybrid data warehouses can integrate massive device sensor data (including time series data flows such as vibration and temperature) and semi-structured data such as device maintenance logs to build a predictive maintenance model. Real-time trend analysis is used to dynamically monitor device health status and predict faults, transforming traditional passive O&M into intelligent preventive O&M.
Hybrid data warehouses support two efficient data import methods: the direct method and the buffer method.
Import Method | Import Format | How to Import Data | Characteristics | Scenario
---|---|---|---|---
Direct | SQL | Parses Change Data Capture (CDC) data into INSERT, DELETE, and UPDATE operations and transfers them to DWS. | |
Buffer | Micro-batch data | Converts a large number of small transactions into micro-batch data in buffer mode. Batch import achieves high performance, and data can be synchronized in a short period of time. | |
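With the direct method, a CDC tool simply replays each captured change as an SQL statement on DWS. A minimal illustration follows; the ods.orders table and its columns are hypothetical:

```sql
-- Each captured change event becomes one SQL statement executed directly on DWS.
INSERT INTO ods.orders (order_id, status, updated_at) VALUES (1001, 'NEW', now());
UPDATE ods.orders SET status = 'PAID', updated_at = now() WHERE order_id = 1001;
DELETE FROM ods.orders WHERE order_id = 1001;
```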
Storage Mechanism for Ordinary Column-Store Tables
In DWS, column-store tables use the compression unit (CU) as the smallest storage unit. By default, each CU stores 60,000 rows per column and operates in append-only write mode: UPDATE and DELETE operations do not directly modify the original CU, and once a CU is written, its data cannot be altered. As a result, every insertion generates a new complete CU, regardless of how few rows it contains.
- DELETE operations: Old data is flagged as obsolete in the dictionary but not released, leading to potential space wastage.
- UPDATE operations: New records are written to a new CU after the old data is marked as deleted.
- Space problem: Frequent UPDATE and DELETE operations may result in tablespace bloat, reducing effective storage utilization.
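A minimal sketch of this behavior, using a hypothetical demo_col table (VACUUM FULL is the standard way to reclaim the flagged space):

```sql
CREATE TABLE demo_col (id int, v int) WITH (orientation=column);
INSERT INTO demo_col VALUES (1, 1);      -- every INSERT appends a new CU, however small
UPDATE demo_col SET v = 2 WHERE id = 1;  -- the old row is marked deleted; the new row lands in a new CU
DELETE FROM demo_col WHERE id = 1;       -- space is flagged as obsolete but not released
VACUUM FULL demo_col;                    -- rewrites the table and reclaims the wasted space
```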
Advantages of HStore Tables
An HStore table uses an additional delta table to effectively balance storage and updates. Its advantages span five dimensions: batch data processing, incremental data processing, storage efficiency, performance, and applicable scenarios.
DWS 9.1.0 optimizes HStore tables for better performance; the optimized tables are known as HStore Opt tables, while the original HStore tables are retained for forward compatibility. HStore Opt tables can replace HStore tables in all scenarios except those that require high performance but involve no micro-batch updates.
Suggestions on Using HStore Tables
- Parameter setting
Set the parameters according to Table 5 to improve query performance and storage efficiency for HStore tables:
- Suggestions on importing data to the database (HStore Opt tables are recommended.)
- UPDATE and DELETE operations:
  - Ensure that the execution plan uses index scans.
  - The JDBC batch mode is the most efficient.
- Batch data import (see the MERGE INTO and UPSERT sketch after this list):
  - If more than 1 million records per DN are imported at a time and the data is unique, consider using MERGE INTO.
  - Use UPSERT for common scenarios.
- Suggestions on point queries (HStore Opt tables are recommended.)
- Create a level-2 partition on columns with evenly distributed distinct values and frequent equivalent filter criteria. Avoid level-2 partitions on columns with skewed or few distinct values.
- When dealing with fixed filter criteria columns (excluding level-2 partitions), use the CBTREE index (up to 5 columns).
- When dealing with variable filter criteria columns (excluding level-2 partitions), use the GIN index (up to 5 columns).
- For all string columns involving equivalent filtering, use the BITMAP index during table creation. The number of columns is not limited, but cannot be modified later.
- Specify columns that can be filtered by time range as the partition columns.
- If a point query returns over 100,000 records per DN, index scanning may not outperform sequential scanning. Use the GUC parameter enable_seqscan to compare the two (see the sketch after this list).
- Index-related precautions
- Indexes consume additional storage, so create them only when they clearly improve performance.
  - Create indexes on the columns used by UPSERT operations.
  - Create indexes for point queries that have unique or near-unique filter requirements.
- MERGE-related precautions
- Data import speed control:
  - The data import speed cannot exceed the MERGE processing capability.
  - Controlling the concurrency of data import prevents the delta table from expanding.
- Tablespace reuse:
  - Delta tablespace reuse is affected by oldestXmin.
  - Long-running transactions may delay tablespace reuse and cause expansion.
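The following sketch illustrates the import and point-query suggestions above. The table and column names (target_demo, stage_demo, id, val) are hypothetical, and the INSERT ... ON CONFLICT form of UPSERT assumes a cluster version that supports this syntax:

```sql
-- Bulk import with deduplication (over 1 million unique records per DN): MERGE INTO.
MERGE INTO public.target_demo t
USING public.stage_demo s
  ON (t.id = s.id)
WHEN MATCHED THEN
  UPDATE SET t.val = s.val
WHEN NOT MATCHED THEN
  INSERT VALUES (s.id, s.val);

-- Common import scenarios: UPSERT (requires a unique index on the conflict column).
INSERT INTO public.target_demo
SELECT * FROM public.stage_demo
ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;

-- Point queries returning many rows per DN: compare index and sequential scans.
SET enable_seqscan = off;   -- favor index scan
EXPLAIN ANALYZE SELECT val FROM public.target_demo WHERE id = 42;
SET enable_seqscan = on;    -- allow sequential scan
EXPLAIN ANALYZE SELECT val FROM public.target_demo WHERE id = 42;
```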
Hybrid Data Warehouse Flavors
To use hybrid data warehouse capabilities, choose the storage-compute coupled architecture when you create a cluster on the console and ensure the vCPU to memory ratio is 1:4 when setting up cloud disk flavors. For details about the hybrid data warehouse flavors and corresponding service scenarios, see Table 3.
Flavor | CPU Architecture | vCPU | Memory (GB) | Storage Capacity Per Node | Step (GB) | Number of DNs | Scenario
---|---|---|---|---|---|---|---
dwsx2.h.xlarge.4.c7 | x86 | 4 | 16 | 20 GB–2,000 GB | 20 | 1 | Suitable for DWS starters. These flavors can be used for testing, learning environments, or small-scale analytics systems.
dwsk2.h.xlarge.4.kc1 | Arm | 4 | 16 | 20 GB–2,000 GB | 20 | 1 |
dwsk2.h.xlarge.kc2 | Arm | 4 | 16 | 20 GB–2,000 GB | 20 | 1 |
dwsx2.h.xlarge.4.c7n | x86 | 4 | 16 | 20 GB–2,000 GB | 20 | 1 |
dwsx2.h.2xlarge.4.c6 | x86 | 8 | 32 | 100 GB–4,000 GB | 100 | 1 | Suitable for internal data warehousing and report analysis in small- and medium-sized enterprises (SMEs).
dwsx2.h.2xlarge.4.c7 | x86 | 8 | 32 | 100 GB–4,000 GB | 100 | 1 |
dwsk2.h.2xlarge.4.kc1 | Arm | 8 | 32 | 100 GB–4,000 GB | 100 | 1 |
dwsk2.h.2xlarge.kc2 | Arm | 8 | 32 | 100 GB–4,000 GB | 100 | 1 |
dwsx2.h.2xlarge.4.c7n | x86 | 8 | 32 | 100 GB–4,000 GB | 100 | 1 |
dwsx2.h.4xlarge.4.c7 | x86 | 16 | 64 | 100 GB–8,000 GB | 100 | 1 | Recommended for the production environment. These flavors are applicable to OLAP systems that have to deal with large data volumes, BI reports, and data visualizations on large screens for most companies.
dwsk2.h.4xlarge.4.kc1 | Arm | 16 | 64 | 100 GB–8,000 GB | 100 | 1 |
dwsk2.h.4xlarge.kc2 | Arm | 16 | 64 | 100 GB–8,000 GB | 100 | 1 |
dwsx2.h.4xlarge.4.c7n | x86 | 16 | 64 | 100 GB–8,000 GB | 100 | 1 |
dwsx2.h.8xlarge.4.c7 | x86 | 32 | 128 | 100 GB–16,000 GB | 100 | 2 |
dwsk2.h.8xlarge.4.kc1 | Arm | 32 | 128 | 100 GB–16,000 GB | 100 | 2 |
dwsk2.h.8xlarge.kc2 | Arm | 32 | 128 | 100 GB–16,000 GB | 100 | 2 |
dwsx2.h.8xlarge.4.c7n | x86 | 32 | 128 | 100 GB–16,000 GB | 100 | 2 |
dwsk2.h.12xlarge.4.kc1 | Arm | 48 | 192 | 100 GB–24,000 GB | 100 | 4 | These flavors deliver excellent performance and are applicable to high-throughput data warehouse processing and high-concurrency online queries.
dwsk2.h.12xlarge.kc2 | Arm | 48 | 192 | 100 GB–24,000 GB | 100 | 4 |
dwsx2.h.16xlarge.4.c7 | x86 | 64 | 256 | 100 GB–32,000 GB | 100 | 4 |
dwsx2.h.16xlarge.4.c7n | x86 | 64 | 256 | 100 GB–32,000 GB | 100 | 4 |
dwsk2.h.16xlarge | Arm | 64 | 256 | 100 GB–32,000 GB | 100 | 4 |
dwsk2.h.24xlarge | Arm | 96 | 384 | 100 GB–48,000 GB | 100 | 4 |
dwsk2.h.32xlarge | Arm | 128 | 512 | 100 GB–64,000 GB | 100 | 4 |

A row with an empty Scenario cell applies to the same scenario as the nearest row above it.
Optimal GUC Parameters for Hybrid Data Warehouses
After creating a hybrid data warehouse, configure the GUC parameters as recommended in Table 4.
Parameter | Description | Type | Value Range | Recommended Value
---|---|---|---|---
enable_codegen | Specifies whether to enable code optimization. Currently, LLVM optimization is used. | USERSET | Boolean: on or off | off
enable_numa_bind | Specifies whether to enable NUMA binding. This parameter is available only for clusters of version 9.1.0.100 or later. | SIGHUP | Boolean: on or off | Set the value to on for DNs and off for CNs.
abnormal_check_general_task | Specifies the interval at which the CM Agent periodically clears idle CN connections. | CM parameter | A non-negative integer, in seconds. The default value is 60. | 3600
- Change the value of enable_codegen to off to reduce the memory overhead incurred when dynamic execution code is generated for short queries.
- On the DWS console, choose Dedicated Clusters > Clusters.
- In the cluster list, find the target cluster and click the cluster name to go to the cluster details page.
- Click the Parameter Modifications tab, search for enable_codegen, change the value to off, and click Save.
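Because enable_codegen is a USERSET parameter, you can also check and override it in an SQL session, for example:

```sql
SHOW enable_codegen;       -- check the current value
SET enable_codegen = off;  -- session-level override
```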
- Set enable_numa_bind to on for DNs and off for CNs. NUMA process binding reduces the overhead of cross-NUMA access.
Contact technical support to change the value of enable_numa_bind.
- Change the value of abnormal_check_general_task from the default 60 to 3600 to reduce the overhead of repeatedly re-establishing connections.
The default clearance interval of 60 seconds can significantly impact millisecond-level service performance: re-creating a single thread costs approximately 300 ms, so you are advised to increase the interval in scenarios sensitive to millisecond-level latency. Note that with a longer interval, idle connections are cleared more slowly, which can result in higher memory usage.
Contact technical support to change the value of abnormal_check_general_task.
Optimal Parameter Settings for Creating HStore Opt Tables in Hybrid Data Warehouses
When using hybrid data warehouses, you are advised to use HStore Opt tables. Before using such tables, set the following parameters by referring to Table 5.
Mandatory | Parameter | Description | Recommended Value | Take Effect Upon Restart
---|---|---|---|---
Yes | autovacuum | Specifies whether to enable autovacuum. The default value is on. Contact technical support to change the value of this parameter. | on | No
Yes | autovacuum_max_workers | Specifies the maximum number of concurrent autovacuum threads. The value 0 indicates that autovacuum is disabled. | 6 | No
Yes | autovacuum_max_workers_hstore | Specifies the number of autovacuum worker threads for HStore tables. The value of autovacuum_max_workers must be greater than that of autovacuum_max_workers_hstore. You are advised to set autovacuum_max_workers to 6 and autovacuum_max_workers_hstore to 3. | 3 | No
Yes | enable_col_index_vacuum | Specifies whether to enable index clearing to prevent index expansion and performance deterioration after data is updated and stored to the database. This parameter is supported only by clusters of 8.2.1.100 or later. Contact technical support to change the value of this parameter. | on | No
No | autovacuum_naptime | Specifies the minimum delay between autovacuum runs on any given database. | 20s | No
No | colvacuum_threshold_scale_factor | Specifies the minimum percentage of dead tuples that triggers vacuum rewriting in column-store tables. A file is rewritten only when the ratio of dead tuples to (all_tuple - null_tuple) in the file is greater than the value of this parameter. Contact technical support to change the value of this parameter. | 70 | No
No | col_min_file_size | Specifies the minimum size of a file required to trigger a cleanup process. For example, if this parameter is set to 128 MB, a file can be cleared once it exceeds 128 MB. By default, a file can be cleared only when it exceeds 1 GB. You are advised to use this parameter in scenarios where updates or rollbacks are frequent. | 1 GB | No
No | autovacuum_compaction_rows_limit | Controls the background merging of small CUs and clearance of 0 CUs. The value 0 indicates that only 0 CUs are cleared and small CUs are not processed. | 2500 | No
No | autovacuum_compaction_time_limit | Specifies the interval, in minutes, at which background 0 CU clearance is triggered. Contact technical support to change the value of this parameter. | 1 | No
No | autovacuum_merge_cu_limit | Specifies the number of CUs to be automatically merged in a transaction. The value 0 indicates that all CUs to be merged are processed in one transaction. Contact technical support to change the value of this parameter. | 10 | No
- Configure the GUC parameters:
- On the DWS console, choose Dedicated Clusters > Clusters.
- In the cluster list, find the target cluster and click the cluster name to go to the cluster details page.
- Modify mandatory parameters: Click the Parameter Modifications tab, search for autovacuum_max_workers and autovacuum_max_workers_hstore, set them to the recommended values by referring to Table 5, and click Save.
You are advised to use the default values of autovacuum and enable_col_index_vacuum, and do not need to set them separately.
- Modify optional parameters: Search for autovacuum_naptime and cost_model_version, set them to the recommended values by referring to Table 5, and click Save.
For other optional parameters, contact technical support.
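After saving the changes, you can verify the effective values from any SQL session, for example:

```sql
SHOW autovacuum;
SHOW autovacuum_max_workers;
SHOW autovacuum_max_workers_hstore;
SHOW autovacuum_naptime;
```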
- Use the SQL editor to connect to the DWS cluster and create an HStore Opt table. The following is an example of creating such a table:
Select a proper distribution key and partition key based on the data characteristics. For details, see DWS Development Design Proposal.
```sql
CREATE TABLE public.hstore_opt_table_demo (
    t_code     character varying(20),
    t_gisid    character varying(800),
    t_datatime timestamp(6) without time zone,
    t_gmid     character varying(64)
)
WITH (orientation=column, enable_hstore_opt=on)  -- This configuration is used by default when a table is created.
DISTRIBUTE BY hash (t_gmid)                      -- Distribution key, which can be a primary key or an associated column.
PARTITION BY range (t_datatime)                  -- Partition key
(
    PARTITION p2024_1 START ('2024-01-01') END ('2024-06-01') EVERY (interval '1 month'),
    PARTITION p2024_7 START ('2024-06-01') END ('2024-12-31') EVERY (interval '1 month')
);
```
- Adjust partitions when abnormal data needs to be processed. For example, add a MAXVALUE partition to catch out-of-range data:
```sql
ALTER TABLE public.hstore_opt_table_demo ADD PARTITION pmax VALUES LESS THAN (maxvalue);
```
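To confirm the resulting partition layout, you can query the pg_partition catalog; a sketch assuming the standard GaussDB(DWS) catalog columns:

```sql
-- parttype = 'p' marks the individual partitions of the table (assumed per catalog conventions).
SELECT relname, boundaries
FROM pg_partition
WHERE parentid = 'public.hstore_opt_table_demo'::regclass
  AND parttype = 'p';
```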
- For details about other usage suggestions and performance comparison of HStore Opt tables, see Performance Comparison Between HStore Tables and Ordinary Row- and Column-Store Tables.
Performance Comparison Between HStore Tables and Ordinary Row- and Column-Store Tables
- Concurrent Update
After a batch of data is inserted into an ordinary column-store table, two sessions are started. In session 1, one row is deleted and the transaction is left uncommitted.
```
CREATE TABLE col(a int, b int) WITH (orientation=column);
CREATE TABLE
INSERT INTO col SELECT * FROM data;
INSERT 0 100
BEGIN;
BEGIN
DELETE col WHERE a = 1;
DELETE 1
```
When session 2 attempts to delete another row, it becomes evident that session 2 can proceed only after session 1 commits. This scenario demonstrates the CU lock issue in column storage.

```
BEGIN;
BEGIN
DELETE col WHERE a = 2;
```
Repeat the previous operations using an HStore table (a setup sketch for the hs table follows this example). Session 2 executes successfully without any lock wait.

```
BEGIN;
BEGIN
DELETE hs WHERE a = 2;
DELETE 1
```
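The hs table used in this comparison is assumed to be an HStore table with the same schema and data as col; a minimal setup sketch:

```sql
-- enable_hstore=on creates an HStore table instead of an ordinary column-store table.
CREATE TABLE hs (a int, b int) WITH (orientation=column, enable_hstore=on);
INSERT INTO hs SELECT * FROM col;
```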
- Compression efficiency
Create a data table with 3 million data records.
```
CREATE TABLE data(a int, b bigint, c varchar(10), d varchar(10));
CREATE TABLE
INSERT INTO data VALUES (generate_series(1,100), 1, 'asdfasdf', 'gergqer');
INSERT 0 100
INSERT INTO data SELECT * FROM data;
INSERT 0 100
INSERT INTO data SELECT * FROM data;
INSERT 0 200
-- Insert data cyclically until the data volume reaches 3 million.
INSERT INTO data SELECT * FROM data;
INSERT 0 1638400
SELECT count(*) FROM data;
  count
---------
 3276800
(1 row)
```
Import the data to a row-store table in batches and check the table size, which is 223 MB.
```
CREATE TABLE row (LIKE data INCLUDING ALL);
CREATE TABLE
INSERT INTO row SELECT * FROM data;
INSERT 0 3276800
SELECT pg_size_pretty(pg_relation_size('row'));
 pg_size_pretty
----------------
 223 MB
(1 row)
```
Import the same data to an HStore Opt table in batches and check the table size, which is about 3.5 MB.
```
CREATE TABLE hs(a int, b bigint, c varchar(10), d varchar(10)) WITH (orientation=column, enable_hstore_opt=on);
CREATE TABLE
INSERT INTO hs SELECT * FROM data;
INSERT 0 3276800
SELECT pg_size_pretty(pg_relation_size('hs'));
 pg_size_pretty
----------------
 3568 KB
(1 row)
```
HStore tables compress well here because the table structure is simple and the data is highly repetitive. Their compression ratio is usually three to five times that of row-store tables.
- Batch query performance
It takes approximately four seconds to query the fourth column in the row-store table.
```
EXPLAIN ANALYZE SELECT d FROM data;
 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 id |          operation           |        A-time        | A-rows  | E-rows  | Peak Memory  | E-memory | A-width | E-width | E-costs
 ----+------------------------------+----------------------+---------+---------+--------------+----------+---------+---------+----------
   1 | -> Streaming (type: GATHER)  | 4337.881             | 3276800 | 3276800 | 32KB         |          |         | 8       | 61891.00
   2 | -> Seq Scan on data          | [1571.995, 1571.995] | 3276800 | 3276800 | [32KB, 32KB] | 1MB      |         | 8       | 61266.00
```
It takes about 300 milliseconds to query the fourth column in the HStore Opt table.

```
EXPLAIN ANALYZE SELECT d FROM hs;
 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 id |               operation               |       A-time       | A-rows  | E-rows  |  Peak Memory   | E-memory | A-width | E-width | E-costs
 ----+---------------------------------------+--------------------+---------+---------+----------------+----------+---------+---------+----------
   1 | -> Row Adapter                        | 335.280            | 3276800 | 3276800 | 24KB           |          |         | 8       | 15561.80
   2 | -> Vector Streaming (type: GATHER)    | 111.492            | 3276800 | 3276800 | 96KB           |          |         | 8       | 15561.80
   3 | -> CStore Scan on hs                  | [111.116, 111.116] | 3276800 | 3276800 | [254KB, 254KB] | 1MB      |         | 8       | 14936.80
```
Column-store tables and HStore tables outperform row-store tables in batch queries.
Changing Ordinary Row- and Column-Store Tables to HStore Tables
When an ordinary column-store delta table is used, MERGE operations may not be performed promptly, because the delta table lacks a scheduled merge mechanism and merging competes for disk bandwidth. This can lead to delta table expansion and a decline in query performance and concurrent update efficiency.
In comparison, HStore tables feature concurrent data import, high-performance queries, and an asynchronous merge mechanism, so they can replace column-store delta tables.
Here is a guide for changing ordinary column-store and row-store tables to HStore tables:
- Check the delta table list and replace nspname with the target namespace.
```sql
SELECT n.nspname || '.' || c.relname AS tbname, reloptions::text AS op
FROM pg_class c, pg_namespace n
WHERE c.relnamespace = n.oid
  AND c.relkind = 'r'
  AND c.oid > 16384
  AND n.nspname = 'public'
  AND (reloptions::text LIKE '%enable_delta=on%' OR reloptions::text LIKE '%enable_delta=true%')
  AND reloptions::text NOT LIKE '%enable_hstore_opt%';
```
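The query returns one row per delta table; hypothetical output for a single table public.t1:

```
  tbname   |                  op
-----------+--------------------------------------
 public.t1 | {orientation=column,enable_delta=on}
(1 row)
```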
Steps 2 and 3 only generate statements. After running them, execute the generated table creation statements from step 2 and then the generated data import statements from step 3, in sequence.
- Generate the table creation statement containing the enable_hstore_opt parameter.
```sql
SELECT 'create table if not exists ' || tbname || '_opt (like ' || tbname || ' INCLUDING all EXCLUDING reloptions) with(orientation=column, enable_hstore_opt=on);'
FROM (SELECT n.nspname || '.' || c.relname AS tbname, reloptions::text AS op
      FROM pg_class c, pg_namespace n
      WHERE c.relnamespace = n.oid
        AND c.relkind = 'r'
        AND c.oid > 16384
        AND n.nspname = 'public'
        AND (reloptions::text LIKE '%enable_delta=on%' OR reloptions::text LIKE '%enable_delta=true%')
        AND reloptions::text NOT LIKE '%enable_hstore_opt%') AS t;
```
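For the hypothetical delta table public.t1, this step would generate the following DDL:

```sql
create table if not exists public.t1_opt (like public.t1 INCLUDING all EXCLUDING reloptions) with(orientation=column, enable_hstore_opt=on);
```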
- Generate the data import statements. The generated script imports the data into the new table, renames the tables, and drops the old table.
```sql
-- Note: RENAME TO takes an unqualified name, so relname (without the schema) is used for the new names.
SELECT 'start transaction;
lock table ' || tbname || ' in EXCLUSIVE mode;
insert into ' || tbname || '_opt select * from ' || tbname || ';
alter table ' || tbname || ' rename to ' || relname || '_bk;
alter table ' || tbname || '_opt rename to ' || relname || ';
commit;
drop table ' || tbname || '_bk;'
FROM (SELECT n.nspname || '.' || c.relname AS tbname, c.relname AS relname, reloptions::text AS op
      FROM pg_class c, pg_namespace n
      WHERE c.relnamespace = n.oid
        AND c.relkind = 'r'
        AND c.oid > 16384
        AND n.nspname = 'public'
        AND (reloptions::text LIKE '%enable_delta=on%' OR reloptions::text LIKE '%enable_delta=true%')
        AND reloptions::text NOT LIKE '%enable_hstore_opt%') AS t;
```
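For the same hypothetical table public.t1, the generated migration script looks as follows; run it only after the matching CREATE TABLE statement from step 2:

```sql
start transaction;
lock table public.t1 in EXCLUSIVE mode;
insert into public.t1_opt select * from public.t1;
alter table public.t1 rename to t1_bk;
alter table public.t1_opt rename to t1;
commit;
drop table public.t1_bk;
```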