Best Practices for Using HStore Tables
Working Principles
In GaussDB(DWS), a CU is the smallest unit used to store data in a column-store table. By default, each column in the table stores 60,000 rows of data as a CU. Once generated, the data in a CU cannot be altered. A single CU is generated regardless of whether 1 or 60,000 records are inserted into a column-store table. Frequent small data insertions hinder effective compression, leading to data expansion, which negatively impacts query performance and disk usage.
CU file data can only be appended, not modified. Deleting data marks it as invalid in the dictionary, while updating marks old data as deleted and writes new data to a new CU. Repeated updates or deletions cause space expansion and inefficient space usage.
The column-store Delta table addresses the issue of small CUs from frequent small data imports but does not resolve lock conflicts from concurrent updates on the same CU. A hybrid data warehouse needs to work with data sources, such as upstream databases or applications. Therefore, in real-time import scenarios, concurrent insert, update, and delete operations are necessary for timely data import and high query efficiency.
HStore tables use additional delta tables. Batch-inserted data is written directly to CUs, maintaining the compression benefits of column-store tables. Updated columns and small batch inserts are serialized, compressed, and periodically merged into primary table CUs.
Use Cases
GaussDB(DWS) uses column storage format for HStore tables to minimize disk usage, enable high-concurrency updates, and enhance query performance. HStore tables are ideal for scenarios that demand real-time data import and query capabilities, as well as the ability to process traditional TP transactions.
To enhance performance, GaussDB(DWS) 8.3.0.100 optimized HStore tables and kept the old ones for compatibility purposes. The optimized tables are known as HStore_opt tables. HStore_opt tables can replace HStore tables for better performance in all scenarios except those that require high update performance without micro-batch updates.
Creating HStore Tables and Related Views
Create an HStore table:
CREATE TABLE test1 (i int, j text) WITH (orientation = column, enable_hstore = on);
Create an HStore_opt table:
CREATE TABLE test2 (i int, j text) WITH (orientation = column, enable_hstore_opt = on);
Use the following view to check the number of tuples in the Delta table and its expansion status.
SELECT * FROM pgxc_get_hstore_delta_info('tableName');
Use functions to perform lightweight cleanup and full cleanup on the Delta table.
- A lightweight merge moves up to 60,000 accumulated I-type records, together with the deletion information, into CUs. It holds only a level-4 lock, so concurrent insert, delete, update, and query operations are not blocked; however, the space is not returned to the system.
SELECT hstore_light_merge('tableName');
- A full merge merges all records and truncates the Delta table, which returns the space to the system. However, it holds a level-8 lock, which blocks services.
SELECT hstore_full_merge('tableName');
Insert 100 data records into the HStore table in a batch. A record of type I is generated in the Delta table (n_i_type is 1).
CREATE TABLE data(a int primary key, b int);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "data_pkey" for table "data"
CREATE TABLE
INSERT INTO data values(generate_series(1,100),1);
INSERT 0 100
CREATE TABLE hs(a int primary key, b int) with(orientation=column, enable_hstore=on);
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "hs_pkey" for table "hs"
CREATE TABLE
INSERT INTO hs SELECT * FROM data;
INSERT 0 100
SELECT * FROM pgxc_get_hstore_delta_info('hs');
 node_name |      part_name      | live_tup | n_i_type | n_d_type | n_x_type | n_u_type | n_m_type | data_size
-----------+---------------------+----------+----------+----------+----------+----------+----------+-----------
 dn_1      | non partition table |        1 |        1 |        0 |        0 |        0 |        0 |      8192
(1 row)
After hstore_full_merge is executed, no tuple exists in the Delta table (the value of live_tup is 0), and the value of data_size is 0.
SELECT hstore_full_merge('hs');
 hstore_full_merge
-------------------
                 1
(1 row)

SELECT * FROM pgxc_get_hstore_delta_info('hs');
 node_name |      part_name      | live_tup | n_i_type | n_d_type | n_x_type | n_u_type | n_m_type | data_size
-----------+---------------------+----------+----------+----------+----------+----------+----------+-----------
 dn_1      | non partition table |        0 |        0 |        0 |        0 |        0 |        0 |         0
(1 row)
Delete a record. The Delta table now contains a record of type D (n_d_type is 1).
DELETE hs where a = 1;
DELETE 1
SELECT * FROM pgxc_get_hstore_delta_info('hs');
 node_name |      part_name      | live_tup | n_i_type | n_d_type | n_x_type | n_u_type | n_m_type | data_size
-----------+---------------------+----------+----------+----------+----------+----------+----------+-----------
 dn_1      | non partition table |        1 |        0 |        1 |        0 |        0 |        0 |      8192
(1 row)
Usage Practice
For optimal performance of HStore tables, it is crucial to configure the following parameter settings:
Set autovacuum_max_workers_hstore to 3, autovacuum_max_workers to 6, autovacuum to true, and enable_col_index_vacuum to on.
- Concurrent update
After a batch of data is inserted into a column-store table, two sessions are started. In session 1, one record is deleted and the transaction is left open.
CREATE TABLE col(a int, b int) with(orientation=column);
CREATE TABLE
INSERT INTO col select * from data;
INSERT 0 100
BEGIN;
BEGIN
DELETE col where a = 1;
DELETE 1
When session 2 attempts to delete another record, it can proceed only after session 1 commits. This demonstrates the CU lock conflict in column storage.
BEGIN;
BEGIN
DELETE col where a = 2;
Repeat the experiment using the HStore table. Session 2 executes successfully without any lock wait.
BEGIN;
BEGIN
DELETE hs where a = 2;
DELETE 1
- Compression efficiency
Create a data table with 3 million data records.
CREATE TABLE data( a int, b bigint, c varchar(10), d varchar(10));
CREATE TABLE
INSERT INTO data values(generate_series(1,100),1,'asdfasdf','gergqer');
INSERT 0 100
INSERT INTO data select * from data;
INSERT 0 100
INSERT INTO data select * from data;
INSERT 0 200
-- Insert data cyclically until the data volume reaches 3 million.
INSERT INTO data select * from data;
INSERT 0 1638400
select count(*) from data;
  count
---------
 3276800
(1 row)
Import the data to a row-store table in a batch and check its size: 223 MB.
CREATE TABLE row (like data including all);
CREATE TABLE
INSERT INTO row select * from data;
INSERT 0 3276800
select pg_size_pretty(pg_relation_size('row'));
 pg_size_pretty
----------------
 223 MB
(1 row)
Import the data to a column-store HStore table in a batch and check its size: about 3.5 MB.
CREATE TABLE hs(a int, b bigint, c varchar(10), d varchar(10)) with(orientation=column, enable_hstore=on);
CREATE TABLE
INSERT INTO hs select * from data;
INSERT 0 3276800
select pg_size_pretty(pg_relation_size('hs'));
 pg_size_pretty
----------------
 3568 KB
(1 row)
The HStore table compresses especially well in this example because the table structure is simple and the data is highly repetitive. In typical workloads, the compression ratio of an HStore table is three to five times that of a row-store table.
- Batch query performance
Querying the fourth column of the row-store table created earlier takes approximately four seconds.
explain analyze select d from data;
                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 id |          operation           |        A-time        | A-rows  | E-rows  | Peak Memory  | E-memory | A-width | E-width | E-costs
----+------------------------------+----------------------+---------+---------+--------------+----------+---------+---------+----------
  1 | ->  Streaming (type: GATHER) | 4337.881             | 3276800 | 3276800 | 32KB         |          |         |       8 | 61891.00
  2 |    ->  Seq Scan on data      | [1571.995, 1571.995] | 3276800 | 3276800 | [32KB, 32KB] | 1MB      |         |       8 | 61266.00
It takes about 300 milliseconds to query the fourth column of the HStore table.
explain analyze select d from hs;
                                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 id |               operation               |       A-time       | A-rows  | E-rows  |  Peak Memory   | E-memory | A-width | E-width | E-costs
----+---------------------------------------+--------------------+---------+---------+----------------+----------+---------+---------+----------
  1 | ->  Row Adapter                       | 335.280            | 3276800 | 3276800 | 24KB           |          |         |       8 | 15561.80
  2 |    ->  Vector Streaming (type: GATHER)| 111.492            | 3276800 | 3276800 | 96KB           |          |         |       8 | 15561.80
  3 |       ->  CStore Scan on hs           | [111.116, 111.116] | 3276800 | 3276800 | [254KB, 254KB] | 1MB      |         |       8 | 14936.80
Only the batch query scenario is tested here. In this scenario, column-store tables and HStore tables outperform row-store tables in query performance.
Requirements and Suggestions for Using HStore Tables
- Configure the parameters.
HStore tables rely on a background thread to perform MERGE operations, which is essential for query performance and compression efficiency. Before using HStore tables, ensure that the related GUC parameters are configured correctly. The suggested values are listed below; a quick verification sketch follows the list.
autovacuum_max_workers_hstore: 3
autovacuum_max_workers: 6
autovacuum: true
enable_col_index_vacuum: on
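To confirm the settings are in effect, you can read them back from any session. A minimal check, using only the parameter names listed above:

-- Read back the MERGE-related GUC parameters to verify the configuration.
SHOW autovacuum;
SHOW autovacuum_max_workers;
SHOW autovacuum_max_workers_hstore;
SHOW enable_col_index_vacuum;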
- Suggestions on importing data to the database (The HStore_opt table is recommended.)
Suggestions for importing the HStore_opt table to the database:
- The performance of importing data using UPDATE is poor. You are advised to use UPSERT to import data (see the sketch after this list).
- When using DELETE to import data, use index scanning. The JDBC batch method is recommended.
- When using UPSERT to import data, enable enable_hstore_nonconflict_upsert_optimization when there is no concurrency conflict and disable it in other scenarios. The optimal path is automatically selected.
- Use MERGE INTO only when more than one million records are imported per DN in a single batch and there is no concurrent import; otherwise, duplicate data may result.
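As referenced above, the following is a minimal import sketch against the hs table created earlier (primary key on column a). The ON CONFLICT form of UPSERT and the GUC toggle are shown as assumptions based on this section; verify the exact syntax supported by your cluster version.

-- Assumed scenario: no concurrent UPSERTs, so the optimization can be enabled.
SET enable_hstore_nonconflict_upsert_optimization = on;
-- UPSERT instead of UPDATE: inserts new keys, updates existing ones.
INSERT INTO hs VALUES (1, 100)
ON CONFLICT (a) DO UPDATE SET b = EXCLUDED.b;
-- DELETE filtered on the primary key so that an index scan locates the row.
DELETE FROM hs WHERE a = 1;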
- Point query suggestion (The HStore_opt table is recommended.)
Suggestions for querying the HStore_opt table:
- Create a level-2 partition on columns with evenly distributed distinct values and frequent equivalent filter criteria. Avoid level-2 partitions on columns with skewed or few distinct values.
- When dealing with fixed filter criteria columns (excluding level-2 partitions), use the cbtree index (up to 5 columns).
- When dealing with variable filter criteria columns (excluding level-2 partitions), use the GIN index (up to 5 columns).
- For all string columns involving equivalent filtering, bitmap indexes can be specified during table creation. The number of columns is not limited, but cannot be modified later.
- Specify columns that can be filtered by time range as the partition columns.
- If a point query returns more than 100,000 records per DN, index scanning may be slower than non-index scanning. Use the GUC parameter enable_seqscan to compare the performance of the two (see the sketch after this list).
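The comparison mentioned in the last item can be run directly. A sketch using the hs table from the earlier examples:

-- Force an index scan and time the point query.
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM hs WHERE a = 1;
-- Allow the planner to choose a sequential scan and compare the timings.
SET enable_seqscan = on;
EXPLAIN ANALYZE SELECT * FROM hs WHERE a = 1;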
- Index-related
Indexes occupy extra space and offer limited point-query performance improvement. Create primary key or B-tree indexes only for UPSERT or for unique or near-unique point queries, as in the sketch below.
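For illustration, a sketch with hypothetical table and column names; the primary key serves UPSERT-based import and the B-tree index serves a near-unique point-query column. Verify the index types supported for column-store tables in your version.

-- Hypothetical table: primary key on a for UPSERT-based import.
CREATE TABLE hs_orders(a int, c varchar(10), PRIMARY KEY(a))
WITH (orientation = column, enable_hstore_opt = on);
-- B-tree index on a near-unique column that is queried by equality.
CREATE INDEX idx_hs_orders_c ON hs_orders USING btree(c);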
- MERGE-related
HStore tables rely on background autovacuum for MERGE operations. Ensure data import speed does not exceed MERGE speed to prevent delta table expansion. Control import speed by managing concurrent data import tasks. Delta table space reuse is affected by oldestXmin. Old transactions can delay space reuse, causing expansion.
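A monitoring sketch for both risks described above, assuming the pg_stat_activity columns available in GaussDB(DWS) and the hs table from earlier examples:

-- Watch Delta table size per DN; sustained growth means imports outpace MERGE.
SELECT * FROM pgxc_get_hstore_delta_info('hs');
-- Find the oldest open transactions, which can hold back oldestXmin
-- and delay Delta table space reuse.
SELECT pid, xact_start, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 5;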