Best Practices of Column-Store Delta Tables
Working Principles
In GaussDB(DWS), data in a column-store table is stored by column. By default, 60,000 rows in each column are stored in a CU. A CU is the minimum unit for storing data in a column-store table. After a CU is generated, data in it is fixed and cannot be modified. No matter whether one or 60,000 data records are inserted into a column-store table, only one CU is generated. When a small amount of data is inserted into a column-store table for multiple times, it cannot be well depressed. As a result, data bloating occurs, which affects the query performance and disk usage.
Data in a CU file cannot be modified and can only be appended. Deleting the CU file data is to mark the old data as invalid in the dictionary. Updating the CU file data is to mark the old data as invalid and write a new record to the new CU. If a column-store table is updated or deleted for multiple times or only a small amount of data is inserted each time, the column-store table space bloats and a large amount of space cannot be effectively used.
Column-store tables are designed to import a large amount of data and store it by column for query. To solve the preceding problems, the delta table is introduced, which is a row-store table attached to a column-store table. After the delta table is enabled, when a single piece of data or a small batch of data is imported, the data is stored in the delta table to avoid small CUs. The addition, deletion, modification, and query of the delta table are the same as those of row-store tables. After the delta table is enabled, the performance of importing column-store tables is greatly improved.
Use Cases
The column-store delta table is used for hybrid row-column storage and is suitable for real-time analysis and statistics. It solves the performance problem caused by importing small batches of data and periodically merges the data to the primary table to ensure the analysis and query performance. You need to determine whether to enable delta tables based on the actual situation. Otherwise, the advantages of GaussDB(DWS) column-store tables cannot be fully utilized, wasting extra space and time.
Preparations
- You have registered a GaussDB(DWS) account and checked the account status before using GaussDB(DWS). The account cannot be in arrears or frozen.
- You have obtained the AK and SK of the account.
- The sample data has been uploaded to the traffic-data folder in an OBS bucket, and all Huawei Cloud accounts have been granted the read-only permission for accessing the OBS bucket. For details, see Checkpoint Vehicle Analysis.
Procedure
- Use DAS to connect to a cluster. Locate the required cluster in the cluster list and click Log In in the Operation column. On the DAS page that is displayed, enter the username, database name, and password, and test the connection. If the connection is successful, log in to the cluster. For details, see Using DAS to Connect to a Cluster.
- Execute the following statement to create the traffic database:
1
CREATE DATABASE traffic encoding 'utf8' template template0;
- Run the following statements to create database tables GCJL and GCJL2 for storing checkpoint vehicle information: By default, the delta table is not enabled for GCJL but is enabled for GCJL2.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
CREATE SCHEMA traffic_data; SET current_schema= traffic_data; DROP TABLE if exists GCJL; CREATE TABLE GCJL ( kkbh VARCHAR(20), hphm VARCHAR(20), gcsj DATE , cplx VARCHAR(8), cllx VARCHAR(8), csys VARCHAR(8) ) with (orientation = column, COMPRESSION=MIDDLE) distribute by hash(hphm); DROP TABLE if exists GCJL2; CREATE TABLE GCJL2 ( kkbh VARCHAR(20), hphm VARCHAR(20), gcsj DATE , cplx VARCHAR(8), cllx VARCHAR(8), csys VARCHAR(8) ) with (orientation = column, COMPRESSION=MIDDLE, ENABLE_DELTA = TRUE) distribute by hash(hphm);
- Delta tables are disabled by default. To enable delta tables, set enable_delta to true when creating column-store tables.
- You can also run the following command to enable delta tables:
1
ALTER TABLE table_name SET (enable_delta=TRUE);
- If the delta table has been enabled, you can run the following command to disable it when required:
1
ALTER TABLE table_name SET (enable_delta=FALSE);
- Create a foreign table, which is used to identify and associate the source data on OBS.
- <obs_bucket_name> indicates the OBS bucket name. Only some regions are supported. For details about the supported regions and OBS bucket names, see Supported Regions. GaussDB(DWS) clusters do not support cross-region access to OBS bucket data.
- , and replace <Access_Key_Id> and <Secret_Access_Key> with the actual value.
- If the message "ERROR: schema "xxx" does not exist Position" is displayed when you create a foreign table, the schema does not exist. Perform the previous step to create a schema.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
DROP FOREIGN table if exists GCJL_OBS; CREATE FOREIGN TABLE GCJL_OBS ( like traffic_data.GCJL ) SERVER gsmpp_server OPTIONS ( encoding 'utf8', location 'obs://<obs_bucket_name>/traffic-data/gcxx', format 'text', delimiter ',', access_key '<Access_Key_Id>', secret_access_key '<Secret_Access_Key>', chunksize '64', IGNORE_EXTRA_DATA 'on' );
- Execute the following statement to import data from the foreign table to the database table:
1 2
INSERT INTO traffic_data.GCJL select * from GCJL_OBS; INSERT INTO traffic_data.GCJL2 select * from GCJL_OBS;
It takes some time to import data.
- Run the following statement to check the size of the storage space after the database table is imported:
1 2
SELECT pg_size_pretty(pg_total_relation_size('traffic_data.GCJL')); SELECT pg_size_pretty(pg_total_relation_size('traffic_data.GCJL2'));
After the delta table is enabled, the storage space usage is reduced from 8953 MB to 6053 MB, greatly improving the import performance.
- Run the following statement to query data in the table. The query speed is improved after the delta table is enabled.
1 2
SELECT * FROM traffic_data.GCJL where hphm = 'YD38641'; SELECT * FROM traffic_data.GCJL2 where hphm = 'YD38641';
Impact of Enabling the Delta Table
- Enabling the delta table function of a column-store table can prevent small CUs from being generated when a single piece of data or a small amount of data is imported to the table, hence improving performance. For example, if 100 pieces of data are imported each time in a cluster with 3 CNs and 6 DNs, the import time can be reduced by 25%, the storage space usage can be reduced by 97%. Therefore, you need to enable the delta table before inserting a small batch of data for multiple times and disable the delta table after confirming that no small batch of data needs to be imported.
- A delta table is a row-store table attached to a column-store table. After data is inserted into a delta table, the high compression ratio of the column-store table is lost. In normal cases, column-store tables are used to import a large amount of data. Therefore, the delta table is disabled by default, if the delta table is enabled when a large amount of data is imported, more time and space are consumed. If the delta table is enabled when 10,000 data records each time are imported in a cluster with 3 DNs and 6 DNs, the import speed is four times slower and more than 10 times of the space is consumed than that when the delta table is disabled. Therefore, exercise caution when enabling the delta table.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.