Help Center> Data Warehouse Service (DWS)> Best Practices> Advanced Features> Best Practices of Column-Store Delta Tables
Updated on 2024-06-07 GMT+08:00

Best Practices of Column-Store Delta Tables

Working Principles

In GaussDB(DWS), data in a column-store table is stored by column. By default, 60,000 rows in each column are stored in a CU. A CU is the minimum unit for storing data in a column-store table. After a CU is generated, data in it is fixed and cannot be modified. No matter whether one or 60,000 data records are inserted into a column-store table, only one CU is generated. When a small amount of data is inserted into a column-store table for multiple times, it cannot be well depressed. As a result, data bloating occurs, which affects the query performance and disk usage.

Data in a CU file cannot be modified and can only be appended. Deleting the CU file data is to mark the old data as invalid in the dictionary. Updating the CU file data is to mark the old data as invalid and write a new record to the new CU. If a column-store table is updated or deleted for multiple times or only a small amount of data is inserted each time, the column-store table space bloats and a large amount of space cannot be effectively used.

Column-store tables are designed to import a large amount of data and store it by column for query. To solve the preceding problems, the delta table is introduced, which is a row-store table attached to a column-store table. After the delta table is enabled, when a single piece of data or a small batch of data is imported, the data is stored in the delta table to avoid small CUs. The addition, deletion, modification, and query of the delta table are the same as those of row-store tables. After the delta table is enabled, the performance of importing column-store tables is greatly improved.

Use Cases

The column-store delta table is used for hybrid row-column storage and is suitable for real-time analysis and statistics. It solves the performance problem caused by importing small batches of data and periodically merges the data to the primary table to ensure the analysis and query performance. You need to determine whether to enable delta tables based on the actual situation. Otherwise, the advantages of GaussDB(DWS) column-store tables cannot be fully utilized, wasting extra space and time.

Preparations

  • You have registered a GaussDB(DWS) account and checked the account status before using GaussDB(DWS). The account cannot be in arrears or frozen.
  • You have obtained the AK and SK of the account.
  • The sample data has been uploaded to the traffic-data folder in an OBS bucket, and all Huawei Cloud accounts have been granted the read-only permission for accessing the OBS bucket. For details, see Checkpoint Vehicle Analysis.

Procedure

  1. Use DAS to connect to a cluster. Locate the required cluster in the cluster list and click Log In in the Operation column. On the DAS page that is displayed, enter the username, database name, and password, and test the connection. If the connection is successful, log in to the cluster. For details, see Using DAS to Connect to a Cluster.

  2. Execute the following statement to create the traffic database:

    1
    CREATE DATABASE traffic encoding 'utf8' template template0; 
    

  3. Run the following statements to create database tables GCJL and GCJL2 for storing checkpoint vehicle information: By default, the delta table is not enabled for GCJL but is enabled for GCJL2.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    CREATE SCHEMA traffic_data;
    SET current_schema= traffic_data;
    DROP TABLE if exists GCJL;
    CREATE TABLE GCJL
    (
            kkbh   VARCHAR(20), 
            hphm   VARCHAR(20),
            gcsj   DATE ,
            cplx   VARCHAR(8),
            cllx   VARCHAR(8),
            csys   VARCHAR(8)
    )
    with (orientation = column, COMPRESSION=MIDDLE)
    distribute by hash(hphm);
    
    DROP TABLE if exists GCJL2;
    CREATE TABLE GCJL2
    (
            kkbh   VARCHAR(20), 
            hphm   VARCHAR(20),
            gcsj   DATE ,
            cplx   VARCHAR(8),
            cllx   VARCHAR(8),
            csys   VARCHAR(8)
    )
    with (orientation = column, COMPRESSION=MIDDLE, ENABLE_DELTA = TRUE)
    distribute by hash(hphm);
    
    • Delta tables are disabled by default. To enable delta tables, set enable_delta to true when creating column-store tables.
    • If the delta table is not enabled when the table is created, you can run the ALTER TABLE command to enable the delta table after the table is created.
      1
      ALTER TABLE table_name SET (enable_delta=TRUE);
      
    • To disable the enabled the delta table, run the following command:
      1
      ALTER TABLE table_name SET (enable_delta=FALSE);
      

  4. Create a foreign table, which is used to identify and associate the source data on OBS.

    • <obs_bucket_name> indicates the OBS bucket name. Only some regions are supported. For details about the supported regions and OBS bucket names, see Supported Regions. GaussDB(DWS) clusters do not support cross-region access to OBS bucket data.
    • In this practice, the CN-Hong Kong region is used as an example. Enter dws-demo-ap-southeast-1, and replace <Access_Key_Id> and <Secret_Access_Key> with the actual value.
    • If the message "ERROR: schema "xxx" does not exist Position" is displayed when you create a foreign table, the schema does not exist. Perform the previous step to create a schema.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    DROP FOREIGN table if exists GCJL_OBS;
    CREATE FOREIGN TABLE GCJL_OBS
    (
            like traffic_data.GCJL
    )
    SERVER gsmpp_server 
    OPTIONS (
            encoding 'utf8',
            location 'obs://<obs_bucket_name>/traffic-data/gcxx',
            format 'text',
            delimiter ',',
            access_key '<Access_Key_Id>',
            secret_access_key '<Secret_Access_Key>',
            chunksize '64',
            IGNORE_EXTRA_DATA 'on'
    );
    

  5. Import data from a foreign table to a database table.

    1
    2
    INSERT INTO traffic_data.GCJL select * from GCJL_OBS;
    INSERT INTO traffic_data.GCJL2 select * from GCJL_OBS;
    

    It takes some time to import data.

  6. Run the following statement to check the size of the storage space after the database table is imported:

    1
    2
    SELECT pg_size_pretty(pg_total_relation_size('traffic_data.GCJL'));
    SELECT pg_size_pretty(pg_total_relation_size('traffic_data.GCJL2'));
    

    After the delta table is enabled, the storage space usage is reduced from 8953 MB to 6053 MB, greatly improving the import performance.

  7. Query data in the table. It is found that the query speed is improved after the delta table is enabled.

    1
    2
    SELECT * FROM traffic_data.GCJL where hphm =  'YD38641';
    SELECT * FROM traffic_data.GCJL2 where hphm =  'YD38641';
    

Impact of Enabling the Delta Table

  • Enabling the delta table function for a column-store table helps avoid the creation of small CUs during the import of individual or minimal data entries. This can significantly improve the performance. For instance, importing 100 data records into a cluster with three CNs and six DNs can cut import time by 25% and reduce storage usage by 97%. It is recommended to enable the delta table function prior to multiple insertions of small data batches. After ensuring that subsequent operations do not involve small data imports, the delta table function can be disabled.
  • A delta table, which operates as a row-store table associated to a column-store table, typically remains disabled to preserve the latter's high compression ratio. When enabled, the delta table can significantly increase both time and space requirements for large-scale data import. For instance, in a cluster with three CNs and six DNs, importing 10,000 records with the delta table enabled can be up to four times slower and occupy over ten times the space compared to when it's disabled It's recommended to enable or disable the delta table based on specific service needs.