Updated on 2024-12-13 GMT+08:00

Storing ClickHouse Cold and Hot Data Separately

This section applies to MRS 3.3.1-LTS or later.

Scenario

Based on the multi-volume storage of the open-source ClickHouse, MRS allows ClickHouse tables to be stored in volumes of multiple devices. You can define different types of disks in a volume and store the cold and hot data separately. Cold data is stored in OBS, and hot data is stored in ClickHouse. ClickHouse clusters provide optimal query performance and long-term data storage at a low cost.

The process of configuring separated storage for cold and hot data in ClickHouse is as follows:

  1. Creating an OBS Parallel File System
  2. Creating a Cloud Service Agency and Binding It to a Cluster
  3. Creating a Common Account Agency and Binding It to a Cluster
  4. Adding OBS Disk Information to the ClickHouse Cluster
  5. Configuring Disk Storage Policies
  6. Custom Cold-Hot Separation Policy
  7. Importing Data into ClickHouse to Verify the Configuration

Principles

OBS is a secure, reliable, and cost-effective distributed storage service that can handle large-scale data. ClickHouse adopts a cold and hot data separated architecture on top of OBS advantages. The SSD of the node where the ClickHouse instance is deployed stores the hot data (recently generated and frequently accessed). OBS stores the cold data (accessed less frequently than hot data). When creating a table, use TTL (a time policy) to store cold data and hot data separately.
Figure 1 Working principle of cold and hot data storage

Terms you need to know are described as follows:

Volume: a collection of ordered disks

Storage Policy: a storage policy, a collection of volumes, and rules for moving data among volumes

Creating an OBS Parallel File System

  1. Log in to the management console.
  2. Click Service List and choose Storage > Object Storage Service.
  3. Click Parallel File System and click Create Parallel File System. On the displayed page, set the following parameters, and click Create Now.
    • Region: Select the region where the MRS cluster is located.
    • File System Name: Enter a file system name, for example, mrs-ck-obs.
  4. Click the name of the created parallel file system. On the Overview page, view and record the value of Endpoint, which is the endpoint of OBS.
  5. Obtain the IAM endpoint.
    • To obtain the system endpoint, remove the prefix of the OBS endpoint, that is, obs..
    • The endpoint of IAM is iam.System endpoint.

    For example, if the endpoint of OBS is obs.xxx.huawei.com, the system endpoint is xxx.huawei.com and the IAM endpoint is iam.cxxx.huawei.com.

  1. Obtain the AK/SK information.
    1. Move the cursor to the login user name in the upper right corner and select My Credentials from the drop-down list.
    2. On the API Credentials page, obtain the Account ID whish is used as the domain ID.
    3. Click Access Keys, click Create Access Key, and enter the verification code or password. Click OK and download the access keys. Obtain the AK/SK information from the .csv file.

      After a service is deleted or a cluster is uninstalled, dirty data may remain in the parallel file system 3. You need to delete the dirty data.

Creating a Cloud Service Agency and Binding It to a Cluster

  1. Log in to the management console.
  2. In the service list, choose Management & Governance > Identity and Access Management.
  3. In the navigation pane on the left, choose Agencies and click Create Agency. On the displayed page, set the following parameters and click Next:
    • Agency Name: Enter an agency name, for example, mrs_clickhouse_obs.
    • Agency Type: Select Cloud service.
    • Cloud Service: Select ECS BMS.
    • Validity Period: Select Unlimited.
  4. On the displayed page, search for the OBS OperateAccess policy and select it.
  5. Click Next, click Show More, select Global services, and click OK.
  6. In the displayed dialog box, click OK to start authorization. Click Finish after the message "Authorization successful." is displayed.
  7. On the MRS console, choose Active Clusters in the navigation pane on the left and click the name of the target cluster to access its details page.
  8. On the Dashboard page, click Synchronize on the right of IAM User Sync to synchronize the IAM user.
  9. Click Manage Agency on the right of Agency, select the created agency, for example, mrs_clickhouse_obs, and click OK to bind the agency to the cluster.

Creating a Common Account Agency and Binding It to a Cluster

  1. Log in to the management console.
  2. In the service list, choose Management & Governance > Identity and Access Management.
  3. Choose Permissions > Policies/Roles, and click Create Custom Policy. Set the following parameters, and click OK:
    • Policy Name: Enter a policy name, for example, clickhouse-policy.
    • Policy View: Select JSON.
    • Policy Content: Enter the following content:
      {
          "Version": "1.1",
          "Statement": [
              {
                  "Effect":"Allow",
                  "Action":[
                      "OBS:*:*"
                      ]
              }
              ]
      }

  4. In the navigation pane on the left, click Agencies. Click Create Agency. On the displayed page, set the following parameters and click Next:
    • Agency Name: Enter an agency name, for example, agency-clickhouse-to-OBS.
    • Agency Type: Select Account.
    • Enter your cloud account in the Delegated Account field, that is, the account you register using your mobile phone number. It cannot be a federated user or an IAM user created using your cloud account.
    • Validity Period: Select Unlimited.
  5. In the search box on the displayed Authorize Agency page, search for the custom policy created in 3 and select it, for example, clickhouse-policy.
  6. Click Next, click Show More, select Global services, and click OK.
  7. Check and record the agency ID.
  8. On the Identity and Access Management page, click Agencies.
  9. Click the name of the cloud service agency created in 3, for example, mrs_clickhouse_obs.
  10. Click the Permissions tab and click Authorize. On the displayed page, click Create Policy in the upper right corner, and set the parameters as follows:
    • Policy Name: Enter a policy name, for example, clickhouse-assume-policy.
    • Policy View: Select JSON.
    • Policy Content: Enter the following content. {Agency ID} indicates the ID recorded in 7.
      {
          "Version": "1.1",
          "Statement": [
              {
                  "Action": [
                      "iam:agencies:assume"
                  ],
                  "Resource": {
                      "uri": [
                          "/iam/agencies/{Agency ID}"
                      ]
                  },
                  "Effect": "Allow"
              }
          ]
      }
  11. Click Next. On the Select Policy/Role page, select the policy created in 10.
  12. Click Next, click Show More, select Global services, and click OK.

Adding OBS Disk Information to the ClickHouse Cluster

  1. Log in to FusionInsight Manager and choose Cluster > Services > ClickHouse. Click Configurations then All Configurations, click ClickHouseServer(Role), and select Storage.
  2. OBS access credentials can be an agency or AK/SK. Choose either of them.
    • Configure an agency to obtain OBS access credentials.

      Parameter

      Description

      storage_configuration.obs.obs_domain_id

      Domain ID of the user who accesses IAM

      For details, see the domain ID obtained in Creating an OBS Parallel File System.

      storage_configuration.obs.obs_iam_endpoint

      For details, see the IAM endpoint obtained in Creating an OBS Parallel File System.

      storage_configuration.obs.obs_agency_name

      Agency name. For details about how to obtain the agency name, see Creating a Common Account Agency and Binding It to a Cluster. Example value: agency-clickhouse-to-OBS

    • Use AK/SK as the OBS access credentials.

      Parameter

      Description

      storage_configuration.obs.obs_user_ak

      AK of the IAM user in prerequisites

      To obtain the AK information, refer to 6.

      storage_configuration.obs.obs_user_sk

      SK of the IAM user in prerequisites

      To obtain the SK information, refer to 6.

      Figure 2 Adding AK/SK to OBS credentials
  3. In the clickhouse-config-customize parameter, add custom configuration items based on the following table to configure OBS disk information.
    Table 1 OBS disk parameters

    Parameter

    Description

    storage_configuration.disks.disk_s3.type

    Fixed at s3

    storage_configuration.disks.disk_s3.endpoint

    Access path of the created OBS parallel file system. The format is as follows:

    https://Parallel file system name.Parallel File Endpoint/Folder Name/

    • Parallel file system name: Name of the OBS parallel file system you cerated
    • Parallel file endpoint: To obtain the endpoint, click the OBS parallel file name and check the basic information on the overview page.
    • Folder name: name of the folder created in the OBS parallel file system

    disk_s3 indicates the OBS disk name, which can be customized.

    Figure 3 Configuring an OBS disk with clickhouse-config-customize
  4. (Optional) Cache data stored on OBS to local storage to accelerate data query. To use the local cache, add configuration items to clickhouse-config-customize.

    Parameter

    Description

    storage_configuration.disks.obs_cache.type

    Fixed at cache

    storage_configuration.disks.obs_cache.path

    Path for caching data on a local disk, for example, /srv/BigData/data1/clickhouse/cache/

    storage_configuration.disks.obs_cache.max_size

    Cache size, for example, 10 Gi

    storage_configuration.disks.obs_cache.disk

    Name of the OBS storage disk, for example, disk_s3

    obs_cache indicates the OBS cache name, which can be customized.

    Figure 4 Configuring an OBS cache with clickhouse-config-customize

Configuring Disk Storage Policies

Configure disk storage policies on FusionInsight Manager.

  1. Log in to FusionInsight Manager, choose Cluster > ClickHouse > Instances, and click a ClickHouseServer. On the displayed page, click the config.xml file.

  2. In the config.xml file, obtain the value of disk of storage_configuration.

  1. Choose Cluster > ClickHouse > Configurations > All Configurations > ClickHouse(Service) and search for the _clickhouse.storage_configuration.policies parameter.
    Set the following parameters listed in the table.

    Configuration Item

    Value

    Description

    storage_configuration.policies.hot_cold_separation_policy.volumes.hot_volume.disk[1]

    disk1

    This is a configuration item for local disks. You can configure multiple local disks as you need. The value ranges from disk1 to diskn. For details about how to obtain the parameter value, see 2.

    hot_cold_separation_policy indicates the name of a policy item, which can be customized.

    hot_volume indicates the volume name in the hot data storage policy, which can be customized.

    storage_configuration.policies.hot_cold_separation_policy.volumes.hot_volume.disk[2]

    disk2

    storage_configuration.policies.hot_cold_separation_policy.volumes.hot_volume.disk[n]

    diskn

    storage_configuration.policies.hot_cold_separation_policy.volumes.hot_volume.priority

    1

    Priority of volumes in the hot data storage policy. If the value is greater than that of storage_configuration.policies.hot_cold_separation_policy.volumes.cold_obs_volume.priority, the hot data disk has a higher priority than the cold data disk.

    storage_configuration.policies.hot_cold_separation_policy.volumes.cold_obs_volume.priority

    0

    Priority of volumes in the cold data storage policy. If the value is greater than that of storage_configuration.policies.hot_cold_separation_policy.volumes.hot_volume.priority, the cold data disk has a higher priority than the hot data disk.

    storage_configuration.policies.hot_cold_separation_policy.volumes.cold_obs_volume.disk

    OBS disk or cache name

    cold_obs_volume indicates the volume name in the cold data OBS storage policy, which can be customized.

    • If you do not use the local disk cache, set this parameter to the name of the OBS disk, for example, disk_s3.
    • If you are using the local disk cache, set this parameter to the cache name, for example, obs_cache.

    storage_configuration.policies.hot_cold_separation_policy.move_factor

    0.2

    move_factor indicates when data is moved to another volume. When the available space of a volume is less than move_factor, data is automatically moved to the next volume. In this example, when the available space of the hot volume is less than 20%, data in this volume is automatically moved to the cold volume. The value range is 0 to 1.

    The following figure shows how to configure a storage policy.

  2. Save the configuration and restart ClickHouse.

Custom Cold-Hot Separation Policy

  1. Run the clickhouse client command to connect to the ClickHouseServer node. For details, see ClickHouse Client Practices.
  2. Create a ReplicatedMergeTree table for configuring the TTL for cold and hot storage.

    ClickHouse supports table-level TTLs and allows you to set time-based rules to move data between specified disks or volumes. Data is stored at different storage layers.

    The following is a table creation statement:
    CREATE TABLE example_table on cluster default_cluster
     ( 
    d DateTime, 
    a Int
    ) 
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/example_table', '{replica}')
    PARTITION BY toYYYYMM(d) 
    ORDER BY d 
    TTL d + INTERVAL 3 YEAR, 
    d + INTERVAL 1 YEAR TO VOLUME 'cold_obs_volume' 
    SETTINGS storage_policy = 'hot_cold_separation_policy';

    In this statement, the TTL configuration is in bold. This example deletes cold data older than three years from the example_table table, and moves cold data older than one year to OBS. The d column stores the time information about the data.

    The TTL policy is specified using a simple SQL expression that contains the time and time interval. For example:

    • The data will expire in three days since its creation (date_time).
      TTL date_time + INTERVAL 3 DAY
    • The data will expire one year later since its creation (date_time).
      TTL date_time + INTERVAL 1 YEAR

    The INTERVAL keyword supports second, minute, hour, day, week, month, quarter, and year.

    storage_policy indicates the specified custom storage policy. In this example, the value is the policy name _clickhouse.storage_configuration.disks defined in 3.

Importing Data into ClickHouse to Verify the Configuration

  1. Insert data for verification.
    insert into example_table values('2023-12-27','10086');  -- hot data
    insert into example_table values('2023-12-26','10086');  -- hot data
    insert into example_table values('2022-12-24','10086');  -- cold data
    insert into example_table values('2022-11-24','10086');  -- cold data
    insert into example_table values('2018-10-01','10086');  -- deleted data
    insert into example_table values('2017-10-01','10086');  -- deleted data
  2. Query table data.
    tesspmrB0002.mrs-dviw.com :) select * from example_table FORMAT CSV;
    SELECT *
    FROM example_table
    FORMAT CSV
    Query id: 3d6bfcb4-f082-4e97-8d95-f885e22ae689
     
    "2023-12-27 00:00:00",10086
    "2023-12-26 00:00:00",10086
    "2022-11-24 00:00:00",10086
    "2022-12-24 00:00:00",10086
     
    4 rows in set. Elapsed: 0.037 sec.

    Assume that current system time is April 2024. According to the TTL policy rule example_table, the data generated on 2018-10-01 and 2017-10-01 will be deleted.

  3. Query the storage paths of hot data and cold data.

    select name, partition, active, path from system.parts where database =' database name'and table='table name' and active = 1;

    Figure 5 Query result

    Assume that the current system time is April 2024. Data older than one year (d column values: 202211 and 202212) in the example_table table is stored in the disk_s3 OBS bucket.