Storing ClickHouse Cold and Hot Data Separately
This section applies to MRS 3.3.1-LTS or later.
Scenario
Based on the multi-volume storage of the open-source ClickHouse, MRS allows ClickHouse tables to be stored in volumes of multiple devices. You can define different types of disks in a volume and store the cold and hot data separately. Cold data is stored in OBS, and hot data is stored in ClickHouse. ClickHouse clusters provide optimal query performance and long-term data storage at a low cost.
The process of configuring separated storage for cold and hot data in ClickHouse is as follows:
- Creating an OBS Parallel File System
- Creating a Cloud Service Agency and Binding It to a Cluster
- Creating a Common Account Agency and Binding It to a Cluster
- Adding OBS Disk Information to the ClickHouse Cluster
- Configuring Disk Storage Policies
- Custom Cold-Hot Separation Policy
- Importing Data into ClickHouse to Verify the Configuration
Principles
Terms you need to know are described as follows:
Volume: a collection of ordered disks
Storage Policy: a storage policy, a collection of volumes, and rules for moving data among volumes
Creating an OBS Parallel File System
- Log in to the management console.
- Click Service List and choose Storage > Object Storage Service.
- Click Parallel File System and click Create Parallel File System. On the displayed page, set the following parameters, and click Create Now.
- Region: Select the region where the MRS cluster is located.
- File System Name: Enter a file system name, for example, mrs-ck-obs.
- Click the name of the created parallel file system. On the Overview page, view and record the value of Endpoint, which is the endpoint of OBS.
- Obtain the IAM endpoint.
- To obtain the system endpoint, remove the prefix of the OBS endpoint, that is, obs..
- The endpoint of IAM is iam.System endpoint.
For example, if the endpoint of OBS is obs.xxx.huawei.com, the system endpoint is xxx.huawei.com and the IAM endpoint is iam.cxxx.huawei.com.
- Obtain the AK/SK information.
- Move the cursor to the login user name in the upper right corner and select My Credentials from the drop-down list.
- On the API Credentials page, obtain the Account ID whish is used as the domain ID.
- Click Access Keys, click Create Access Key, and enter the verification code or password. Click OK and download the access keys. Obtain the AK/SK information from the .csv file.
After a service is deleted or a cluster is uninstalled, dirty data may remain in the parallel file system 3. You need to delete the dirty data.
Creating a Cloud Service Agency and Binding It to a Cluster
- Log in to the management console.
- In the service list, choose Management & Governance > Identity and Access Management.
- In the navigation pane on the left, choose Agencies and click Create Agency. On the displayed page, set the following parameters and click Next:
- Agency Name: Enter an agency name, for example, mrs_clickhouse_obs.
- Agency Type: Select Cloud service.
- Cloud Service: Select ECS BMS.
- Validity Period: Select Unlimited.
- On the displayed page, search for the OBS OperateAccess policy and select it.
- Click Next, click Show More, select Global services, and click OK.
- In the displayed dialog box, click OK to start authorization. Click Finish after the message "Authorization successful." is displayed.
- On the MRS console, choose Active Clusters in the navigation pane on the left and click the name of the target cluster to access its details page.
- On the Dashboard page, click Synchronize on the right of IAM User Sync to synchronize the IAM user.
- Click Manage Agency on the right of Agency, select the created agency, for example, mrs_clickhouse_obs, and click OK to bind the agency to the cluster.
Creating a Common Account Agency and Binding It to a Cluster
- Log in to the management console.
- In the service list, choose Management & Governance > Identity and Access Management.
- Choose Permissions > Policies/Roles, and click Create Custom Policy. Set the following parameters, and click OK:
- Policy Name: Enter a policy name, for example, clickhouse-policy.
- Policy View: Select JSON.
- Policy Content: Enter the following content:
{ "Version": "1.1", "Statement": [ { "Effect":"Allow", "Action":[ "OBS:*:*" ] } ] }
- In the navigation pane on the left, click Agencies. Click Create Agency. On the displayed page, set the following parameters and click Next:
- Agency Name: Enter an agency name, for example, agency-clickhouse-to-OBS.
- Agency Type: Select Account.
- Enter your cloud account in the Delegated Account field, that is, the account you register using your mobile phone number. It cannot be a federated user or an IAM user created using your cloud account.
- Validity Period: Select Unlimited.
- In the search box on the displayed Authorize Agency page, search for the custom policy created in 3 and select it, for example, clickhouse-policy.
- Click Next, click Show More, select Global services, and click OK.
- Check and record the agency ID.
- On the Identity and Access Management page, click Agencies.
- Click the name of the cloud service agency created in 3, for example, mrs_clickhouse_obs.
- Click the Permissions tab and click Authorize. On the displayed page, click Create Policy in the upper right corner, and set the parameters as follows:
- Policy Name: Enter a policy name, for example, clickhouse-assume-policy.
- Policy View: Select JSON.
- Policy Content: Enter the following content. {Agency ID} indicates the ID recorded in 7.
{ "Version": "1.1", "Statement": [ { "Action": [ "iam:agencies:assume" ], "Resource": { "uri": [ "/iam/agencies/{Agency ID}" ] }, "Effect": "Allow" } ] }
- Click Next. On the Select Policy/Role page, select the policy created in 10.
- Click Next, click Show More, select Global services, and click OK.
Adding OBS Disk Information to the ClickHouse Cluster
- Log in to FusionInsight Manager and choose Cluster > Services > ClickHouse. Click Configurations then All Configurations, click ClickHouseServer(Role), and select Storage.
- OBS access credentials can be an agency or AK/SK. Choose either of them.
- Configure an agency to obtain OBS access credentials.
Parameter
Description
storage_configuration.obs.obs_domain_id
Domain ID of the user who accesses IAM
For details, see the domain ID obtained in Creating an OBS Parallel File System.
storage_configuration.obs.obs_iam_endpoint
For details, see the IAM endpoint obtained in Creating an OBS Parallel File System.
storage_configuration.obs.obs_agency_name
Agency name. For details about how to obtain the agency name, see Creating a Common Account Agency and Binding It to a Cluster. Example value: agency-clickhouse-to-OBS
- Use AK/SK as the OBS access credentials.
Parameter
Description
storage_configuration.obs.obs_user_ak
AK of the IAM user in prerequisites
To obtain the AK information, refer to 6.
storage_configuration.obs.obs_user_sk
SK of the IAM user in prerequisites
To obtain the SK information, refer to 6.
Figure 2 Adding AK/SK to OBS credentials
- Configure an agency to obtain OBS access credentials.
- In the clickhouse-config-customize parameter, add custom configuration items based on the following table to configure OBS disk information.
Table 1 OBS disk parameters Parameter
Description
storage_configuration.disks.disk_s3.type
Fixed at s3
storage_configuration.disks.disk_s3.endpoint
Access path of the created OBS parallel file system. The format is as follows:
https://Parallel file system name.Parallel File Endpoint/Folder Name/
- Parallel file system name: Name of the OBS parallel file system you cerated
- Parallel file endpoint: To obtain the endpoint, click the OBS parallel file name and check the basic information on the overview page.
- Folder name: name of the folder created in the OBS parallel file system
disk_s3 indicates the OBS disk name, which can be customized.
Figure 3 Configuring an OBS disk with clickhouse-config-customize
- (Optional) Cache data stored on OBS to local storage to accelerate data query. To use the local cache, add configuration items to clickhouse-config-customize.
Parameter
Description
storage_configuration.disks.obs_cache.type
Fixed at cache
storage_configuration.disks.obs_cache.path
Path for caching data on a local disk, for example, /srv/BigData/data1/clickhouse/cache/
storage_configuration.disks.obs_cache.max_size
Cache size, for example, 10 Gi
storage_configuration.disks.obs_cache.disk
Name of the OBS storage disk, for example, disk_s3
obs_cache indicates the OBS cache name, which can be customized.
Figure 4 Configuring an OBS cache with clickhouse-config-customize
Configuring Disk Storage Policies
Configure disk storage policies on FusionInsight Manager.
- Log in to FusionInsight Manager, choose Cluster > ClickHouse > Instances, and click a ClickHouseServer. On the displayed page, click the config.xml file.
- In the config.xml file, obtain the value of disk of storage_configuration.
- Choose Cluster > ClickHouse > Configurations > All Configurations > ClickHouse(Service) and search for the _clickhouse.storage_configuration.policies parameter.
Set the following parameters listed in the table.
Configuration Item
Value
Description
storage_configuration.policies.hot_cold_separation_policy.volumes.hot_volume.disk[1]
disk1
This is a configuration item for local disks. You can configure multiple local disks as you need. The value ranges from disk1 to diskn. For details about how to obtain the parameter value, see 2.
hot_cold_separation_policy indicates the name of a policy item, which can be customized.
hot_volume indicates the volume name in the hot data storage policy, which can be customized.
storage_configuration.policies.hot_cold_separation_policy.volumes.hot_volume.disk[2]
disk2
storage_configuration.policies.hot_cold_separation_policy.volumes.hot_volume.disk[n]
diskn
storage_configuration.policies.hot_cold_separation_policy.volumes.hot_volume.priority
1
Priority of volumes in the hot data storage policy. If the value is greater than that of storage_configuration.policies.hot_cold_separation_policy.volumes.cold_obs_volume.priority, the hot data disk has a higher priority than the cold data disk.
storage_configuration.policies.hot_cold_separation_policy.volumes.cold_obs_volume.priority
0
Priority of volumes in the cold data storage policy. If the value is greater than that of storage_configuration.policies.hot_cold_separation_policy.volumes.hot_volume.priority, the cold data disk has a higher priority than the hot data disk.
storage_configuration.policies.hot_cold_separation_policy.volumes.cold_obs_volume.disk
OBS disk or cache name
cold_obs_volume indicates the volume name in the cold data OBS storage policy, which can be customized.
- If you do not use the local disk cache, set this parameter to the name of the OBS disk, for example, disk_s3.
- If you are using the local disk cache, set this parameter to the cache name, for example, obs_cache.
storage_configuration.policies.hot_cold_separation_policy.move_factor
0.2
move_factor indicates when data is moved to another volume. When the available space of a volume is less than move_factor, data is automatically moved to the next volume. In this example, when the available space of the hot volume is less than 20%, data in this volume is automatically moved to the cold volume. The value range is 0 to 1.
The following figure shows how to configure a storage policy.
- Save the configuration and restart ClickHouse.
Custom Cold-Hot Separation Policy
- Run the clickhouse client command to connect to the ClickHouseServer node. For details, see ClickHouse Client Practices.
- Create a ReplicatedMergeTree table for configuring the TTL for cold and hot storage.
ClickHouse supports table-level TTLs and allows you to set time-based rules to move data between specified disks or volumes. Data is stored at different storage layers.
The following is a table creation statement:CREATE TABLE example_table on cluster default_cluster ( d DateTime, a Int ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/default/example_table', '{replica}') PARTITION BY toYYYYMM(d) ORDER BY d TTL d + INTERVAL 3 YEAR, d + INTERVAL 1 YEAR TO VOLUME 'cold_obs_volume' SETTINGS storage_policy = 'hot_cold_separation_policy';
In this statement, the TTL configuration is in bold. This example deletes cold data older than three years from the example_table table, and moves cold data older than one year to OBS. The d column stores the time information about the data.
The TTL policy is specified using a simple SQL expression that contains the time and time interval. For example:
- The data will expire in three days since its creation (date_time).
TTL date_time + INTERVAL 3 DAY
- The data will expire one year later since its creation (date_time).
TTL date_time + INTERVAL 1 YEAR
The INTERVAL keyword supports second, minute, hour, day, week, month, quarter, and year.
storage_policy indicates the specified custom storage policy. In this example, the value is the policy name _clickhouse.storage_configuration.disks defined in 3.
- The data will expire in three days since its creation (date_time).
Importing Data into ClickHouse to Verify the Configuration
- Insert data for verification.
insert into example_table values('2023-12-27','10086'); -- hot data insert into example_table values('2023-12-26','10086'); -- hot data insert into example_table values('2022-12-24','10086'); -- cold data insert into example_table values('2022-11-24','10086'); -- cold data insert into example_table values('2018-10-01','10086'); -- deleted data insert into example_table values('2017-10-01','10086'); -- deleted data
- Query table data.
tesspmrB0002.mrs-dviw.com :) select * from example_table FORMAT CSV; SELECT * FROM example_table FORMAT CSV Query id: 3d6bfcb4-f082-4e97-8d95-f885e22ae689 "2023-12-27 00:00:00",10086 "2023-12-26 00:00:00",10086 "2022-11-24 00:00:00",10086 "2022-12-24 00:00:00",10086 4 rows in set. Elapsed: 0.037 sec.
Assume that current system time is April 2024. According to the TTL policy rule example_table, the data generated on 2018-10-01 and 2017-10-01 will be deleted.
- Query the storage paths of hot data and cold data.
select name, partition, active, path from system.parts where database =' database name'and table='table name' and active = 1;
Figure 5 Query result
Assume that the current system time is April 2024. Data older than one year (d column values: 202211 and 202212) in the example_table table is stored in the disk_s3 OBS bucket.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot