Updated on 2024-10-09 GMT+08:00

Migrating ClickHouse Data from One MRS Cluster to Another

This section applies only to MRS 3.2.0 or later.

Scenario

Scenario 1: As the number of MRS ClickHouse services increases, the storage and compute resources of existing clusters can no longer meet service requirements, and the clusters need to be split so that some user services and database data can be migrated to new clusters.

Scenario 2: The data center that hosts the backend nodes of an MRS ClickHouse cluster needs to be relocated, and the ClickHouse cluster data must be migrated along with it.

To meet these migration requirements, MRS provides a one-click data migration tool for migrating ClickHouse databases, table objects (DDL), and service data from a source cluster to a new cluster.

Migration Mechanism

  • Replicated*MergeTree table migration

    In this migration solution, ClickHouse uses ZooKeeper to automatically synchronize the data of Replicated*MergeTree tables across different replicas in the same shard. The logical procedure is as follows:

    First, add the ZooKeeper information of the source cluster to the configuration file of the destination cluster as an auxiliary ZooKeeper. Then, in the destination cluster, create a temporary table that uses the same ZooKeeper path and table structure as the source cluster but a different replica. After the temporary table is created, data in the source cluster is automatically synchronized to it. Once data synchronization is complete, copy the data from the temporary table to the formal table (see the SQL sketch after this list).
    Figure 1 Replicated*MergeTree table migration architecture
  • Distributed table migration

    During the migration of a replicated table, its metadata is exported from the source cluster, and the ZooKeeper path and replica recorded in the metadata are changed to those of the destination cluster. You can then create the table in the destination cluster based on the modified metadata.

  • Non-replicated table and materialized view migration

    To migrate data in the non-replicated tables and materialized views, you can call the remote function.

The preceding migration operations are encapsulated in the migration tool scripts. You only need to modify the related configuration files and run the migration scripts to complete the migration in one click. For details, see the procedure description.
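The following SQL sketch illustrates the mechanisms described above. It is a simplified, hypothetical example: the database, table, host, port, and ZooKeeper path names are placeholders, and the migration tool performs the equivalent operations for you.

    -- Replicated*MergeTree mechanism: in the destination cluster, create a temporary table whose
    -- ZooKeeper path points to the source table through the auxiliary ZooKeeper (named zookeeper2 here).
    -- Data of the source replicas is then synchronized to this temporary table automatically.
    CREATE TABLE db1.table_1_tmp
    (
        id UInt32,
        name String
    )
    ENGINE = ReplicatedMergeTree('zookeeper2:/clickhouse/tables/{shard}/db1/table_1', '{replica}')
    ORDER BY id;

    -- After synchronization is complete, copy the data from the temporary table to the formal table.
    INSERT INTO db1.table_1 SELECT * FROM db1.table_1_tmp;

    -- Non-replicated tables and materialized views: pull data from the source cluster with the remote function.
    -- In security mode, remoteSecure and the secure TCP port would be used instead of remote and port 9000.
    INSERT INTO db1.table_2 SELECT * FROM remote('192.168.0.11:9000', 'db1', 'table_2', 'user', 'password');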

Prerequisites

  • The ClickHouse cluster to be migrated is in the normal state, and the source and destination clusters must both be in security mode or both be in normal mode. If the clusters are in normal mode, contact O&M personnel for migration.
  • A destination ClickHouse cluster of MRS 3.1.3 or later has been created for data migration. The cluster must be in security mode. The number of ClickHouseServer instances in the destination cluster must be greater than or equal to that in the source cluster.
  • Currently, logical clusters support only data migration to clusters with the same number of replicas.

Constraints

  • Only table data and table object metadata (DDL) can be migrated. Service SQL statements, such as ETL statements, must be migrated manually.
  • To ensure data consistency before and after the migration, stop the ClickHouse service of the source cluster before the migration. For details, see the procedure description.
  • If a table in the source cluster is deleted during the migration, the problem can only be resolved manually.

Procedure

The overall migration procedure is as follows:
Figure 2 Migration flowchart
Table 1 Migration description

Step 1: Connect the source cluster to the destination cluster.
Description: This step ensures that the source and destination ClickHouse clusters as well as their nodes can communicate with each other.

Step 2: Add the ZooKeeper information of the source cluster to the configuration file of the destination cluster.
Description: By doing so, ZooKeeper in the source cluster can be used as the auxiliary ZooKeeper during data migration.

Step 3: Migrate metadata of databases and tables in the source ClickHouse cluster to the destination cluster.
Description: You can run the corresponding script to migrate metadata such as the database name, table name, and table structure of the ClickHouse databases and tables in the source cluster to the destination cluster.

Step 4: Migrate data of the databases and tables in the source ClickHouse cluster to the destination cluster.
Description: You can run the corresponding script to migrate the ClickHouse database and table data from the source cluster to the destination cluster.

Connecting the Source Cluster to the Destination Cluster

  1. Connect the source cluster to the destination cluster so that the ClickHouse instance nodes in the two clusters can communicate with each other.
  2. Add the hosts information of the source cluster to the configuration of all nodes in the destination cluster, and add the hosts information of the destination cluster to the configuration of all nodes in the source cluster.
    1. Log in to FusionInsight Manager of the source ClickHouse cluster, choose Cluster > ClickHouse, click the Instance tab, and view the service IP address of the ClickHouseServer instance node.
    2. Log in to any ClickHouseServer node using SSH and run the following command to check the host configurations of the ClickHouse instance in the source cluster:

      cat /etc/hosts

      The following figure shows the host configurations of the ClickHouse instance:

    3. Log in to FusionInsight Manager of the destination ClickHouse cluster, choose Cluster > ClickHouse, click the Instance tab, and view the service IP address of the ClickHouseServer instance node in the destination cluster.
    4. Log in to all ClickHouse nodes in the destination cluster as user root and run the following command to modify the /etc/hosts configuration of the nodes:

      vi /etc/hosts

      Copy the host information of the ClickHouse instances of the source cluster obtained in 2.b to the hosts file (see the example entries after this procedure).

    5. Repeat 2.a to 2.d to add the node IP addresses of the destination cluster to the hosts files of the source cluster.
  3. Configure system mutual trust between the source cluster and the destination cluster. For details, see Configuring Cross-Manager Mutual Trust Between Clusters.

    If both the source and destination clusters are in normal mode, you do not need to configure mutual trust.
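The host entries added in 2 are plain IP-address-to-hostname mappings. The addresses and hostnames below are hypothetical placeholders; use the actual values obtained from the peer cluster:

    192.168.0.11 node-ana-coreXXX1
    192.168.0.12 node-ana-coreXXX2
    192.168.0.13 node-ana-coreXXX3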

Adding the ZooKeeper Information of the Source Cluster to the Configuration File of the Destination Cluster

  1. Log in to FusionInsight Manager of the source cluster, choose Cluster > Services > ZooKeeper, and click the Instance tab. On the displayed page, view the service IP addresses of the ZooKeeper quorumpeer instances, as shown in Figure 3.

    Figure 3 Addresses of the source ZooKeeper quorumpeer instances
  2. Log in to FusionInsight Manager of the destination cluster, choose Cluster > Services > ClickHouse and click the Configurations tab and then All Configurations. On the displayed page, search for the clickhouse-config-customize parameter.
  3. Add ZooKeeper instance information of the source cluster to the clickhouse-config-customize parameter by referring to Table 2.
    Table 2 Configurations of clickhouse-config-customize

    • auxiliary_zookeepers.zookeeper2.node[1].host: Service IP address of the first ZooKeeper quorumpeer instance in the source cluster obtained in 1. Currently, only the IP address of the ZooKeeper instance can be configured.
    • auxiliary_zookeepers.zookeeper2.node[1].port: 2181
    • auxiliary_zookeepers.zookeeper2.node[2].host: Service IP address of the second ZooKeeper quorumpeer instance in the source cluster obtained in 1. Currently, only the IP address of the ZooKeeper instance can be configured.
    • auxiliary_zookeepers.zookeeper2.node[2].port: 2181
    • auxiliary_zookeepers.zookeeper2.node[3].host: Service IP address of the third ZooKeeper quorumpeer instance in the source cluster obtained in 1. Currently, only the IP address of the ZooKeeper instance can be configured.
    • auxiliary_zookeepers.zookeeper2.node[3].port: 2181

  4. After the configuration is complete, click Save. In the displayed dialog box, click OK.
  5. Log in to any ClickHouseServer node of the destination cluster as user root. Run the following command to view the ClickHouseServer instance information:

    ps -ef |grep clickhouse

    Obtain the value of --config-file, that is, the configuration file directory of the ClickHouseServer, from the query result.
    Figure 4 Obtaining the configuration file directory of ClickHouseServer
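    If needed, the path can also be extracted directly from the process information. This is an optional convenience command, assuming the process was started with a --config-file argument:

    ps -ef | grep clickhouse | grep -oE 'config-file[= ][^ ]+'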
  6. Run the following command to check whether the <auxiliary_zookeepers> information has been added to the ClickHouse configuration file config.xml:

    cat Path of the config.xml file obtained in 5

    • If both the source and destination clusters are in security mode, go to 7.
    • If both the source and destination clusters are in normal mode, go to 11.
    Figure 5 Viewing the added ZooKeeper information of the source cluster
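    If the configuration added in 3 has taken effect, config.xml contains an <auxiliary_zookeepers> section similar to the following sketch (the IP addresses are placeholders for the values configured in 3):

    <auxiliary_zookeepers>
        <zookeeper2>
            <node index="1">
                <host>192.168.0.21</host>
                <port>2181</port>
            </node>
            <node index="2">
                <host>192.168.0.22</host>
                <port>2181</port>
            </node>
            <node index="3">
                <host>192.168.0.23</host>
                <port>2181</port>
            </node>
        </zookeeper2>
    </auxiliary_zookeepers>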
  7. In the configuration file directory obtained in 5, run the following command to obtain the ZooKeeper authentication information of the source cluster:

    cat ENV_VARS | grep ZK

    Obtain the values of ZK_SERVER_FQDN, ZK_USER_PRINCIPAL and ZK_USER_REALM.

  8. Log in to FusionInsight Manager of the destination cluster, choose Cluster > Services > ClickHouse, click the Configurations tab and then All Configurations. In the navigation pane on the left, choose ClickHouseServer(Role) > Backup and set the parameters by referring to the following table.
    Table 3 Configuring the cluster authentication information

    • AUXILIARY_ZK_SERVER_FQDN: Value of ZK_SERVER_FQDN obtained in 7
    • AUXILIARY_ZK_SERVER_PRINCIPAL: Value of ZK_USER_PRINCIPAL obtained in 7
    • AUXILIARY_ZK_SERVER_REALM: Value of ZK_USER_REALM obtained in 7
    • METADATA_COLLECTION_TIMEOUT: 180. This parameter specifies the timeout interval, in seconds, for waiting for the completion of metadata backup on other nodes.

    Figure 6 Configuring the cluster authentication information
  9. Select Storage under ClickHouseServer(Role), and change the values of max_partition_size_to_drop and max_table_size_to_drop to 0.

  10. Click Save. In the dialog box that is displayed, click OK.
  11. On the ClickHouse service page, click the Instance tab. On this tab page, select the ClickHouseServer instance from the instance list, and choose More > Restart Instance in the Operation column to restart the ClickHouseServer instance.

Migrating Metadata of Databases and Tables in the Source ClickHouse Cluster to the Destination Cluster

  1. Log in to FusionInsight Manager of the source and destination clusters, and create the username and password required for the migration. The procedure is as follows:
    1. Log in to Manager and choose System > Permission > Role. On the displayed page, click Create Role.
    2. Specify Role Name, for example, ckrole. In the Configure Resource Permission area, click the cluster name. On the displayed service list page, click the ClickHouse service.
    3. Select SUPER_USER_GROUP and click OK.
    4. Choose System. In the navigation pane on the left, choose Permission > User and click Create.
    5. Select Human-Machine for User Type and set Password and Confirm Password to the password of the user.
      • Username: The username cannot contain hyphens (-). Otherwise, the authentication will fail.
      • Password: The password cannot contain special characters $, ., and #. Otherwise, the authentication will fail.
    6. In the Role area, click Add. In the displayed dialog box, select the role created in 1.b and click OK to add the role. Then, click OK.
    7. After the user is created, click the user name in the upper right corner to log out of the system. Log in to FusionInsight Manager as the new user and change its password as prompted.
  2. Download the ClickHouse client and install it on a node in the destination cluster as user omm.
  3. Log in to the client node as user omm and go to the Client installation directory/ClickHouse/clickhouse_migration_tool/clickhouse-metadata-migration directory. Run the following commands and modify the example_config.yaml configuration file by referring to Table 4:

    cd Client installation directory/ClickHouse/clickhouse_migration_tool/clickhouse-metadata-migration

    vi example_config.yaml

    After modifying the configuration, delete all comments starting with a number sign (#) and retain only the valid configuration items. Otherwise, an error may occur when the migration script is executed. A minimal sketch of the completed file follows Table 4.

    Table 4 Parameters in the example_config.yaml file

    Configuration Item

    Sub-item

    Value and Description

    source_cluster

    host

    IP address of any ClickHouseServer node in the source cluster.

    cluster_name

    Name of the source ClickHouse cluster. You can log in to the ClickHouse client by referring to ClickHouse Client Practices and run the following command to obtain the value. If the source cluster name has not been changed, the default value is default_cluster.

    select cluster,shard_num,replica_num,host_name from system.clusters;

    https_port

    • For security mode, log in to FusionInsight Manager of the source cluster, choose Cluster > Services > ClickHouse > Configurations > All Configurations, and search for https_port.
    • For normal mode, log in to FusionInsight Manager of the source cluster, choose Cluster > Services > ClickHouse > Configurations > All Configurations, and search for http_port.

    zookeeper_root_path

    To obtain the value, log in to FusionInsight Manager of the source cluster, choose Cluster > Services > ClickHouse, and click the Configurations tab and then All Configurations. In the displayed page, search for clickhouse.zookeeper.root.path.

    system

    System parameter. Retain the default value.

    databases

    Optional.

    • If this parameter is specified, data in the specified database of the source ClickHouse cluster is migrated. You can specify multiple databases. The following configuration is for your reference:
      databases:
          - "database"
          - "database_1"

      Data in the database and database_1 databases of the source cluster is migrated.

    • If this parameter is not specified, table data of all databases in the source ClickHouse cluster is migrated. Leave the databases parameter empty. The following is an example:
      databases:

      Table information of all databases in the source ClickHouse cluster is migrated.

    tables

    Optional. The value is in the format of Database name.Table name. The database name must be in the databases parameter list.

    • If this parameter is specified, data in specified tables in the source ClickHouse cluster database is migrated. You can configure multiple tables. The following configuration is for your reference:
      tables:
          - "database.table_1"
          - "database_1.table_2"

      Data in table_1 of database and table_2 of database_1 of the source cluster is migrated.

    • If this parameter is not specified but the databases parameter is specified, all table data in the databases listed in the databases parameter is migrated. If the databases parameter is not specified either, all table data in all databases of the source ClickHouse cluster is migrated. The following configuration is for your reference:
      tables:

    destination_cluster

    host

    IP address of any ClickHouseServer node in the destination cluster.

    cluster_name

    Name of the destination ClickHouse cluster. You can log in to the ClickHouse client by referring to ClickHouse Client Practices and run the following command to obtain the value. If the destination cluster name has not been changed, the default value is default_cluster.

    select cluster,shard_num,replica_num,host_name from system.clusters;

    user

    Username created in 1 for logging in to FusionInsight Manager of the destination ClickHouse cluster.

    https_port

    • For security mode, log in to FusionInsight Manager of the destination cluster, choose Cluster > Services > ClickHouse > Configurations > All Configurations, and search for https_port.
    • For normal mode, log in to FusionInsight Manager of the destination cluster, choose Cluster > Services > ClickHouse > Configurations > All Configurations, and search for http_port.

    zookeeper_root_path

    To obtain the value, log in to FusionInsight Manager of the destination cluster, choose Cluster > Services > ClickHouse, and click the Configurations tab and then All Configurations. In the displayed page, search for clickhouse.zookeeper.root.path.

    system

    System parameter. Retain the default value.
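    For reference, the following is a minimal sketch of how the values described in Table 4 could be filled in. All IP addresses, ports, and names are placeholders for your environment; the actual key layout is the one provided in the example_config.yaml template shipped with the tool, the system sub-item keeps its default value, the optional tables sub-item is omitted, and the inline comments are explanatory only and must be removed before the script is run (see the note in 3):

    source_cluster:
      host: "192.168.0.11"                 # IP address of any ClickHouseServer node in the source cluster
      cluster_name: "default_cluster"
      https_port: "21426"                  # placeholder; use the https_port (security mode) or http_port (normal mode) value from Manager
      zookeeper_root_path: "/clickhouse"   # placeholder; value of clickhouse.zookeeper.root.path in the source cluster
      databases:
          - "database"
          - "database_1"
    destination_cluster:
      host: "192.168.0.31"                 # IP address of any ClickHouseServer node in the destination cluster
      cluster_name: "default_cluster"
      user: "ckuser"                       # user created in 1
      https_port: "21426"                  # placeholder; value from Manager of the destination cluster
      zookeeper_root_path: "/clickhouse"   # placeholder; value of clickhouse.zookeeper.root.path in the destination cluster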

  4. Run the following command to migrate data and wait until the script execution is complete:

    ./clickhouse_migrate_metadata.sh -f yaml_file

    Enter the usernames and passwords of the source and destination clusters as prompted. In the command, yaml_file indicates the configuration file modified in 3 (example_config.yaml).

If metadata migration fails, perform the following steps:

  1. Locate the failure cause. Specifically, check whether any parameters in the configuration file are incorrectly configured.
    • If yes, reconfigure the parameters and perform metadata migration.
    • If no, go to 2.
  2. Specify the tables that failed to be migrated in the databases and tables parameters of the metadata migration configuration file (see Table 4) and run the metadata migration command again (see the example below). If the migration still fails, contact O&M personnel.
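For example, if only the table database_1.table_2 failed to be migrated, the retry configuration could narrow the scope as follows (hypothetical names):

    databases:
        - "database_1"
    tables:
        - "database_1.table_2"

Then run ./clickhouse_migrate_metadata.sh -f yaml_file again.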

Migrating Data of the Databases and Tables in the Source ClickHouse Cluster to the Destination Cluster

  1. Log in to the ClickHouse client node in the destination cluster as user omm and go to Client installation directory/ClickHouse/clickhouse_migration_tool/clickhouse-data-migration.

    cd Client installation directory/ClickHouse/clickhouse_migration_tool/clickhouse-data-migration

  2. Run the following command to modify the example_config.yaml configuration file by referring to Table 5:

    vi example_config.yaml

    After modifying the configuration, delete all comments starting with a number sign (#) and retain only the valid configuration items. Otherwise, an error may occur when the migration script is executed. A minimal sketch of the completed file follows Table 5.
    Table 5 Parameters in example_config.yaml

    Configuration Item

    Sub-item

    Value and Description

    source_cluster

    host

    IP address of any ClickHouseServer node in the source cluster.

    cluster_name

    Name of the source ClickHouse cluster. You can log in to the ClickHouse client by referring to ClickHouse Client Practices and run the following command to obtain the value. If the source cluster name has not been changed, the default value is default_cluster.

    select cluster,shard_num,replica_num,host_name from system.clusters;

    user

    Username created in 1 for logging in to FusionInsight Manager of the source ClickHouse cluster.

    https_port

    • For security mode, log in to FusionInsight Manager of the source cluster, choose Cluster > Services > ClickHouse > Configurations > All Configurations, and search for https_port.
    • For normal mode, log in to FusionInsight Manager of the source cluster, choose Cluster > Services > ClickHouse > Configurations > All Configurations, and search for http_port.

    tcp_port

    To obtain the value, log in to FusionInsight Manager of the source cluster, choose Cluster > Services > ClickHouse, and click the Configurations tab and then All Configurations. In the displayed page, search for tcp_port_secure if the cluster is in security mode. Otherwise, search for tcp_port.

    zookeeper_root_path

    To obtain the value, log in to FusionInsight Manager of the source cluster, choose Cluster > Services > ClickHouse, and click the Configurations tab and then All Configurations. In the displayed page, search for clickhouse.zookeeper.root.path.

    system

    System parameter. Retain the default value.

    databases

    Optional.

    • If this parameter is specified, data in the specified database of the source ClickHouse cluster is migrated. You can specify multiple databases. The following configuration is for your reference:
      databases:
          - "database"
          - "database_1"

      Data in the database and database_1 databases of the source cluster is migrated.

    • If this parameter is not specified, table data of all databases in the source ClickHouse cluster is migrated. Leave the databases parameter empty. The following is an example:
      databases:

      Table information of all databases in the source ClickHouse cluster is migrated.

    tables

    Optional. The value is in the format of Database name.Table name. The database name must be in the databases parameter list.

    • If this parameter is specified, data in specified tables in the source ClickHouse cluster database is migrated. You can configure multiple tables. The following configuration is for your reference:
      tables:
          - "database.table_1"
          - "database_1.table_2"

      Data in table_1 of database and table_2 of database_1 of the source cluster is migrated.

    • If this parameter is not specified but the databases parameter is specified, all table data in the databases listed in the databases parameter is migrated. If the databases parameter is not specified either, all table data in all databases of the source ClickHouse cluster is migrated. The following configuration is for your reference:
      tables:

    destination_cluster

    host

    IP address of any ClickHouseServer node in the destination cluster.

    cluster_name

    Name of the destination ClickHouse cluster. You can log in to the ClickHouse client by referring to ClickHouse Client Practices and run the following command to obtain the value. If the destination cluster name has not been changed, the default value is default_cluster.

    select cluster,shard_num,replica_num,host_name from system.clusters;

    user

    Username created in 1 for logging in to FusionInsight Manager of the destination ClickHouse cluster.

    https_port

    • For security mode, log in to FusionInsight Manager of the destination cluster, choose Cluster > Services > ClickHouse > Configurations > All Configurations, and search for https_port.
    • For normal mode, log in to FusionInsight Manager of the destination cluster, choose Cluster > Services > ClickHouse > Configurations > All Configurations, and search for http_port.

    tcp_port

    To obtain the value, log in to FusionInsight Manager of the destination cluster, choose Cluster > Services > ClickHouse, and click the Configurations tab and then All Configurations. In the displayed page, search for tcp_port_secure if the cluster is in security mode. Otherwise, search for tcp_port.

    zookeeper_root_path

    To obtain the value, log in to FusionInsight Manager of the destination cluster, choose Cluster > Services > ClickHouse, and click the Configurations tab and then All Configurations. In the displayed page, search for clickhouse.zookeeper.root.path.

    system

    System parameter. Retain the default value.

    auxiliary_zookeepers

    name

    ZooKeeper name of the source ClickHouse cluster configured in 3, for example, zookeeper2.

    hosts

    IP addresses of the ZooKeeper instances of the source ClickHouse cluster. To obtain the IP addresses, log in to FusionInsight Manager of the source cluster, choose Cluster > Services > ZooKeeper, and click the Instance tab. On the displayed page, view the service IP addresses of the ZooKeeper quorumpeer instances, as shown in Figure 3.

    The format is as follows:
    hosts:
        - "192.168.1.2"
        - "192.168.1.3"
        - "192.168.1.4"

    port

    2181

    execution_procedure

    -

    This parameter is left blank by default, indicating that the script is executed once to synchronize service data. Value options are firststep and secondstep.

    • firststep: Only the temporary replicated table is created. The auxiliary ZooKeeper synchronizes data from the source cluster to the temporary table in real time.
    • secondstep: Data in the temporary replicated table is attached to the local table of the destination cluster.
      CAUTION:

      If this parameter is set to secondstep, O&M personnel and users need to confirm that ClickHouse-related services have been stopped before script execution.

    onereplica_use_auxiliaryzookeeper

    -

    • If this parameter is set to 1, temporary tables are created for only one replica of each shard.
    • If this parameter is set to 0, temporary tables are created for two replicas of each shard.
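    For reference, the following is a minimal sketch of how the values described in Table 5 could be filled in. As with the metadata migration file, all IP addresses, ports, and names are placeholders, the actual key layout is the one provided in the example_config.yaml template shipped with the tool, the system sub-item keeps its default value, and the inline comments must be removed before the script is run:

    source_cluster:
      host: "192.168.0.11"
      cluster_name: "default_cluster"
      user: "ckuser"                       # user created in 1
      https_port: "21426"                  # placeholder; https_port (security mode) or http_port (normal mode)
      tcp_port: "21427"                    # placeholder; tcp_port_secure (security mode) or tcp_port (normal mode)
      zookeeper_root_path: "/clickhouse"   # placeholder; clickhouse.zookeeper.root.path of the source cluster
      databases:
          - "database"
      tables:
          - "database.table_1"
    destination_cluster:
      host: "192.168.0.31"
      cluster_name: "default_cluster"
      user: "ckuser"
      https_port: "21426"
      tcp_port: "21427"
      zookeeper_root_path: "/clickhouse"   # placeholder; clickhouse.zookeeper.root.path of the destination cluster
    auxiliary_zookeepers:
      name: "zookeeper2"                   # auxiliary ZooKeeper name configured in the destination cluster, for example, zookeeper2
      hosts:
          - "192.168.0.21"
          - "192.168.0.22"
          - "192.168.0.23"
      port: "2181"
    execution_procedure: ""                # leave empty to run once; or firststep / secondstep for two-phase migration
    onereplica_use_auxiliaryzookeeper: "1" # 1: temporary tables on one replica per shard; 0: on two replicas

    If the two-phase approach is used, the script is run once with execution_procedure set to firststep to start real-time synchronization, and again with secondstep after the source ClickHouse services are stopped, as described in Table 5.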
  3. Stop the ClickHouse service of the source cluster.
  4. Run the following command to migrate data and wait until the script execution is complete:

    ./clickhouse_migrate_data.sh -f yaml_file

    Enter the usernames and passwords of the source and destination clusters as prompted. In the command, yaml_file indicates the configuration file modified in 2 (example_config.yaml).

  5. After the script is executed successfully, perform the following steps to check whether the migrated data in the source cluster is consistent with that in the destination cluster based on the migration result logs:

    Log in to the ClickHouse client node in the destination cluster and go to the Client installation directory/ClickHouse/clickhouse_migration_tool/clickhouse-data-migration/comparison_result directory.

    Compare the following result file information to check the data consistency between the source cluster and the destination cluster:

    • source_cluster_table_info: statistics of data migrated from the source cluster
    • destination_cluster_table_info: statistics of data migrated to the destination cluster
    • compare_result_file.txt: data consistency comparison result before and after migration

    If the data is inconsistent before and after the migration, clear the data in the affected table in the destination cluster and migrate the data of that table separately or manually.

    In addition, you can log in to the ClickHouse databases of the source and destination clusters and manually check whether the numbers of table records and partitions are consistent, for example, by using the queries sketched below.
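    The following queries are an example of such a manual check; run them in both clusters with your own database and table names (database.table_1 is a placeholder) and compare the results:

    -- Total number of records in the table
    SELECT count() FROM database.table_1;
    -- Number of partitions that currently hold active data parts for the table
    SELECT count(DISTINCT partition) FROM system.parts WHERE database = 'database' AND table = 'table_1' AND active = 1;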

  6. Log in to FusionInsight Manager of the destination cluster and delete the ZooKeeper information added to clickhouse-config-customize in 2.

    Click Save. In the displayed dialog box, click OK.

  7. After data migration is complete, switch services to the destination ClickHouse cluster.
  8. Go to the Client installation directory/ClickHouse/clickhouse_migration_tool/clickhouse-data-migration and Client installation directory/ClickHouse/clickhouse_migration_tool/clickhouse-metadata-migration directories on the ClickHouse node in the destination cluster and run the following command in each directory:

    vi example_config.yaml

    Delete the password from the configuration file to prevent password leakage.

If service data migration fails, perform the following steps:

  1. Locate the failure cause. Specifically, check whether any parameters in the configuration file are incorrectly configured.
    • If yes, reconfigure the parameters and perform service data migration.
    • If no, go to 2.
  2. In the destination cluster, run the drop table table_name statement on the nodes where the migration failed to delete the data tables related to the failed table.
  3. Run the show create table table_name statement in the source cluster to query the table creation statement of the failed table, and create the table again in the destination cluster.
  4. Specify the tables that failed to be migrated in the databases and tables parameters of the service data migration configuration file (see Table 5) and run the service data migration command again. If the command still fails, contact O&M personnel.
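The following is a sketch of steps 2 and 3 with a hypothetical table name:

    -- In the destination cluster: delete the partially migrated table
    DROP TABLE database_1.table_2;

    -- In the source cluster: obtain the original table creation statement
    SHOW CREATE TABLE database_1.table_2;

    -- In the destination cluster: re-create the table by using the returned statement
    -- (with the ZooKeeper path and replica adjusted to the destination cluster, as during metadata migration)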