Updated on 2024-11-29 GMT+08:00

Using Reindex to Migrate Data

Scenario

Reindex is a cross-cluster data copying tool that uses the reindex API to migrate the data of one or more indexes. It copies and migrates data between two Elasticsearch clusters, which simplifies data reuse and security management.

  • Data in open source Elasticsearch 7.0.0 to 7.10.2 can be migrated to Elasticsearch clusters.
  • MRS Elasticsearch is based on open source Elasticsearch 7.10.2 and supports data migration between Elasticsearch clusters of kernel version 7.10.2.
  • If the source and target MRS clusters run different versions, they must agree on type support: either both clusters support the type field or neither does. By default, kernel 6.7.1 and earlier versions support type, and kernel 7.10.2 does not.

Prerequisites

  • The target Elasticsearch cluster is available.
  • The index of the source Elasticsearch cluster has been created.
  • The ports and network communication between the source and target Elasticsearch clusters are normal.
  • The upstream service has stopped writing to the source cluster so that data stays consistent after the migration. Read operations can continue as normal. After the migration is complete, switch to the target cluster to read and write data. If write operations are not stopped, data may be inconsistent.
  • If both the source and target clusters are in security mode, cross-cluster mutual trust has been configured.
  • The Elasticsearch client has been installed in a directory, for example, /opt/client.

Procedure

Modify cluster configurations.

  1. Add a remote whitelist to the target cluster.

    1. On Manager, choose Cluster > Name of the desired cluster > Elasticsearch > Configurations > All Configurations.
    2. Choose Elasticsearch > Self-Definition, modify the value of elasticsearch.customized.configs, and add the reindex.remote.whitelist parameter to add the hosts in the source cluster to the whitelist.

      The value is a comma-separated list of hosts in the source cluster, each in the host:port format.

      Example: 10.131.112.121:*,10.131.112.122:*,10.131.112.*:*,localhost:*
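
The whitelist value can also be assembled programmatically. The following Python sketch is illustrative only: build_remote_whitelist is a hypothetical helper, not part of the Reindex tool, and it assumes that a host given without a port should match any port (*), as in the example above.

```python
def build_remote_whitelist(hosts, port="*"):
    """Join source-cluster hosts into a reindex.remote.whitelist value.

    Hypothetical helper for illustration; not part of the Reindex tool.
    """
    entries = []
    for host in hosts:
        host = host.strip()
        if not host:
            continue  # ignore empty entries
        # Keep an entry that already carries a port; otherwise append the default.
        # (IPv6 literals are not handled in this sketch.)
        entries.append(host if ":" in host else f"{host}:{port}")
    return ",".join(entries)


if __name__ == "__main__":
    print(build_remote_whitelist(
        ["10.131.112.121", "10.131.112.122", "10.131.112.*", "localhost"]
    ))
```

The resulting string is what would be set as the value of reindex.remote.whitelist.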

  2. In the source and target clusters, temporarily turn off the setting that disables earlier Transport Layer Security (TLS) protocol versions at the secure transport layer. After the migration is complete, restore the configuration.

    1. If both the source and target clusters are in normal mode, skip this step and go to 3.
    2. On Manager, choose Cluster > Name of the desired cluster > Elasticsearch > Configurations > All Configurations.
    3. Choose Elasticsearch > Security and change the value of DISABLE_TLS_LOW_PROTOCOL to false.

  3. Configure the authentication mode parameter reindex.ssl.verification_mode in the target cluster.

    1. If both the source cluster and target cluster are in normal mode, skip this step and go to the next step.
    2. Otherwise, choose Elasticsearch > Self-Definition, modify the value of elasticsearch.customized.configs, add the reindex.ssl.verification_mode parameter, and set its value according to Table 1. Restore the configuration after the migration is complete.
      Table 1 Authentication mode configuration

      Mode of Source Cluster | Mode of Target Cluster | Authentication Mode Configuration
      Normal mode            | Normal mode            | No need to add the reindex.ssl.verification_mode parameter.
      Security mode          | Normal mode            | Set reindex.ssl.verification_mode to none.
      Normal mode            | Security mode          | Set reindex.ssl.verification_mode to none.
      Security mode          | Security mode          | Set reindex.ssl.verification_mode to none.
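
The rule in Table 1 reduces to a single condition: the parameter is needed unless both clusters are in normal mode. A minimal Python sketch of that lookup follows; reindex_ssl_settings is a hypothetical name used only for illustration.

```python
def reindex_ssl_settings(source_secure, target_secure):
    """Return the settings to add on the target cluster according to Table 1.

    Hypothetical helper: an empty dict means the parameter is not needed.
    """
    if not source_secure and not target_secure:
        # Both clusters are in normal mode: do not add the parameter.
        return {}
    # Any combination that involves a security-mode cluster uses "none".
    return {"reindex.ssl.verification_mode": "none"}


if __name__ == "__main__":
    for src in (False, True):
        for dst in (False, True):
            print(src, dst, reindex_ssl_settings(src, dst))
```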

  4. Save the configuration and restart the Elasticsearch service for the configuration to take effect.

Modify configuration files.

  1. Check whether either the source cluster or the target cluster is in security mode. On Manager, choose Cluster > Name of the desired cluster > Services > Elasticsearch > Configurations and search for the ELASTICSEARCH_SECURITY_ENABLE parameter. If the parameter is found and its value is true, security mode is enabled.

    • If yes, go to 2.
    • If no, go to 3.

  2. Upload the user authentication configuration file according to Table 2. If the source cluster is in security mode and the target cluster is in normal mode, upload the user authentication configuration file of the source cluster. If the source cluster is in normal mode and the target cluster is in security mode, upload the user authentication configuration file of the target cluster.

    1. Create a role, for example, ES_Role. For details, see Authentication Based on Users and Roles.
    2. Create a user, initialize the password, and change the password.
      1. On Manager, choose System > Permission > User > Create.
      2. Enter the username, for example, esuser. (If both the source and target clusters are in security mode, ensure that the usernames of both clusters are the same.) Add user esuser to the elasticsearch and supergroup user groups, set the primary group to supergroup, and bind the ES_Role role to the user to obtain related permissions.
      3. Choose System > Permission > User. In the displayed page, choose More > Initialize Password in the Operation column of the new user. Change the password for the first login after the password is initialized.
    3. Choose System > Permission > User. Locate the newly created user and choose More > Download Authentication Credential. Then select the cluster information, and click OK to download the file.
    4. Upload the user.keytab and krb5.conf files obtained after the decompression to the specified directory of the Reindex migration tool. For details, see Table 2.
      Table 2 Path for storing authentication files

      Mode of Source Cluster | Mode of Target Cluster | Directory for User Authentication File
      Normal mode | Normal mode | N/A
      Security mode | Normal mode | (1) Create the remote directory in the Elasticsearch/tools/elasticsearch-data2es/es2es_reindex/conf directory of the Elasticsearch client. Example: mkdir /opt/client/Elasticsearch/tools/elasticsearch-data2es/es2es_reindex/conf/remote (2) Upload the user.keytab and krb5.conf files of the source cluster to the remote directory. (3) Create the directory specified by keytabPath in Table 3 and upload the user.keytab and krb5.conf files of the source cluster to it.
      Normal mode | Security mode | Upload the user.keytab and krb5.conf files of the target cluster to the Elasticsearch/tools/elasticsearch-data2es/es2es_reindex/conf directory on the Elasticsearch client.
      Security mode | Security mode | (1) Configure cross-cluster mutual trust between the target cluster and the source cluster. (2) Upload the user.keytab and krb5.conf files of the target cluster to the Elasticsearch/tools/elasticsearch-data2es/es2es_reindex/conf directory on the Elasticsearch client.

      • All third-party Elasticsearch clusters are in normal mode.
      • If the target cluster is in security mode and other users need to read or write index data in the target cluster, assign the corresponding permissions to those users. For details about the permissions, see Authentication Based on Users and Roles.

  3. Modify the espara.properties parameter configuration file in the conf directory of the tool package. The command is as follows:

    cd /opt/client/Elasticsearch/tools/elasticsearch-data2es/es2es_reindex/conf

    vi espara.properties

    For details about parameter configuration, see Table 3. The following is a configuration example:

    #################### destination: FusionInsight Elasticsearch #############################
    esServerHost=ip1:port1,ip2:port2,ip3:port3
    # unit:s
    socketTimeout=30
    connectTimeout=30
    connectRequestTimeout=30
    # Whether FusionInsight Elasticsearch cluster is in secure mode. true: secure, false: normal
    isSecureMode=true
    # FI ES cluster authentication user
    principal=esuser@<System domain name>
    
    
    ###################### source: old cluster #####################################
    # Old cluster host must be '[scheme]://[host]:[port]', e.g. 'http://10.10.10.10:9200'
    oldClusterHost=http://ip:port
    # Whether old cluster is FusionInsight Elasticsearch secure mode. true: secure, false: normal. NOTE: Third-party Elasticsearch is false
    isOldSecureMode=false
    # Old cluster authentication user, configured only when isOldSecureMode is true
    oldPrincipal=
    # Old cluster needs Basic Authentication
    oldClusterUser=
    oldClusterPass=
    # unit:s
    oldCluster_socketTimeout=30
    oldCluster_connectTimeout=30
    oldCluster_connectRequestTimeout=30
    
    
    ###################### other config ######################################
    # Reindexing from a remote server uses an on-heap buffer that defaults to a maximum size of 100mb.
    # If the remote index includes very large documents you'll need to use a smaller batch size.
    batchSize=1000
    # indices support all and specified index
    # all: all indexes of old cluster in open state
    # specified index: support batches, separated by commas. e.g. index1,index2,index3
    indices=index1,index2,index3
    # Whether to create an index automatically.
    # true: create an index automatically, copy settings and mapping of the old cluster index.
    # false: need to create an index in advance and specify settings and mapping
    copySettingAndMapping=true
    # Whether exists type
    isTypeExist=false
    # Number of indexes for batch data migration
    threadNum=24
    
    # time field of indices.
    timeField=
    beginTime=
    endTime=
    
    # secure keytab path. only config when source is Elasticsearch secure mode and target is Elasticsearch  normal.
    # need to put the keytab files in target Elasticsearch (esServerHost) server.
    keytabPath=/home/omm/conf

    Table 3 espara.properties parameter description

    Parameter | Default Value | Description
    esServerHost | ip1:port1,ip2:port2,ip3:port3 | Specifies the Elasticsearch instances of the target cluster. The value is a comma-separated list of entries in the IP address:Port number format. Configuring multiple entries facilitates load balancing. The IP address is the service plane IP address of any DataNode in the Elasticsearch cluster, and the port number is the external HTTP(S) port of that node. To view the instance configuration, log in to Manager, choose Cluster > Name of the desired cluster > Services > Elasticsearch > Configurations, and search for the INSTANCE_SERVER_PORT_LIST parameter. NOTE: Do not obtain this parameter value from the EsMaster instances.
    socketTimeout | 30 | Specifies the socket timeout of the target cluster, in seconds.
    connectTimeout | 30 | Specifies the connection timeout of the target cluster, in seconds.
    connectRequestTimeout | 30 | Specifies the connection request timeout of the target cluster, in seconds.
    isSecureMode | false | Specifies whether the target Elasticsearch cluster is in security mode. true indicates security mode, and false indicates normal mode.
    principal | N/A | Specifies the authentication user of the target Elasticsearch cluster. Set this parameter if isSecureMode is set to true. The value is in the format username@domain name of the target cluster.
    oldClusterHost | http://ip1:port1 | Specifies the instance of the source Elasticsearch cluster to connect to. The value is in the [scheme]://[host]:[port] format. host is the service plane IP address of any DataNode in the source Elasticsearch cluster, and port is the external HTTP(S) port of that node, for example, http://10.10.10.10:9200.
    isOldSecureMode | false | Specifies whether the source cluster is in security mode. true indicates security mode, and false indicates normal mode. For a third-party Elasticsearch cluster, the value is false.
    oldPrincipal | N/A | Specifies the authentication user of the source cluster. Set this parameter if isOldSecureMode is set to true. The value is in the format username@domain name of the source cluster.
    oldClusterUser | N/A | Specifies the basic authentication username. Set this parameter if basic authentication is required for the source cluster.
    oldClusterPass | N/A | Specifies the password of the basic authentication user. Set this parameter if basic authentication is required for the source cluster.
    oldCluster_socketTimeout | 30 | Specifies the socket timeout of the source cluster, in seconds.
    oldCluster_connectTimeout | 30 | Specifies the connection timeout of the source cluster, in seconds.
    oldCluster_connectRequestTimeout | 30 | Specifies the connection request timeout of the source cluster, in seconds.
    batchSize | 1000 | Specifies the number of documents per batch. Reindexing from the source cluster uses an on-heap buffer with a default maximum size of 100 MB. If the source index contains very large documents, use a smaller batch size.
    indices | index1,index2 | Specifies the indexes to migrate. Set it to all to migrate all open indexes of the source cluster, or to one or more index names separated by commas (,), for example, index1,index2,index3. Batch migration is supported, but you are advised to migrate indexes one by one.
    copySettingAndMapping | true | Specifies whether to create indexes automatically. If the value is true, indexes are created automatically and the settings and mapping of the source cluster indexes are copied: the number of primary shards and the analysis settings are copied, the number of replica shards is set to 0 to accelerate indexing, and the refresh interval is set to -1. If the value is false, create the index in advance and specify its settings and mapping.
    isTypeExist | false | Specifies whether the Elasticsearch version of the source cluster uses the type field. If yes, set this parameter to true. If no, set it to false.
    threadNum | 24 | Specifies the size of the thread pool, that is, the number of indexes migrated at the same time.
    timeField | N/A | Specifies the time field in an index. Only data whose time field falls within the range [beginTime, endTime) is migrated. If this parameter is not set, all data of the index is migrated. NOTE: The formats of beginTime and endTime must be the same as the format of the time field in the index. Otherwise, the migration fails. If multiple indexes are configured, all of them must have this time field and use the same time format. Otherwise, the migration fails.
    beginTime | N/A | Specifies the start time (inclusive) of the time field for a migration index.
    endTime | N/A | Specifies the end time (exclusive) of the time field for a migration index.
    keytabPath | /home/omm/conf | Specifies the authentication file path. Upload the user.keytab and krb5.conf files of the source cluster user to this path as user omm. NOTE: (1) Set this parameter only when data is migrated from an Elasticsearch cluster in security mode to an Elasticsearch cluster in normal mode. (2) Create this directory on all servers of the target cluster listed in esServerHost and upload the user.keytab and krb5.conf files of the source cluster user to it.

    You can view the cluster domain name on the cluster page. Log in to Manager, choose System > Permission > Domain and Mutual Trust, and view the value of Local Domain.
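
Several parameters in Table 3 depend on one another (for example, principal is required only when isSecureMode is true). The following Python sketch shows one way to check those dependencies before starting a migration. It is an illustration only: parse_properties and check_espara are hypothetical helpers, the parser handles just simple key=value lines, and the rules are limited to those stated in Table 3.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_espara(props):
    """Return a list of problems found in the parsed espara.properties values."""
    problems = []
    secure = props.get("isSecureMode", "false") == "true"
    old_secure = props.get("isOldSecureMode", "false") == "true"
    if secure and not props.get("principal"):
        problems.append("principal must be set when isSecureMode is true")
    if old_secure and not props.get("oldPrincipal"):
        problems.append("oldPrincipal must be set when isOldSecureMode is true")
    if old_secure and not secure and not props.get("keytabPath"):
        problems.append("keytabPath must be set for a security-mode source "
                        "and a normal-mode target")
    host = props.get("oldClusterHost", "")
    if not host.startswith(("http://", "https://")):
        problems.append("oldClusterHost must use the [scheme]://[host]:[port] format")
    return problems


if __name__ == "__main__":
    sample = """
    # illustrative values only
    isSecureMode=true
    principal=
    oldClusterHost=10.10.10.10:9200
    isOldSecureMode=false
    """
    for problem in check_espara(parse_properties(sample)):
        print(problem)
```

The check is deliberately conservative: it reports only violations of rules that the parameter descriptions state explicitly and does not contact either cluster.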

Import data from the source cluster to target cluster.

  1. Run the following command to configure environment variables:

    source /opt/client/bigdata_env

  2. Run the following commands to migrate data:

    cd /opt/client/Elasticsearch/tools/elasticsearch-data2es/es2es_reindex

    java -cp ./../lib/*:./conf/ com.*.fusioninsight.es.tool.reindex.ReindexTool

  3. Run the curl command to view the index and check whether the data is imported.

    • Security mode:

      curl -XGET --tlsv1.2 --negotiate -k -u : "https://IP:port/indexname/_search?pretty"

      curl -XGET --tlsv1.2 --negotiate -k -u : "https://IP:port/_cat/indices?v"| grep indexname

    • Normal mode

      curl -XGET "http://IP:port/indexname/_search?pretty"

      curl -XGET "http://IP:port/_cat/indices?v"| grep indexname

    For details about how to use the curl command, see Running curl Commands in Linux.
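
To compare document counts without reading the table by eye, the text returned by _cat/indices?v can be parsed. The Python sketch below is a hypothetical helper, not part of the tool. It assumes the output still contains the header row (so run the curl command without grep), and the sample text is made up for illustration.

```python
def doc_counts(cat_output):
    """Map index name to docs.count from the text of `_cat/indices?v`."""
    lines = [ln for ln in cat_output.splitlines() if ln.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    index_col = header.index("index")
    count_col = header.index("docs.count")
    counts = {}
    for line in lines[1:]:
        cols = line.split()
        if len(cols) != len(header):
            continue  # rows with empty columns (for example, closed indexes)
        counts[cols[index_col]] = int(cols[count_col])
    return counts


def mismatched(source_counts, target_counts):
    """Return source indexes whose document count differs on the target."""
    return sorted(name for name, count in source_counts.items()
                  if target_counts.get(name) != count)


if __name__ == "__main__":
    # Made-up sample output for illustration.
    sample = (
        "health status index  uuid   pri rep docs.count docs.deleted store.size pri.store.size\n"
        "green  open   index1 abc123   5   0       1200            0      1.1mb          1.1mb\n"
    )
    print(doc_counts(sample))
```

Running doc_counts on the output of both clusters and passing the results to mismatched lists the indexes that still need attention.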

  4. If the value of docs.count is 0, the index has not yet been migrated. Check the migration status.

    • Security mode:

      curl -XGET --tlsv1.2 --negotiate -k -u : 'https://IP:port/_tasks?detailed=true&actions=*reindex*&pretty'

    • Normal mode

      curl -XGET 'http://IP:port/_tasks?detailed=true&actions=*reindex*&pretty'

    The response is as follows:

    {
    "nodes" : {
    ……
          "tasks" : {
            "R9QEdSqcQkSdkqGzinrvAA:55049" : {
              "node" : "R9QEdSqcQkSdkqGzinrvAA",
              "id" : 55049,
              "type" : "transport",
              "action" : "indices:data/write/reindex",
              "status" : {
                "total" : 59948183,
                "updated" : 0,
                "created" : 8950000,
                "deleted" : 0,
                "batches" : 895,
                "version_conflicts" : 0,
                "noops" : 0,
                "retries" : {
                  "bulk" : 0,
                  "search" : 0
                },
    ......
     }
    }

    In the preceding command output, R9QEdSqcQkSdkqGzinrvAA:55049 indicates the task ID.

    To view the task ID of the executed index, perform the following steps:

    Search the log file logs/estool.log for the string made up of the index name, a space, and reindex taskID:. Example: cat estool.log | grep 'index1 reindex taskID:'

    The command output is index1 reindex taskID: R9QEdSqcQkSdkqGzinrvAA:68993, where R9QEdSqcQkSdkqGzinrvAA:68993 indicates the task ID.
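
The same lookup can be scripted. The Python sketch below extracts task IDs from log text using the line format shown above. find_task_ids is a hypothetical helper, and the only assumption about estool.log is that each relevant line contains "index name + space + reindex taskID:" followed by the ID.

```python
import re


def find_task_ids(log_text, index_name):
    """Return all task IDs logged for index_name, in order of appearance."""
    # Same match as the grep example: "<index> reindex taskID:" then the ID.
    pattern = re.compile(re.escape(index_name) + r" reindex taskID:\s*(\S+)")
    return [match.group(1) for match in pattern.finditer(log_text)]


if __name__ == "__main__":
    log = "index1 reindex taskID: R9QEdSqcQkSdkqGzinrvAA:68993\n"
    print(find_task_ids(log, "index1"))
```

Like the grep command, this matches on a substring, so an index whose name ends with the searched name would also match.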

  5. Query the task by its ID.

    • Security mode:

      curl -XGET --tlsv1.2 --negotiate -k -u : 'https://IP:port/_tasks/taskID?pretty'

    • Normal mode

      curl -XGET 'http://IP:port/_tasks/taskID?pretty'

    The response is as follows:

    {
    "completed" : true,
    "task" : {
        "node" : "R9QEdSqcQkSdkqGzinrvAA",
        "id" : 55049,
        "type" : "transport",
        "action" : "indices:data/write/reindex",
        "status" : {
          "total" : 59948183,
          "updated" : 0,
          "created" : 59948183,
          "deleted" : 0,
          "batches" : 5998,
          "version_conflicts" : 0,
          "noops" : 0,
          "retries" : {
            "bulk" : 0,
            "search" : 0
          },
          "throttled_millis" : 0,
          "requests_per_second" : -1.0,
          "throttled_until_millis" : 0
        },
        "description" : "reindex from [host=10.6.6.1 port=9200 query={\n  \"match_all\" : { }\n}][test1][order_list] to [test1]",
        "start_time_in_millis" : 1546078390223,
        "running_time_in_nanos" : 2252468434895,
        "cancellable" : true
     },
    ......
    }

    If the value of completed is true, the index migration is complete.

    • total: total number of documents to be migrated for the index.
    • created: number of documents created in the target index.
    • running_time_in_nanos: migration duration, in nanoseconds.

    In the Elasticsearch/tools/elasticsearch-data2es/es2es_reindex/logs directory of the Elasticsearch client, you can view the log information about the import process.
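
The fields described above are enough to derive a progress percentage and the elapsed time. The following Python sketch does this for a task response. summarize_task is a hypothetical helper, and counting created plus updated documents as processed is an assumption of this sketch (the description above mentions only created).

```python
import json


def summarize_task(response):
    """Summarize progress from a parsed `_tasks/<taskID>` response."""
    task = response["task"]
    status = task["status"]
    total = status["total"]
    # Assumption: documents already present on the target count as "updated".
    processed = status["created"] + status["updated"]
    percent = 100.0 * processed / total if total else 0.0
    return {
        "completed": bool(response.get("completed", False)),
        "processed": processed,
        "total": total,
        "percent": percent,
        # running_time_in_nanos is reported in nanoseconds.
        "elapsed_seconds": task["running_time_in_nanos"] / 1e9,
    }


if __name__ == "__main__":
    # Values taken from the sample response above, trimmed to the fields used.
    raw = """{"completed": true, "task": {"status": {"total": 59948183,
             "updated": 0, "created": 59948183},
             "running_time_in_nanos": 2252468434895}}"""
    print(summarize_task(json.loads(raw)))
```

For the sample response, the elapsed time works out to roughly 2252 seconds, or about 37.5 minutes.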

  6. Cancel a task.

    • Security mode:

      curl -XPOST --tlsv1.2 --negotiate -k -u : 'https://IP:port/_tasks/taskID/_cancel?pretty'

    • Normal mode

      curl -XPOST 'http://IP:port/_tasks/taskID/_cancel?pretty'

  7. If the index data volume of the target cluster is the same as that of the source cluster, the index has been migrated. Set the number of replicas and the refresh interval.

    For example, set the number of index replicas to 1 and the refresh interval to 60s.

    • Security mode:
      curl -XPUT --tlsv1.2 --negotiate -k -u : 'https://IP:port/indexName/_settings' -H 'Content-Type: application/json' -d' {
      "number_of_replicas" : 1,
      "refresh_interval" : "60s"
      }'
    • Normal mode
      curl -XPUT 'http://IP:port/indexName/_settings' -H 'Content-Type: application/json' -d' {
      "number_of_replicas" : 1,
      "refresh_interval" : "60s"
      }'

  8. Check the index status of the target cluster. If the status of all indexes is green, replica data synchronization is complete.

    • Security mode:

      curl -XGET --negotiate -k -u : 'https://ip:port/_cat/indices?v'

    • Normal mode

      curl -XGET 'http://ip:port/_cat/indices?v'
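
When many indexes are involved, the health check can be automated. The Python sketch below assumes the cat API is called with the standard format=json parameter (for example, _cat/indices?format=json) instead of the ?v form used above, so that the response is a JSON array. not_green is a hypothetical helper and the sample data is made up.

```python
import json


def not_green(cat_indices_json):
    """Return the names of indexes whose health is not green."""
    rows = json.loads(cat_indices_json)
    # Rows without a health value (for example, closed indexes) are reported too.
    return sorted(row["index"] for row in rows if row.get("health") != "green")


if __name__ == "__main__":
    # Made-up sample of a `_cat/indices?format=json` response.
    sample = json.dumps([
        {"health": "green", "status": "open", "index": "index1"},
        {"health": "yellow", "status": "open", "index": "index2"},
    ])
    print(not_green(sample))
```

An empty result means that every listed index is green, that is, replica synchronization is complete.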

  9. Use the .tasks index to view details about an index migration task. After confirming that the cluster data has been migrated correctly, delete the index.

    • Security mode:

      curl -XDELETE --negotiate -k -u : 'https://ip:port/.tasks?pretty'

    • Normal mode

      curl -XDELETE 'http://ip:port/.tasks?pretty'

      Exercise caution. Do not mistakenly delete other indexes.

  10. After all data is migrated, delete the reindex.ssl.verification_mode configuration from the target cluster.

    1. If both the source and target clusters are in normal mode, you do not need to delete the configuration. Go to the next step.
    2. On Manager, choose Cluster > Name of the desired cluster > Elasticsearch > Configurations > All Configurations.
    3. Choose Elasticsearch > Customization and delete the reindex.ssl.verification_mode parameter from the value of elasticsearch.customized.configs.
    4. Save the configuration and restart the Elasticsearch services of the source and target clusters for the configuration to take effect.

  11. After all data is migrated, re-enable the setting that disables earlier TLS versions at the secure transport layer in the source and target clusters.

    1. If both the source and target clusters are in normal mode, skip this step.
    2. On Manager, choose Cluster > Name of the desired cluster > Elasticsearch > Configurations > All Configurations.
    3. Choose Elasticsearch > Security and change the value of DISABLE_TLS_LOW_PROTOCOL to true.
    4. Save the configuration and restart the Elasticsearch service for the configuration to take effect.