Updated on 2024-11-29 GMT+08:00

Using Scroll to Migrate Data

Scenario

Scroll is a tool for copying data across clusters. It migrates one index at a time by traversing the source index with the scroll API and writing the results through the bulk API. Use it to copy and migrate data between two Elasticsearch clusters for data reuse and security management.

  • Scroll can migrate data from source clusters running Elasticsearch 7.0.0 through 7.10.2 to target Elasticsearch clusters.
  • MRS Elasticsearch is based on open-source version 7.10.2 and supports data migration between Elasticsearch clusters of the same version.
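
The rolling traversal mentioned above rests on the Elasticsearch scroll API. The following is a minimal sketch of the requests involved, not the tool's actual implementation. The SRC endpoint, the function names, and the normal-mode (plain HTTP) access are assumptions made for illustration, and the step that would turn each batch into a bulk request is only marked with a comment.

```shell
#!/bin/sh
# Sketch only: the scroll requests that a rolling traversal is built on.
# SRC is a hypothetical normal-mode source endpoint; replace IP:port before use.
SRC="${SRC:-http://IP:port}"

# Body of the first request: opens the search context and returns the first batch.
# Sorting by _doc is the cheapest order for a full traversal.
scroll_init_body() {   # $1 = batch size
  printf '{"size":%s,"sort":["_doc"]}' "$1"
}

# Body of every follow-up request: fetches the next batch for a scroll ID.
scroll_next_body() {   # $1 = keep-alive (for example 10m), $2 = scroll ID
  printf '{"scroll":"%s","scroll_id":"%s"}' "$1" "$2"
}

# Extract the _scroll_id value from a search response read on stdin.
extract_scroll_id() {
  sed -n 's/.*"_scroll_id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Walk one index batch by batch (not called here; needs a reachable cluster).
scroll_index() {       # $1 = index name, $2 = batch size, $3 = keep-alive
  resp=$(curl -s -XPOST "$SRC/$1/_search?scroll=$3" \
    -H 'Content-Type: application/json' -d "$(scroll_init_body "$2")")
  # Continue while the response still contains hits (each hit carries an _id).
  while printf '%s' "$resp" | grep -q '"_id"'; do
    # A real migration would convert the hits in $resp into a _bulk request here.
    sid=$(printf '%s' "$resp" | extract_scroll_id)
    resp=$(curl -s -XPOST "$SRC/_search/scroll" \
      -H 'Content-Type: application/json' -d "$(scroll_next_body "$3" "$sid")")
  done
}
```

For example, scroll_index myindex 10000 10m would traverse myindex in batches of 10,000 documents with a 10-minute keep-alive, which mirrors the batchSize and scrollTime values used later in this section.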

Prerequisites

  • The target Elasticsearch cluster is available.
  • The index of the source Elasticsearch cluster has been created.
  • Ports and network connectivity between the source and target Elasticsearch clusters are normal.
  • The upstream service has stopped writing to the source cluster so that data stays consistent during the migration. Read operations can continue as usual. After the migration is complete, switch reads and writes to the target cluster. If writes are not stopped, the migrated data may be inconsistent.
  • If both the source and target clusters are in security mode, cross-cluster mutual trust needs to be configured.
  • The Elasticsearch client has been installed in a directory, for example, /opt/client.
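
The port and network prerequisite can be spot-checked from the node where the client is installed. The following is a minimal sketch, assuming the nc utility is available; the function names and the sample addresses are hypothetical. It accepts the same comma-separated ip:port list format that the migration tool uses for cluster hosts.

```shell
#!/bin/sh
# Split a comma-separated "ip:port" list into one "ip port" pair per line.
split_hosts() {   # $1 = list such as ip1:port1,ip2:port2
  printf '%s\n' "$1" | tr ',' '\n' | while IFS= read -r entry; do
    [ -n "$entry" ] || continue
    printf '%s %s\n' "${entry%:*}" "${entry##*:}"
  done
}

# Probe each pair with nc and report the result (assumes nc is installed).
# Not called here because it needs real addresses.
check_hosts() {   # $1 = list such as ip1:port1,ip2:port2
  split_hosts "$1" | while read -r host port; do
    if nc -z -w 3 "$host" "$port" >/dev/null 2>&1; then
      echo "OK   $host:$port"
    else
      echo "FAIL $host:$port"
    fi
  done
}
```

Running check_hosts once with the source cluster list and once with the target cluster list shows which endpoints are unreachable. A successful TCP probe only shows that the port is open; it does not verify authentication.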

Procedure

Modify configuration files.

  1. On Manager, choose Cluster > Name of the desired cluster > Services > Elasticsearch > Configurations and search for the ELASTICSEARCH_SECURITY_ENABLE parameter. Security mode is enabled only if the parameter is found and its value is true.

    • If security mode is enabled, go to 2.
    • If security mode is not enabled, go to 3.

  2. Upload the user authentication configuration file.

    1. Create a role, for example, ES_Role. For details, see Authentication Based on Users and Roles.
    2. Create a user.
      1. On Manager, choose System > Permission > User > Create.
      2. Enter a username, for example, test. Add user test to the elasticsearch and supergroup user groups and set the primary group to supergroup.
      3. Click OK.
    3. Choose System > Permission > User. Locate the newly created user and choose More > Download Authentication Credential. Then select the cluster information, and click OK to download the file.
    4. Upload the user.keytab and krb5.conf files obtained after the decompression to the specified directory of the Scroll migration tool. For details, see Table 1.
      Table 1 Path for storing authentication files

      Mode of Source Cluster | Mode of Target Cluster | Directory for User Authentication File
      Normal mode | Normal mode | None
      Security mode | Normal mode | Create the remote directory in the Elasticsearch/tools/elasticsearch-data2es/es2es_scroll/conf directory of the Elasticsearch client (example: mkdir /opt/client/Elasticsearch/tools/elasticsearch-data2es/es2es_scroll/conf/remote), then upload the user.keytab and krb5.conf files of the source cluster to the remote directory.
      Normal mode | Security mode | Upload the user.keytab and krb5.conf files of the target cluster to the Elasticsearch/tools/elasticsearch-data2es/es2es_scroll/conf directory on the Elasticsearch client.
      Security mode | Security mode | Configure cross-cluster mutual trust between the target cluster and the source cluster, then upload the user.keytab and krb5.conf files of the target cluster to the Elasticsearch/tools/elasticsearch-data2es/es2es_scroll/conf directory on the Elasticsearch client.

      • All third-party Elasticsearch clusters are in normal mode.
      • If the target cluster is in security mode and other users need to read or write index data from the target cluster, corresponding permissions should be assigned to the users. For details about the permissions, see Authentication Based on Users and Roles.

  3. Run the following commands to open the espara.properties configuration file in the conf directory of the tool package, and then modify its parameters:

    cd /opt/client/Elasticsearch/tools/elasticsearch-data2es/es2es_scroll/conf

    vi espara.properties

    For details about parameter configuration, see Table 2. The following is a configuration example:

    #################### destination: FusionInsight Elasticsearch #############################
    esServerHost=ip1:port1,ip2:port2,ip3:port3
    # unit:s
    socketTimeout=30
    connectTimeout=30
    connectRequestTimeout=30
    maxRetryTimeoutMillis=60
    # Whether FI ES cluster is in secure mode. true: secure, false: normal
    isSecureMode=false
    # FI ES cluster authentication user
    principal=
    
    ###################### source: old cluster #####################################
    oldClusterHost=ip1:port1,ip2:port2,ip3:port3
    # Whether old cluster is FI ES secure mode. true: secure, false: normal. NOTE: Third-party ES is false
    isOldSecureMode=false
    # Old cluster authentication user, configured only when isOldSecureMode is true
    oldPrincipal=
    # Old cluster needs Basic Authentication
    oldClusterUser=
    oldClusterPass=
    # unit:s
    oldCluster_socketTimeout=30
    oldCluster_connectTimeout=30
    oldCluster_connectRequestTimeout=30
    oldCluster_maxRetryTimeoutMillis=60
    
    
    ###################### other config ######################################
    # number of documents at a time: ie "size" in the scroll request (10000)
    batchSize=10000
    # The scroll parameter tells Elasticsearch to keep the search context open, unit: minute
    scrollTime=10
    # index to migrate
    index=myindex
    # Whether to create an index automatically.
    # true: create an index automatically, copy settings and mapping of the old cluster index.
    # false: need to create an index in advance and specify settings and mapping
    copySettingAndMapping=true
    # whether to copy _id of index
    copyId=false
    # thread pool size
    threadNum=24
    
    # time field of indices.
    timeField=
    beginTime=
    endTime=

    Table 2 espara.properties parameter description

    Parameter | Default Value | Description
    esServerHost | ip1:port1,ip2:port2,ip3:port3 | Specifies the Elasticsearch instances of the target cluster, in the format IP address:Port number. Separate multiple entries with commas (,); configuring several entries facilitates load balancing. The IP address is the service plane IP address of any DataNode in the Elasticsearch cluster, and the port number is the external HTTP(S) port number of the node. To view the instance configuration, log in to Manager, choose Cluster > Name of the desired cluster > Services > Elasticsearch > Configurations, and search for the INSTANCE_SERVER_PORT_LIST parameter. NOTE: Do not obtain this parameter value from the EsMaster instances.
    socketTimeout | 30 | Specifies the socket timeout of the target cluster, in seconds.
    connectTimeout | 30 | Specifies the connection timeout of the target cluster, in seconds.
    connectRequestTimeout | 30 | Specifies the connection request timeout of the target cluster, in seconds.
    maxRetryTimeoutMillis | 60 | Specifies the retry timeout of the target cluster, in seconds.
    isSecureMode | false | Specifies whether the target Elasticsearch cluster is in security mode. true indicates security mode, and false indicates normal mode.
    principal | N/A | Specifies the authentication user of the target Elasticsearch cluster. Set this parameter if isSecureMode is set to true. The value is in the format username@domain name of the target cluster.
    oldClusterHost | ip1:port1,ip2:port2,ip3:port3 | Specifies the Elasticsearch instances of the source cluster, in the format IP address:Port number. Separate multiple entries with commas (,). The IP address is the service plane IP address of any DataNode in the source Elasticsearch cluster, and the port number is the external HTTP(S) port number of the node.
    isOldSecureMode | false | Specifies whether the source cluster is in security mode. true indicates security mode, and false indicates normal mode. For a third-party Elasticsearch cluster, set this parameter to false.
    oldPrincipal | N/A | Specifies the authentication user of the source cluster. Set this parameter if isOldSecureMode is set to true. The value is in the format username@domain name of the source cluster.
    oldClusterUser | N/A | Specifies the basic authentication username. Set this parameter if basic authentication is required for the source cluster. NOTE: Basic authentication is required if the source cluster is an open-source cluster in security mode.
    oldClusterPass | N/A | Specifies the password of the basic authentication user. Set this parameter if basic authentication is required for the source cluster. NOTE: Basic authentication is required if the source cluster is an open-source cluster in security mode.
    oldCluster_socketTimeout | 30 | Specifies the socket timeout of the source cluster, in seconds.
    oldCluster_connectTimeout | 30 | Specifies the connection timeout of the source cluster, in seconds.
    oldCluster_connectRequestTimeout | 30 | Specifies the connection request timeout of the source cluster, in seconds.
    oldCluster_maxRetryTimeoutMillis | 60 | Specifies the retry timeout of the source cluster, in seconds.
    batchSize | 10000 | Specifies the number of documents fetched by each scroll request (the size value of the request). Set a smaller value if individual documents are large. The recommended size of one batch is 5 MB to 15 MB. By default, one batch cannot exceed 100 MB.
    scrollTime | 10 | Specifies how long the search context of a scrolling query is kept open, in minutes.
    index | myindex | Specifies the name of the index to migrate. Only one index can be configured.
    copySettingAndMapping | true | Specifies whether to create the index automatically. If the value is true, the index is created automatically and the settings and mapping of the source index are copied: the number of primary shards and the analysis settings are copied, the number of replica shards is set to 0 to speed up indexing, and the refresh interval is set to -1. If the value is false, create the index in advance and specify its settings and mapping.
    copyId | false | Specifies whether to copy the _id metadata field of the index. If _id does not affect service queries, set this parameter to false to improve performance. However, importing the same index repeatedly may then produce duplicate data.
    threadNum | 24 | Specifies the number of threads in the thread pool.
    timeField | N/A | Specifies the time field in the index. Only data whose time field falls within [beginTime, endTime) is migrated. If this parameter is not set, all data of the index is migrated. NOTE: The formats of beginTime and endTime must be the same as the format of the time field in the index. Otherwise, the migration fails.
    beginTime | N/A | Specifies the start time (inclusive) of the time field range to migrate.
    endTime | N/A | Specifies the end time (exclusive) of the time field range to migrate.

    You can view the cluster domain name on the cluster page. Log in to Manager, choose System > Permission > Domain and Mutual Trust, and view the value of Local Domain.
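
When the same file has to be adjusted for several indexes, editing it by hand in vi is error-prone. The following is a minimal sketch of a helper that sets one key in espara.properties; the set_prop function is hypothetical and not part of the tool package, and it assumes that values contain no |, &, or backslash characters.

```shell
#!/bin/sh
# Set key=value in a properties file: replace the existing line for the key,
# or append a new line if the key is absent.
set_prop() {   # $1 = file, $2 = key, $3 = value
  if grep -q "^$2=" "$1"; then
    sed "s|^$2=.*|$2=$3|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
  else
    printf '%s=%s\n' "$2" "$3" >> "$1"
  fi
}

# Example use (commented out; the index and field names are placeholders):
# conf=/opt/client/Elasticsearch/tools/elasticsearch-data2es/es2es_scroll/conf/espara.properties
# set_prop "$conf" index myindex
# set_prop "$conf" timeField created_at
```

Because the tool migrates one index per run, a wrapper script could call set_prop with a new index value before each run. Back up espara.properties first, since the helper rewrites the file in place.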

Import data from the source cluster to the target cluster.

  1. Run the following command to configure environment variables:

    source /opt/client/bigdata_env

  2. Run the following commands to migrate data:

    cd /opt/client/Elasticsearch/tools/elasticsearch-data2es/es2es_scroll

    java -cp ./../lib/*:./conf/ com.*.fusioninsight.es.tool.scroll.DataMigrate

  3. Run the curl command to check whether the data is imported.

    • If the security mode is used, run the following commands:

      curl -XGET --tlsv1.2 --negotiate -k -u : "https://IP:port/indexname/_search?pretty"

      curl -XGET --tlsv1.2 --negotiate -k -u : "https://IP:port/_cat/indices?v"| grep indexname

    • If the normal mode is used, run the following commands:

      curl -XGET "http://IP:port/indexname/_search?pretty"

      curl -XGET "http://IP:port/_cat/indices?v"| grep indexname

    For details about how to use the curl command, see Running curl Commands in Linux.

  4. After data migration is complete, a json directory is generated in the tool package directory if some index data fails to be imported. Run the script in the sbin directory to import the failed data again.

    cd /opt/client/Elasticsearch/tools/elasticsearch-data2es/es2es_scroll

    sh ./sbin/input.sh

    In the Elasticsearch/tools/elasticsearch-data2es/es2es_scroll/logs directory of the Elasticsearch client, you can view the log information about the import process.

  5. After the migration is complete, set the number of index replicas and the refresh interval of the index.

    If the index data volume of the target cluster is the same as that of the source cluster, the index has been migrated. Set the number of replicas and the refresh interval.

    For example, set the number of index replicas to 1 and the refresh interval to 1s.

    • If the security mode is used, run the following commands:
      curl -XPUT --tlsv1.2 --negotiate -k -u : 'https://IP:port/indexname/_settings' -H 'Content-Type: application/json' -d' {
      "number_of_replicas" : 1,
      "refresh_interval" : "1s"
      }'
    • If the normal mode is used, run the following commands:
      curl -XPUT 'http://IP:port/indexname/_settings' -H 'Content-Type: application/json' -d' {
      "number_of_replicas" : 1,
      "refresh_interval" : "1s"
      }'

  6. Check the index status of the target cluster. If the status of all indexes is green, replica data synchronization is complete.

    • If the security mode is used, run the following command:

      curl -XGET --tlsv1.2 --negotiate -k -u : 'https://IP:port/_cat/indices?v'

    • If the normal mode is used, run the following command:

      curl -XGET 'http://IP:port/_cat/indices?v'
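
The two checks in this procedure (equal data volume on both clusters, and green status for all indexes) can be scripted. The following is a minimal sketch for normal-mode clusters; the function names are hypothetical, and the parsing assumes the standard response layout of the _count API and the default column order of _cat/indices?v (health, status, index, and so on). For security mode, the curl calls would need the --tlsv1.2 --negotiate -k -u : options and https URLs used in the commands above.

```shell
#!/bin/sh
# Extract the "count" value from a _count API response read on stdin.
extract_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Print the names of indexes whose health is not green,
# given the output of _cat/indices?v on stdin (first line is the header).
non_green_indices() {
  awk 'NR > 1 && $1 != "green" { print $3 }'
}

# Compare the document count of one index on both clusters.
# Not called here because it needs reachable clusters.
compare_counts() {   # $1 = source URL, $2 = target URL, $3 = index name
  src=$(curl -s -XGET "$1/$3/_count" | extract_count)
  dst=$(curl -s -XGET "$2/$3/_count" | extract_count)
  if [ -n "$src" ] && [ "$src" = "$dst" ]; then
    echo "match: $src documents"
  else
    echo "mismatch: source=$src target=$dst"
    return 1
  fi
}
```

For example, compare_counts http://IP:port http://IP:port indexname reports whether the counts agree, and piping the _cat/indices?v output into non_green_indices lists the indexes that are still synchronizing replicas. While the refresh interval of the target index is -1, newly written documents may not be counted until the index has been refreshed.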