Using Scroll to Migrate Data
Scenario
Scroll is a tool for copying data between two Elasticsearch clusters. It migrates the data of one index at a time by using scroll (rolling traversal) queries and the bulk API, which makes it easier to reuse data and manage data security across clusters.
- Scroll can migrate data from source clusters running Elasticsearch 7.0.0 to 7.10.2 to target Elasticsearch clusters.
- MRS Elasticsearch is based on open-source Elasticsearch 7.10.2 and supports data migration between Elasticsearch clusters of the same version.
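The scroll-plus-bulk approach described above can be pictured with a minimal Python sketch. It is not the tool's implementation: the function name hits_to_bulk_body and the sample documents are hypothetical. It only shows how one page of scroll hits could be turned into a bulk API request body (newline-delimited JSON), with or without the source _id.

```python
import json


def hits_to_bulk_body(hits, target_index, copy_id=False):
    """Turn one page of scroll hits into a bulk API request body (NDJSON).

    hits: list of dicts shaped like the entries of hits.hits in a search
    response, each with "_id" and "_source".
    copy_id: when False, the target cluster generates new document IDs.
    """
    lines = []
    for hit in hits:
        action = {"index": {"_index": target_index}}
        if copy_id:
            # Reusing the source _id makes a repeated import overwrite
            # documents instead of duplicating them.
            action["index"]["_id"] = hit["_id"]
        lines.append(json.dumps(action))
        lines.append(json.dumps(hit["_source"]))
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical page of hits, as returned by one scroll request.
page = [
    {"_id": "1", "_source": {"name": "a"}},
    {"_id": "2", "_source": {"name": "b"}},
]
body = hits_to_bulk_body(page, "myindex", copy_id=True)
```

Each document contributes two lines: an action line naming the target index and a line carrying the document source. A real migration would repeat this for every scroll page until the source returns no more hits.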
Prerequisites
- The target Elasticsearch cluster is available.
- The index of the source Elasticsearch cluster has been created.
- The port and network communications between the source and target Elasticsearch clusters are normal.
- The upstream service has stopped write operations on the source cluster so that data remains consistent after the migration. Read operations can continue as usual. If write operations are not stopped, data may become inconsistent. After the migration is complete, switch to the target cluster to read and write data.
- If both the source and target clusters are in security mode, cross-cluster mutual trust has been configured.
- The Elasticsearch client has been installed in a directory, for example, /opt/client.
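The network prerequisite above can be checked before starting a migration. The following is a minimal sketch, assuming the clusters are addressed with the same ip:port,ip:port list format that the tool's configuration uses. The helper names parse_hosts and is_reachable are hypothetical and are not part of the Scroll tool.

```python
import socket


def parse_hosts(value):
    """Split an 'ip1:port1,ip2:port2' string into (host, port) tuples."""
    pairs = []
    for item in value.split(","):
        host, _, port = item.strip().rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (addresses are placeholders, replace with real nodes):
# for host, port in parse_hosts("10.0.0.1:9200,10.0.0.2:9200"):
#     print(host, port, is_reachable(host, port))
```

A successful TCP connection only shows that the port is open. It does not verify authentication or the Elasticsearch version.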
Procedure
Modify configuration files.
- On Manager, choose Cluster > Name of the desired cluster > Services > Elasticsearch > Configurations and search for the ELASTICSEARCH_SECURITY_ENABLE parameter. Check whether the parameter can be found and, if it can, whether its value is true. The value true indicates that the security mode is enabled.
- Upload the user authentication configuration file.
- Create a role, for example, ES_Role. For details, see Authentication Based on Users and Roles.
- Create a user.
- On Manager, choose System > Permission > User > Create.
- Enter a username, for example, test. Add user test to the elasticsearch and supergroup user groups and set the primary group to supergroup.
- Click OK.
- Choose System > Permission > User. Locate the newly created user and choose More > Download Authentication Credential. Then select the cluster information, and click OK to download the file.
- Upload the user.keytab and krb5.conf files obtained after the decompression to the specified directory of the Scroll migration tool. For details, see Table 1.
Table 1 Path for storing authentication files
Mode of Source Cluster | Mode of Target Cluster | Directory for User Authentication File
Normal mode | Normal mode | None
Security mode | Normal mode | The remote directory is created in the Elasticsearch/tools/elasticsearch-data2es/es2es_scroll/conf directory of the Elasticsearch client (example: mkdir /opt/client/Elasticsearch/tools/elasticsearch-data2es/es2es_scroll/conf/remote), and the user.keytab and krb5.conf files of the source cluster are uploaded to the remote directory.
Normal mode | Security mode | The user.keytab and krb5.conf files of the target cluster are uploaded to the Elasticsearch/tools/elasticsearch-data2es/es2es_scroll/conf directory on the Elasticsearch client.
Security mode | Security mode | Cross-cluster mutual trust must be configured between the target cluster and the source cluster, and the user.keytab and krb5.conf files of the target cluster are uploaded to the Elasticsearch/tools/elasticsearch-data2es/es2es_scroll/conf directory on the Elasticsearch client.
- All third-party Elasticsearch clusters are in normal mode.
- If the target cluster is in security mode and other users need to read or write index data from the target cluster, corresponding permissions should be assigned to the users. For details about the permissions, see Authentication Based on Users and Roles.
- Run the following commands to open and modify the espara.properties configuration file in the conf directory of the tool package:
cd /opt/client/Elasticsearch/tools/elasticsearch-data2es/es2es_scroll/conf
vi espara.properties
For details about parameter configuration, see Table 2. The following is a configuration example:
#################### destination: FusionInsight Elasticsearch #############################
esServerHost=ip1:port1,ip2:port2,ip3:port3
# unit:s
socketTimeout=30
connectTimeout=30
connectRequestTimeout=30
maxRetryTimeoutMillis=60
# Whether FI ES cluster is in secure mode. true: secure, false: normal
isSecureMode=false
# FI ES cluster authentication user
principal=
###################### source: old cluster #####################################
oldClusterHost=ip1:port1,ip2:port2,ip3:port3
# Whether old cluster is FI ES secure mode. true: secure, false: normal. NOTE: Third-party ES is false
isOldSecureMode=false
# Old cluster authentication user, configured only when isOldSecureMode is true
oldPrincipal=
# Old cluster needs Basic Authentication
oldClusterUser=
oldClusterPass=
# unit:s
oldCluster_socketTimeout=30
oldCluster_connectTimeout=30
oldCluster_connectRequestTimeout=30
oldCluster_maxRetryTimeoutMillis=60
###################### other config ######################################
# number of documents at a time: ie "size" in the scroll request (10000)
batchSize=10000
# The scroll parameter tells Elasticsearch to keep the search context open, unit: minute
scrollTime=10
# index to migrate
index=myindex
# Whether to create an index automatically.
# true: create an index automatically, copy settings and mapping of the old cluster index.
# false: need to create an index in advance and specify settings and mapping
copySettingAndMapping=true
# whether to copy _id of index
copyId=false
# thread pool size
threadNum=24
# time field of indices.
timeField=
beginTime=
endTime=
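Before running the tool, the edited file can be sanity-checked. The sketch below is an illustration only and is not shipped with the Scroll tool. The helper names load_properties and check_espara are hypothetical, and the rule that oldClusterUser and oldClusterPass must be set together is an assumption. The other rules follow the comments in the configuration example.

```python
def load_properties(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_espara(props):
    """Return a list of problems found in an espara.properties mapping."""
    problems = []
    for key in ("esServerHost", "oldClusterHost", "index"):
        if not props.get(key):
            problems.append(f"{key} is empty")
    if props.get("isSecureMode") == "true" and not props.get("principal"):
        problems.append("principal is required when isSecureMode=true")
    if props.get("isOldSecureMode") == "true" and not props.get("oldPrincipal"):
        problems.append("oldPrincipal is required when isOldSecureMode=true")
    # Assumption: basic authentication needs both a username and a password.
    if bool(props.get("oldClusterUser")) != bool(props.get("oldClusterPass")):
        problems.append("oldClusterUser and oldClusterPass must be set together")
    return problems


# Hypothetical file content: security mode is on but principal is missing.
sample = """
esServerHost=ip1:port1
isSecureMode=true
principal=
oldClusterHost=ip1:port1
isOldSecureMode=false
index=myindex
"""
problems = check_espara(load_properties(sample))
```

For the sample above, the only reported problem is the missing principal. To check a real file, pass the content of conf/espara.properties to load_properties.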
Table 2 espara.properties parameter description
Parameter | Default Value | Description
esServerHost | ip1:port1,ip2:port2,ip3:port3 | Specifies the Elasticsearch instances of the target cluster. Multiple IP address and port pairs can be configured for load balancing. Each pair is in the IP address:Port number format, and pairs are separated by commas (,). The IP address is the service plane IP address of any DataNode in the Elasticsearch cluster, and the port number is the external HTTP(S) port number of that node. To view the instance configuration, log in to Manager, choose Cluster > Name of the desired cluster > Services > Elasticsearch > Configurations, and search for the INSTANCE_SERVER_PORT_LIST parameter. NOTE: Do not obtain this parameter value from the EsMaster instances.
socketTimeout | 30 | Specifies the socket timeout of the target cluster. Unit: second.
connectTimeout | 30 | Specifies the connection timeout of the target cluster. Unit: second.
connectRequestTimeout | 30 | Specifies the connection request timeout of the target cluster. Unit: second.
maxRetryTimeoutMillis | 60 | Specifies the retry timeout of the target cluster. Unit: second.
isSecureMode | false | Specifies whether the target Elasticsearch cluster is in security mode. true indicates security mode, and false indicates normal mode.
principal | N/A | Specifies the authentication user of the target Elasticsearch cluster. Set this parameter if isSecureMode is set to true. The value is in the format of username@domain name of the target cluster.
oldClusterHost | ip1:port1,ip2:port2,ip3:port3 | Specifies the Elasticsearch instances of the source cluster. Each IP address and port pair is in the IP address:Port number format, and pairs are separated by commas (,). The IP address is the service plane IP address of any DataNode in the source Elasticsearch cluster, and the port number is the external HTTP(S) port number of that node.
isOldSecureMode | false | Specifies whether the source cluster is in security mode. true indicates security mode, and false indicates normal mode. For a third-party Elasticsearch cluster, set this parameter to false.
oldPrincipal | N/A | Specifies the authentication user of the source cluster. Set this parameter if isOldSecureMode is set to true. The value is in the format of username@domain name of the source cluster.
oldClusterUser | N/A | Specifies the basic authentication username. Set this parameter if basic authentication is required for the source cluster. NOTE: Basic authentication is required if the source cluster is an open-source cluster in security mode.
oldClusterPass | N/A | Specifies the password of the basic authentication user. Set this parameter if basic authentication is required for the source cluster. NOTE: Basic authentication is required if the source cluster is an open-source cluster in security mode.
oldCluster_socketTimeout | 30 | Specifies the socket timeout of the source cluster. Unit: second.
oldCluster_connectTimeout | 30 | Specifies the connection timeout of the source cluster. Unit: second.
oldCluster_connectRequestTimeout | 30 | Specifies the connection request timeout of the source cluster. Unit: second.
oldCluster_maxRetryTimeoutMillis | 60 | Specifies the retry timeout of the source cluster. Unit: second.
batchSize | 10000 | Specifies the number of documents fetched by each scroll request (the size parameter of the scroll query). Set a smaller value if individual documents are large. The recommended data volume of a batch is 5 MB to 15 MB. By default, a batch cannot exceed 100 MB.
scrollTime | 10 | Specifies how long the search context of a scroll query is kept open. Unit: minute.
index | myindex | Specifies the name of the index to migrate. Only a single index can be configured.
copySettingAndMapping | true | Specifies whether to create the index automatically. If the value is true, the index is created automatically and the settings and mapping of the source index are copied: the number of primary shards and the analysis settings are copied, while the number of replica shards is set to 0 and the refresh interval is set to -1 to speed up indexing. If the value is false, the index must be created in advance with its settings and mapping specified.
copyId | false | Specifies whether to copy the index metadata field _id. If _id does not affect service queries, set this parameter to false to improve performance. However, importing the same index repeatedly may then cause duplicate data.
threadNum | 24 | Specifies the number of threads in the thread pool.
timeField | N/A | Specifies the time field in the index. If this field is set, only data within the period from beginTime to endTime is migrated. The time range is [beginTime, endTime). If this field is not set, all data of the index is migrated. NOTE: The formats of beginTime and endTime must be the same as the format of the time field in the index. Otherwise, the migration fails.
beginTime | N/A | Specifies the start time of the time field for the index to migrate.
endTime | N/A | Specifies the end time of the time field for the index to migrate.
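The half-open time range described for timeField, beginTime, and endTime corresponds to an Elasticsearch range query with an inclusive lower bound and an exclusive upper bound. The following is a minimal sketch of such a query body. How the tool builds its query internally is not documented here, so the function time_range_query and the field name used in the example are hypothetical.

```python
def time_range_query(time_field, begin_time, end_time):
    """Build a search body matching begin_time <= time_field < end_time.

    Returns match_all when no time field is configured, mirroring the
    documented behaviour that all data is migrated in that case.
    """
    if not time_field:
        return {"query": {"match_all": {}}}
    return {
        "query": {
            "range": {time_field: {"gte": begin_time, "lt": end_time}}
        }
    }


# Hypothetical field name and values, in the same format as the index data.
query = time_range_query("create_time", "2023-01-01", "2023-02-01")
```

Because the upper bound uses lt rather than lte, consecutive runs with adjacent ranges do not migrate the boundary documents twice.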
You can view the cluster domain name on the cluster page. Log in to Manager, choose System > Permission > Domain and Mutual Trust, and view the value of Local Domain.
Import data from the source cluster to target cluster.
- Run the following command to configure environment variables:
source /opt/client/bigdata_env
- Run the following commands to migrate data:
cd /opt/client/Elasticsearch/tools/elasticsearch-data2es/es2es_scroll
java -cp ./../lib/*:./conf/ com.*.fusioninsight.es.tool.scroll.DataMigrate
- Run the curl command to check whether the data is imported.
- If the security mode is used, run the following commands:
curl -XGET --tlsv1.2 --negotiate -k -u : "https://IP:port/indexname/_search?pretty"
curl -XGET --tlsv1.2 --negotiate -k -u : "https://IP:port/_cat/indices?v"| grep indexname
- If the normal mode is used, run the following commands:
curl -XGET "http://IP:port/indexname/_search?pretty"
curl -XGET "http://IP:port/_cat/indices?v"| grep indexname
For details about how to use the curl command, see Running curl Commands in Linux.
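Besides inspecting search results, the document counts of the source and target indexes can be compared. The sketch below assumes normal-mode clusters reachable over plain HTTP and uses the Elasticsearch _count API. The helper names are hypothetical, and security-mode clusters would additionally need Kerberos authentication, which is not shown.

```python
import json
from urllib.request import urlopen


def parse_count(response_text):
    """Extract the document count from a _count API response body."""
    return json.loads(response_text)["count"]


def fetch_count(base_url, index, timeout=30):
    """Call GET <base_url>/<index>/_count on a normal-mode cluster."""
    with urlopen(f"{base_url}/{index}/_count", timeout=timeout) as resp:
        return parse_count(resp.read().decode("utf-8"))


def migration_complete(source_count, target_count):
    """Treat the migration as complete when both counts are equal."""
    return source_count == target_count


# Example usage (URLs are placeholders for real cluster nodes):
# src = fetch_count("http://IP:port", "indexname")
# dst = fetch_count("http://IP:port", "indexname")
# print(migration_complete(src, dst))
```

Equal counts indicate that the same number of documents exists on both sides. They do not prove that the document contents are identical.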
- After data migration is complete, a json directory is generated in the tool package directory if some data fails to be imported. Run the script in the sbin directory to re-import the data that failed to be imported.
cd /opt/client/Elasticsearch/tools/elasticsearch-data2es/es2es_scroll
sh ./sbin/input.sh
In the Elasticsearch/tools/elasticsearch-data2es/es2es_scroll/logs directory of the Elasticsearch client, you can view the log information about the import process.
- After the migration is complete, set the number of index replicas and the refresh interval of the index.
If the index data volume of the target cluster is the same as that of the source cluster, the index has been migrated. Set the number of replicas and the refresh interval.
For example, set the number of index replicas to 1 and the refresh interval to 1s.
- If the security mode is used, run the following commands:
curl -XPUT --tlsv1.2 --negotiate -k -u : 'https://IP:port/indexname/_settings' -H 'Content-Type: application/json' -d' { "number_of_replicas" : 1, "refresh_interval" : "1s" }'
- If the normal mode is used, run the following commands:
curl -XPUT 'http://IP:port/indexname/_settings' -H 'Content-Type: application/json' -d' { "number_of_replicas" : 1, "refresh_interval" : "1s" }'
- Check the index status of the target cluster. If the status of all indexes is green, replica data synchronization is complete.
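The index status check above can be automated by parsing the text returned by _cat/indices?v. This is a minimal sketch, assuming the default column layout in which the header row contains health and index columns. The function name index_health and the sample output are illustrative, not captured from a real cluster.

```python
def index_health(cat_output, index_name):
    """Return the health column for index_name from `_cat/indices?v` output.

    Returns None when the index is not listed.
    """
    lines = [ln for ln in cat_output.splitlines() if ln.strip()]
    header = lines[0].split()
    health_col = header.index("health")
    index_col = header.index("index")
    for line in lines[1:]:
        cols = line.split()
        if cols[index_col] == index_name:
            return cols[health_col]
    return None


# Illustrative output only; the uuid and sizes are made up.
sample_output = """\
health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   myindex u8FNjxh8Rfy_awN11oDKYQ   1   1       1200            0     88.1kb         44kb
"""
status = index_health(sample_output, "myindex")
```

A status of yellow after the replica count is raised usually means replica shards are still being allocated, so the check can be repeated until green is returned.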