
Using Logstash to Perform Incremental Data Migration

Logstash supports both full and incremental data migration. You can perform a full migration for the first data migration, and incremental migrations afterwards. This section describes how to use Logstash to incrementally migrate cluster data. Incremental migration requires that indexes contain a time field (timestamp) that identifies the incremental data.

Prepare for the migration by referring to Restrictions and Preparations, and then complete the migration by following Step 1 through Step 5 below.

Restrictions

  • Logstash version restrictions:

    CSS supports clusters of versions 5.5.1, 6.3.2, 6.5.4, 7.1.1, 7.6.2, and 7.10.2. Ensure that the major versions of the clusters whose data you want to migrate are the same.

    If the Elasticsearch cluster version is 5.x, select Logstash 5.6.16. If the Elasticsearch cluster version is 7.x, select Logstash 7.10.0.

  • Do not modify indexes during cluster migration. Otherwise, the original data will be inconsistent with the migrated data.
  • If the indexes to be migrated contain less than 100 GB of data, separate index analysis is not required.

Preparations

  • Create a VM for data migration.
    1. Create a VM to migrate the metadata of the source cluster.
      1. Create a Linux ECS with 2 vCPUs and 4 GB memory.
      2. Run the curl http://{IP_address}:{port} command to test the connectivity between the VM and the source cluster, and between the VM and the destination cluster.

        IP_address indicates the access address of the source or destination cluster, and port indicates the actual port number of the cluster. The default port is 9200.

        The following example applies only to non-security clusters.

        curl http://10.234.73.128:9200
        {
          "name" : "voc_es_cluster_new-ess-esn-1-1",
          "cluster_name" : "voc_es_cluster_new",
          "cluster_uuid" : "1VbP7-39QNOx_R-llXKKtA",
          "version" : {
            "number" : "6.5.4",
            "build_flavor" : "default",
            "build_type" : "tar",
            "build_hash" : "d2ef93d",
            "build_date" : "2018-12-17T21:17:40.758843Z",
            "build_snapshot" : false,
            "lucene_version" : "7.5.0",
            "minimum_wire_compatibility_version" : "5.6.0",
            "minimum_index_compatibility_version" : "5.0.0"
          },
          "Tagline" : "You Know, for Search"
        }
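
        If you prefer to script this connectivity check, the following minimal Python sketch uses the requests library (installed later in Preparations). The addresses are placeholders that you should replace with your actual cluster endpoints, and like the curl command above, this sketch applies only to non-security clusters.

        # -*- coding:UTF-8 -*-
        # Probe the source and destination clusters and print their versions.
        import requests

        clusters = {
            "source": "http://10.234.73.128:9200",       # placeholder address
            "destination": "http://10.234.73.129:9200",  # placeholder address
        }

        for name, url in clusters.items():
            try:
                response = requests.get(url, timeout=10)
                # A reachable cluster returns HTTP 200 together with its version information.
                print(name + " reachable, version: " + response.json()["version"]["number"])
            except requests.exceptions.RequestException as error:
                print(name + " unreachable: " + str(error))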
  • Prepare the tools and software.

    The installation method depends on whether the VM can connect to the Internet. If the VM has Internet access, use yum and pip to install the software online. If the VM cannot connect to the Internet, download the installation packages to the VM and install them offline.

    The online installation procedure is as follows:

    1. Run yum install python2 to install python2.
      [root@ecs opt]# yum install python2
    2. Run yum install python-pip to install pip.
      [root@ecs opt]# yum install python-pip
    3. Run pip install pyyaml to install the YAML dependency.
    4. Run pip install requests to install the requests dependency.

    The offline installation procedure is as follows:

    1. Download the Python 2 source package (Python-2.7.18.tgz) from https://www.python.org/downloads/release/python-2718/.
      Figure 1 Downloading the python2 package
    2. Use WinSCP to upload the Python installation package to the opt directory and install Python.
      # Decompress the Python package.
      [root@ecs-52bc opt]# tar -xvf Python-2.7.18.tgz
      Python-2.7.18/Modules/zlib/crc32.c
      Python-2.7.18/Modules/zlib/gzlib.c
      Python-2.7.18/Modules/zlib/inffast.c
      Python-2.7.18/Modules/zlib/example.c
      Python-2.7.18/Modules/python.c
      Python-2.7.18/Modules/nismodule.c
      Python-2.7.18/Modules/Setup.config.in
      ...
      # After the decompression, go to the directory.
      [root@ecs-52bc opt]# cd Python-2.7.18
      # Configure the build and specify the installation path.
      [root@ecs-52bc Python-2.7.18]# ./configure --prefix=/usr/local/python2
      ...
      checking for build directories... checking for --with-computed-gotos... no value specified
      checking whether gcc -pthread supports computed gotos... yes
      done
      checking for ensurepip... no
      configure: creating ./config.status
      config.status: creating Makefile.pre
      config.status: creating Modules/Setup.config
      config.status: creating Misc/python.pc
      config.status: creating Modules/ld_so_aix
      config.status: creating pyconfig.h
      creating Modules/Setup
      creating Modules/Setup.local
      creating Makefile
      # Compile Python.
      [root@ecs-52bc Python-2.7.18]# make
      # Install Python.
      [root@ecs-52bc Python-2.7.18]# make install
    3. Check the Python installation result.
      # Check the Python version.
      [root@ecs-52bc Python-2.7.18]# python --version
      Python 2.7.5
      # Check the pip version.
      [root@ecs-52bc Python-2.7.18]# pip --version
      pip 7.1.2 from /usr/lib/python2.7/site-packages/pip-7.1.2-py2.7.egg (python 2.7)
      [root@ecs-52bc Python-2.7.18]#
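
    After the installation, you can optionally confirm that the pyyaml and requests dependencies are importable. This check is not part of the official procedure, just a quick sanity test:

      # Verify that the yaml and requests modules can be imported.
      [root@ecs-52bc opt]# python -c "import yaml, requests; print('dependencies OK')"
      dependencies OK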
  • Prepare the execution script.
    1. Run the vi migrateConfig.yaml command and create the following configuration file:
      es_cluster_new:
        # Cluster name
        clustername: es_cluster_new
        # Address of the source Elasticsearch cluster, starting with http://
        src_ip: http://x.x.x.x:9200
        # If the cluster has no username or password, set both to "".
        src_username: ""
        src_password: ""
        # Address of the destination Elasticsearch cluster, starting with http://
        dest_ip: http://x.x.x.x:9200
        # If the cluster has no username or password, set both to "".
        dest_username: ""
        dest_password: ""
        # Optional. The default value is false. Used by migrateMapping.py.
        # Whether to process only the indexes listed in the mapping below.
        # If set to true, only the indexes listed in the mapping below are read from the source and created in the destination.
        # If set to false, all indexes of the source cluster are processed, except system indexes (.kibana, .*).
        # Index names are matched against the mapping below. If an index name matches a key, the mapped value is used as the index name in the destination; otherwise, the original source index name is kept.
        only_mapping: false
        # Indexes to be migrated. The key is the index name in the source cluster, and the value is the index name in the destination cluster.
        mapping:
            test_index_1: test_index_1

        # Optional. The default value is false. Used by checkIndices.py.
        # If set to false, both the index list and the document counts are compared. If set to true, only the index list is compared.
        only_compare_index: false

      Configuration file parameters:

        clustername: Cluster name.
        src_ip: IP address for accessing the source cluster. Only one address is required. The default port is 9200; if the cluster uses a different port, specify the actual port number.
        src_username: Username for accessing the source cluster. If not required, set it to "".
        src_password: Password for accessing the source cluster. If not required, set it to "".
        dest_ip: IP address for accessing the destination cluster. Only one address is required. The default port is 9200; if the cluster uses a different port, specify the actual port number.
        dest_username: Username for accessing the destination cluster. If not required, set it to "".
        dest_password: Password for accessing the destination cluster. If not required, set it to "".
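
      Optionally, before moving on, you can check that the configuration file parses and that every cluster in it responds. The following Python sketch is not part of the official procedure; it only assumes the migrateConfig.yaml format shown above.

      # -*- coding:UTF-8 -*-
      # Optional sanity check: parse migrateConfig.yaml and probe each configured cluster.
      import yaml
      import requests

      with open("migrateConfig.yaml") as config_file:
          config = yaml.load(config_file)  # on Python 3, use yaml.load(config_file, Loader=yaml.FullLoader)

      for name, value in config.items():
          for side in ("src", "dest"):
              url = value[side + "_ip"]
              auth = None
              if value[side + "_username"] != "":
                  auth = (value[side + "_username"], value[side + "_password"])
              response = requests.get(url, auth=auth, timeout=10)
              print(name + " " + side + " " + url + " -> HTTP " + str(response.status_code))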

    2. Run the vi checkIndices.py command and copy the following script into the file to create the index data comparison script:
      # -*- coding:UTF-8 -*-
      import sys
      import yaml
      import requests
      import re
      import json
      import os
      
      
      def printDividingLine():
          print("<=============================================================>")
      
      
      def get_cluster_version(url, auth=None):
          response = requests.get(url, auth=auth)
          if response.status_code != 200:
              print("*** get ElasticSearch message failed. resp statusCode:" + str(
                  response.status_code) + " response is " + response.text)
              return False
          cluster = response.json()
          version = cluster["version"]["number"]
          return version
      
      
      # get all indices
      def get_indices(url, source_auth):
          response = requests.get(url + "/_alias", auth=source_auth)
          if response.status_code != 200:
              print("*** get all index failed. resp statusCode:" + str(
                  response.status_code) + " response is " + response.text)
              exit()
          all_index = response.json()
          system_index = []
          create_index = []
          for index in list(all_index.keys()):
              if (index.startswith(".")):
                  system_index.append(index)
              else:
                  create_index.append(index)
          return create_index
      
      
      def get_mapping(url, _auth, index):
          source_url = url + "/" + index
          index_response = requests.get(source_url, auth=_auth)
          if index_response.status_code != 200:
              print("*** get ElasticSearch message failed. resp statusCode:" + str(
                  index_response.status_code) + " response is " + index_response.text)
              return "[failure] --- index is not exist in destination es. ---"
          mapping = index_response.json()
          return mapping
      
      
      def get_index_total(url, index, es_auth):
          stats_url = url + "/" + index + "/_stats"
          index_response = requests.get(stats_url, auth=es_auth, verify=False)
          if index_response.status_code != 200:
              print("*** get ElasticSearch stats message failed. resp statusCode:" + str(
                  index_response.status_code) + " response is " + index_response.text)
              return 0
          return index_response.json()
      
      
      def get_indices_stats(url, es_auth):
          endpoint = url + "/_cat/indices"
          indices_result = requests.get(endpoint, auth=es_auth)
          indices_list = indices_result.text.split("\n")
          index_list = []
          for indices in indices_list:
              # Each non-empty line of the _cat/indices output has the index name in the third column.
              if indices.strip() != "":
                  index_list.append(indices.split()[2])
          return index_list
      
      
      def loadConfig(argv):
          if argv is None or len(argv) != 2:
              config_yaml = "migrateConfig.yaml"
          else:
              config_yaml = argv[1]
          config_file = open(config_yaml)
          # python3
          # return yaml.load(config_file, Loader=yaml.FullLoader)
          return yaml.load(config_file)
      
      
      def main(argv):
          requests.packages.urllib3.disable_warnings()
          print("begin to migrate index mapping!")
          config = loadConfig(argv)
          src_clusters = config.keys()
      
          print("begin to process cluster name :")
          for name in src_clusters:
              print(name)
          print("cluster count:" + str(src_clusters.__len__()))
      
          for name, value in config.items():
              printDividingLine()
              source = value["src_ip"]
              source_user = value["src_username"]
              source_passwd = value["src_password"]
              source_auth = None
              if source_user != "":
                  source_auth = (source_user, source_passwd)
              dest = value["dest_ip"]
              dest_user = value["dest_username"]
              dest_passwd = value["dest_password"]
              dest_auth = None
              if dest_user != "":
                  dest_auth = (dest_user, dest_passwd)
              cluster_name = name
              if "clustername" in value:
                  cluster_name = value["clustername"]
      
              print("start to process cluster :" + cluster_name)
              # get all indices
              all_source_index = get_indices(source, source_auth)
              all_dest_index = get_indices(dest, dest_auth)
      
              if not os.path.exists("mappingLogs"):
                  os.makedirs("mappingLogs")
              filename = "mappingLogs/" + str(cluster_name) + "#indices_stats"
              with open(filename + ".json", "w") as f:
                  json.dump("cluster name: " + cluster_name, f)
                  f.write("\n")
                  json.dump("source indices: ", f)
                  f.write("\n")
                  json.dump(all_source_index, f, indent=4)
                  f.write("\n")
                  json.dump("destination indices : ", f)
                  f.write("\n")
                  json.dump(all_dest_index, f, indent=4)
                  f.write("\n")
      
              print("source indices total     : " + str(all_source_index.__len__()))
              print("destination index total  : " + str(all_dest_index.__len__()))
      
              filename_src = "mappingLogs/" + str(cluster_name) + "#indices_source_mapping"
              filename_dest = "mappingLogs/" + str(cluster_name) + "#indices_dest_mapping"
              with open(filename_src + ".json", "a") as f_src:
                  json.dump("cluster name: " + cluster_name, f_src)
                  f_src.write("\n")
              with open(filename_dest + ".json", "a") as f_dest:
                  json.dump("cluster name: " + cluster_name, f_dest)
                  f_dest.write("\n")
              for index in all_source_index:
                  mapping = get_mapping(source, source_auth, index)
                  with open(filename + ".json", "a") as f_src:
                      json.dump("========================", f_src)
                      f_src.write("\n")
                      json.dump(mapping, f_src, indent=4)
                      f_src.write("\n")
                  with open(filename_src + ".json", "a") as f_src:
                      json.dump("========================", f_src)
                      f_src.write("\n")
                      json.dump(mapping, f_src, indent=4)
                      f_src.write("\n")
      
                  mapping = get_mapping(dest, dest_auth, index)
                  with open(filename + ".json", "a") as f_dest:
                      json.dump("========================", f_dest)
                      f_dest.write("\n")
                      json.dump(mapping, f_dest, indent=4)
                      f_dest.write("\n")
                  with open(filename_dest + ".json", "a") as f_src:
                      json.dump("========================", f_src)
                      f_src.write("\n")
                      json.dump(mapping, f_src, indent=4)
                      f_src.write("\n")
      
              print("source indices write file success,      file: " + filename_src)
              print("destination indices write file success, file: " + filename_dest)
      
              if "only_compare_index" in value and value["only_compare_index"]:
                  print("[success] only compare mapping, not compare index count.")
                  continue
      
              for index in all_source_index:
                  index_total = get_index_total(value["src_ip"], index, source_auth)
                  src_total = index_total["_all"]["primaries"]["docs"]["count"]
                  src_size = int(index_total["_all"]["primaries"]["store"]["size_in_bytes"]) / 1024 / 1024
                  dest_index = get_index_total(value["dest_ip"], index, dest_auth)
                  if dest_index == 0:
                      print('[failure] not found, index: %-20s, source total: %-10s size %6sM'
                            % (str(index), str(src_total), src_size))
                      continue
                  dest_total = dest_index["_all"]["primaries"]["docs"]["count"]
                  if src_total != dest_total:
                      print('[failure] not consistent, '
                            'index: %-20s, source total: %-10s size %6sM destination total: %-10s '
                            % (str(index), str(src_total), src_size, str(dest_total)))
                      continue
                  print('[success] compare index total equal : index : %-20s,  total: %-20s '
                        % (str(index), str(dest_total)))
      
      
      if __name__ == '__main__':
          main(sys.argv)
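
    By default, checkIndices.py reads migrateConfig.yaml from the current directory. To compare against a different configuration file, pass its path as the first argument:

      python checkIndices.py migrateConfig.yaml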

Step 1: Creating a Logstash Cluster

  • Logstash clusters are used to migrate data. By default, Logstash clusters are billed in pay-per-use mode. To reduce costs, you are advised to delete the Logstash cluster once the data migration is complete.
  • If multiple indexes need to be migrated, you can create multiple Logstash clusters and configure a different migration task for each.
  1. Log in to the CSS management console.
  2. On the Dashboard or Clusters page, choose Logstash in the navigation pane on the left.
  3. Click Create Cluster. The Create Cluster page is displayed.
  4. Specify Region and AZ.
  5. Specify the basic cluster information, select the cluster type and cluster version, and enter the cluster name.
    Table 1 Basic parameters

      Cluster Type: Select Logstash.
      Version: 5.6.16 and 7.10.0 are supported. If the Elasticsearch cluster version is 5.x, select Logstash 5.6.16. If the Elasticsearch cluster version is 7.x, select Logstash 7.10.0.
      Name: Cluster name, which contains 4 to 32 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed, and the value must start with a letter.

    Figure 2 Configuring basic information
  6. Set host specifications of the cluster. Set the number of Nodes to 1. Set Node Specifications to 8 vCPUs | 16 GB and retain the default values for other parameters.
    Figure 3 Configuring host specifications
  7. Set the enterprise project. Retain the default value.
  8. Click Next: Configure Network. Configure the cluster network.
    Table 2 Parameter description

      VPC: A VPC is a secure, isolated, logical network environment. Select the target VPC. Click View VPC to open the VPC management console and view the names and IDs of the created VPCs. If no VPC is available, create one.
        NOTE: The VPC must contain CIDR blocks. Otherwise, cluster creation will fail. By default, a VPC contains CIDR blocks.
      Subnet: A subnet provides dedicated network resources that are isolated from other networks, improving network security. Select the target subnet. You can view the names and IDs of the existing subnets in the VPC on the VPC management console.
      Security Group: A security group implements access control for ECSs that have the same security protection requirements in a VPC. To view details about a security group, click View Security Group.
        NOTE: Ensure that the selected security group allows access to port 9200, that is, its Port Range/ICMP Type is Any or includes port 9200.

    Figure 4 Configuring network specifications
  9. Click Next: Configure Advanced Settings. You can select Default or Custom for Advanced Settings. Retain the default settings in this example.
  10. Click Next: Confirm. Check the configuration and click Next to create a cluster.
  11. Click Back to Cluster List to switch to the Clusters page. The cluster you created is listed on the displayed page and its status is Creating. If the cluster is successfully created, its status will change to Available.

Step 2: Verifying Cluster Connectivity

Verify the connectivity between Logstash and the source and destination clusters.

  1. On the Logstash clusters page, click the name of the Logstash cluster created in Step 1: Creating a Logstash Cluster. The Cluster Information page is displayed. Choose Configuration Center in the navigation pane on the left, or click Configuration Center in the Operation column of the target cluster, to open the Configuration Center page.
  2. On the Configuration Center page, click Test Connectivity.
  3. Enter the IP addresses or domain names and port numbers of the source and destination clusters, and click Test.
    Figure 5 Testing the connectivity

Step 3: Configuring a Logstash Incremental Data Migration Task

  1. On the Logstash clusters page, click the name of the Logstash cluster created in Step 1: Creating a Logstash Cluster. The Cluster Information page is displayed. Choose Configuration Center, or click Configuration Center in the Operation column of the target cluster. The Configuration Center page is displayed.
  2. Click Create in the upper right corner. On the configuration file creation page that is displayed, select a cluster template and modify the migration configuration file of the Elasticsearch cluster.
    In this example, HTTPS is not enabled for the two Elasticsearch clusters.
    • Select a cluster template: In this example, data is imported from an Elasticsearch cluster to an Elasticsearch cluster. Locate the elasticsearch row and click Apply in the Operation column. Add cluster configurations as required.
    • Modify the configuration file: set the configuration name, for example, es-es-inc, and edit the migration configuration for the Elasticsearch clusters. The following is an example of the configuration file:
      input {
          elasticsearch {
              hosts => ["xx.xx.xx.xx:9200"]
              user => "css_logstash"
              password => "******"
              index => "*_202102"
              query => '{"query":{"bool":{"should":[{"range":{"postsDate":{"from":"2021-05-25 00:00:00"}}}]}}}'
              docinfo => true
              size => 1000
              #scroll => "5m"
          }
      }

      filter {
          mutate {
              remove_field => ["@timestamp", "@version"]
          }
      }

      output {
          elasticsearch {
              hosts => ["xx.xx.xx.xx:9200","xx.xx.xx.xx:9200"]
              user => "admin"
              password => "******"
              index => "%{[@metadata][_index]}"
              document_type => "%{[@metadata][_type]}"
              document_id => "%{[@metadata][_id]}"
          }

          #stdout { codec => rubydebug { metadata => true }}
      }

      Modify the following configurations:

      Table 3 Modification of cluster configurations

        hosts: Access addresses of the source and destination clusters. If a cluster has multiple access nodes, enter all of their addresses.
        user: Username for accessing the cluster. If the cluster does not use a username, comment out this item with a number sign (#).
        password: Password for accessing the cluster. If the cluster does not use a username or password, comment out this item with a number sign (#).
        index: Index to be incrementally migrated. One configuration file supports the incremental migration of only one index.
        query: Identifier of the incremental data, usually an Elasticsearch DSL statement that needs to be worked out in advance. In this example, postsDate is the time field in the service data.
          {"query":{"bool":{"should":[{"range":{"postsDate":{"from":"2021-05-25 00:00:00"}}}]}}}
          This query migrates data added after 2021-05-25. For repeated incremental migrations, update the time value each time. If the time field in the source indexes is stored in timestamp format, convert the boundary value to a timestamp here. Verify the validity of the query in advance, for example, with the sketch after this table.
        scroll: If the source cluster holds massive data, you can use the scroll function to read data in batches and prevent Logstash memory overflow. The default value is "1m". Do not set an overlong interval; otherwise, data may be lost.

      The incremental migration configuration varies from index to index and must be worked out based on the index analysis.
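
      The following Python sketch is one optional way to verify the query in advance using the Elasticsearch _count API. The field name postsDate and the index pattern *_202102 come from the example above; the epoch-millisecond conversion applies only if your time field is stored as a timestamp rather than a date string.

      # -*- coding:UTF-8 -*-
      # Optional: run the incremental query through _count on the source cluster
      # to confirm it is valid and to see how many documents it matches.
      import json
      import time
      import requests

      src = "http://x.x.x.x:9200"         # source cluster address (placeholder)
      auth = ("css_logstash", "******")   # set auth = None for a non-security cluster

      start = "2021-05-25 00:00:00"
      # If the time field is stored as an epoch-millisecond timestamp, convert the
      # boundary value before putting it into the query.
      epoch_ms = int(time.mktime(time.strptime(start, "%Y-%m-%d %H:%M:%S"))) * 1000

      query = {"query": {"bool": {"should": [{"range": {"postsDate": {"from": start}}}]}}}

      response = requests.get(src + "/*_202102/_count", auth=auth,
                              headers={"Content-Type": "application/json"},
                              data=json.dumps(query))
      # A valid query returns HTTP 200 and a JSON body containing the matched document count.
      print(str(response.status_code) + " " + response.text)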

Step 4: Performing an Incremental Data Migration

  1. Use PuTTY to log in to the Linux VM created in Preparations.
  2. On the Logstash clusters page, click the name of the Logstash cluster created in Step 1: Creating a Logstash Cluster. The Cluster Information page is displayed. Choose Configuration Center, or click Configuration Center in the Operation column of the target cluster. The Configuration Center page is displayed.
  3. Select the configuration file created in Step 3: Configuring a Logstash Incremental Data Migration Task and click Start in the upper left corner.
  4. When prompted, confirm that data migration starts immediately after the Logstash service starts.
  5. The started configuration file is displayed in the pipeline list.
  6. After the data migration is complete, use PuTTY to log in to the Linux VM and run the python checkIndices.py command to compare the data.
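
    If the clusters are consistent, the script prints a [success] line for each index; missing or inconsistent indexes are reported with [failure] lines instead. For example (the index name and count here are illustrative):

      [success] compare index total equal : index : test_index_1,  total: 12345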

Step 5: Deleting the Logstash Cluster

After the cluster migration is complete, delete the Logstash cluster.

  1. Log in to the CSS management console.
  2. Choose Clusters > Logstash. On the displayed page, locate the row that contains the target cluster and click More > Delete in the Operation column.
  3. In the displayed dialog box, enter the name of the cluster to confirm the deletion, and click OK.