Migrating Data from Hadoop to MRS with CDM
Scenarios
Cloud Data Migration (CDM) is an efficient, easy-to-use batch data migration service. Built on cloud-based big data migration and data lake capabilities, it provides straightforward functions for migrating data and integrating diverse data sources into a unified data lake, which simplifies migration and integration work and significantly improves efficiency.
This section describes how to migrate data from Hadoop clusters in an on-premises IDC or on a public cloud to Huawei Cloud MRS. The data volume can be tens of TBs or less.
Solution Architecture
CDM supports both full and incremental file migration. Full migration is implemented by copying files. You can implement incremental migration by setting Duplicate File Processing Method to Skip.
- Full migration
- Create two links on CDM to connect HDFS of the source cluster and the HDFS or OBS file system of Huawei Cloud MRS.
- Create a full migration job on CDM, configure source and destination parameters, and start the job.
- Incremental migration
- Create a migration job on CDM again. When configuring destination parameters, select Skip for Duplicate File Processing Method.
- Start the migration job.
The process for using CDM to migrate Hadoop data to an MRS cluster is as follows.

Solution Advantages
- Easy to use: The wizard-based interface requires no programming; you can configure and create migration jobs in minutes.
- High migration efficiency: Migration and transmission performance is improved by the underlying distributed computing framework, and write performance for specific data sources is further optimized.
- Real-time monitoring: Migration jobs are monitored in real time, and alarms and notifications are generated automatically.
Impact on the System
- During the migration, data inconsistency may occur if changes to HDFS files in the source cluster are not synchronized to the destination cluster in time.
You can use a verification tool to identify inconsistent data and then migrate or supplement it (see the sketch after this list).
- The migration may cause the performance of the source cluster to deteriorate, increasing the response time of source services. It is recommended that you migrate data during off-peak hours and properly configure resources, including compute, storage, and network resources, in the source cluster to ensure that the cluster can handle the migration workloads.
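As a minimal sketch of such a consistency check, assuming the HDFS client can reach both clusters and that the NameNode host names, the port 8020, and the /user/data path are placeholders for your actual values, you can compare directory summaries and spot-check checksums with standard HDFS commands:

```bash
# Compare file/directory counts and total bytes for the same directory on
# the source and destination clusters (hosts, port, and path are placeholders).
hdfs dfs -count -h hdfs://source-namenode:8020/user/data
hdfs dfs -count -h hdfs://dest-namenode:8020/user/data

# Spot-check an individual file; checksums are comparable only when both
# clusters use the same block size and checksum settings.
hdfs dfs -checksum hdfs://source-namenode:8020/user/data/part-00000
hdfs dfs -checksum hdfs://dest-namenode:8020/user/data/part-00000
```

Files found to be missing or inconsistent can then be re-migrated, for example by removing the affected copies on the destination and rerunning the incremental job.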
Migration Survey
Before migrating HDFS data, conduct a survey on the source HDFS component to evaluate product compatibility, migration risks, and the impact on the system. For details, see Table 1. A command sketch for collecting several of these values is provided after the table.
Table 1 HDFS migration survey

No. | Survey Item | Survey Question
---|---|---
1 | Version compatibility | What is the HDFS cluster version?
2 | Total capacity | What is the total storage capacity required by the user? (Planned total disk capacity = Estimated capacity x Number of replicas)
3 | Total data volume | How much data is processed by HDFS?
4 | Number of files | How many files are stored on HDFS?
5 | Number of replicas | What is the number of replicas? It is three by default and can be changed if required.
6 | Percentage of small files | What is the approximate distribution of large and small files? (Percentage of small files and percentage of large files; files smaller than 128 MB are considered small files.)
7 | Number of reads/writes | What are the peak reads/writes per second? (11,000 reads/second and 3,000 writes/second on a single node)
8 | Throughput | What is the peak read/write throughput per second? (100 MB/s per node)
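As a worked example of the formula in row 2: with an estimated 10 TB of data and the default of three replicas, the planned total disk capacity is 10 TB x 3 = 30 TB. The following sketch lists standard commands that can help collect several of the survey values; it assumes you can run the HDFS client on the source cluster (dfsadmin requires HDFS administrator privileges), and the paths shown are only examples.

```bash
# HDFS cluster version (row 1).
hadoop version

# Configured capacity, used space, and remaining space for the cluster (rows 2 and 3).
hdfs dfsadmin -report | head -n 20

# Number of directories, files, and total bytes under a path (row 4).
hdfs dfs -count -h /

# Default replication factor (row 5).
hdfs getconf -confKey dfs.replication
```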
Networking Types
The migration solution supports various networking types, such as the public network, VPN, and Direct Connect. Select a networking type based on the site requirements. The migration can be performed only when the source and destination networks can communicate with CDM.
Migration Network Type | Advantage | Disadvantage
---|---|---
Direct Connect | Dedicated, stable bandwidth with low latency, suitable for migrating large volumes of data. | A Direct Connect connection must be provisioned in advance and is relatively costly.
VPN | Encrypted transmission over the public network at a relatively low cost. | Bandwidth and stability are limited by the quality of the public network.
Public IP address | Simple to set up; no additional network services are required. | Data traverses the public Internet, so bandwidth is not guaranteed and security risks are higher.
Notes and Constraints
- Migrating a large volume of data has high requirements on network communication. When a migration task is executed, other services may be adversely affected. You are advised to migrate data during off-peak hours.
- Data attributes, such as the owner, ACL, and checksum, cannot be migrated using CDM.
- This section uses Huawei Cloud CDM 2.9.2.200 as an example to describe how to migrate data. The operations may vary depending on the CDM version. For details, see the operation guide of the required version.
- For details about the data sources supported by CDM, see Supported Data Sources. If the data source is Apache HDFS, the recommended version is 2.8.X or 3.1.X. Before performing the migration, ensure that the data source supports migration.
Creating a Data Connection
- Log in to the CDM console.
- Create a CDM cluster. The security group, VPC, and subnet of the CDM cluster must be the same as those of the destination cluster so that the CDM cluster can communicate with the MRS cluster. (A connectivity check sketch follows this procedure.)
- On the Cluster Management page, locate the row containing the desired cluster and click Job Management in the Operation column.
- On the Links tab page, click Create Link.
- Create two HDFS links, one to the source cluster and the other to the destination cluster. For details, see Creating a Link Between CDM and a Data Source.
Set the connector type based on the actual cluster. For an MRS cluster, select MRS HDFS. For a self-built cluster, select Apache HDFS.
Figure 3 HDFS link
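Before creating the links, it can be worth confirming that the network path from the CDM cluster's VPC to each HDFS NameNode is open. A minimal sketch, run from a Linux host in the same VPC and subnet as the CDM cluster; the host name and port below are placeholders, and the actual NameNode RPC port should be taken from the dfs.namenode.rpc-address setting of your cluster:

```bash
# Replace source-namenode and 8020 with the actual NameNode host and RPC port.
# Note that ping may be blocked by security group rules even when the port is open.
ping -c 3 source-namenode

# Check whether the NameNode RPC port is reachable (uses bash's /dev/tcp).
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/source-namenode/8020' \
  && echo "NameNode port reachable" \
  || echo "NameNode port unreachable"
```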
Creating a Migration Job
- On the Table/File Migration tab page, click Create Job.
- Select the source and destination links.
- Job Name: Enter a custom job name, which can contain 1 to 256 characters consisting of letters, underscores (_), and digits.
- Source Link Name: Select the HDFS link of the source cluster. Data is exported from this link when the job is running.
- Destination Link Name: Select the HDFS link of the destination cluster. Data is imported to this link when the job is running.
- Configure source job parameters by referring to From HDFS. You can set Directory Filter and File Filter to specify the directories and files to be migrated.
- If you use CDM to perform full data migration, select REPLACE for Duplicate File Processing Method in the Destination Job Configuration area.
- If you use CDM to perform incremental data migration, select Skip for Duplicate File Processing Method in the Destination Job Configuration area.
For example, if you need to migrate files in the /user/test* folder, set File Format to Binary (a sketch for previewing which files such a wildcard matches follows this procedure).
Figure 4 Configuring job parameters
- Configure destination job parameters by referring to To HDFS.
- Click Next. The task configuration page is displayed.
- If you need to periodically migrate new data to the destination cluster, configure a scheduled task on this page. Alternatively, you can configure a scheduled task later by referring to step 3 in Checking the Migrated Files.
- If no new data needs to be migrated periodically, skip the configurations on this page and click Save.
Figure 5 Task configuration
- Choose Job Management and click the Table/File Migration tab. Click Run in the Operation column of the job to be executed to start migrating HDFS files. Wait until the job execution is complete.
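Because the HDFS shell supports similar wildcard matching, you can preview which files a filter such as /user/test* would cover before running the job. A small sketch, assuming the path is only an example:

```bash
# List the directories and files that match the wildcard on the source cluster.
hdfs dfs -ls /user/test*

# List recursively to see every file under the matching directories.
hdfs dfs -ls -R /user/test*
```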
Checking the Migrated Files
- Log in to the client node of the destination cluster as the client installation user.
- Run the following commands to check the files that have been migrated to the destination cluster:
cd Client installation directory
Load the environment variables:
source bigdata_env
If Kerberos authentication is enabled for the cluster (security mode), run the following command to authenticate the user. If Kerberos authentication is disabled (normal mode), skip this step.
kinit Component service user
Check the migrated files:
hdfs dfs -ls -h /user/
- (Optional) If new data in the source cluster needs to be periodically migrated to the destination cluster, configure a scheduled task for incremental data migration until all services are migrated to the destination cluster.
- On the Cluster Management page of the CDM console, choose Job Management and click the Table/File Migration tab.
- In the Operation column of the migration job, click More and select Configure Scheduled Execution.
- Enable the scheduled job execution function, configure the execution cycle based on service requirements, and set the end time of the validity period to the time after all services are migrated to the new cluster.
Figure 6 Configuring scheduled execution
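After an incremental run completes, you can spot-check that files recently added on the source now exist on the destination. A rough sketch, assuming the NameNode addresses, port, path, and cutoff date are placeholders for your own values:

```bash
# Find source files modified on or after a cutoff date (column 6 of -ls -R output)
# and check that each path also exists on the destination. Paths containing
# spaces are not handled by this simple loop.
for f in $(hdfs dfs -ls -R hdfs://source-namenode:8020/user/data \
             | awk '$6 >= "2024-01-01" {print $8}'); do
  p=${f#hdfs://source-namenode:8020}          # strip the source authority
  hdfs dfs -test -e "hdfs://dest-namenode:8020${p}" || echo "MISSING: ${p}"
done
```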
FAQs
If blocks cannot be obtained during HDFS file migration, handle the fault by referring to What Should I Do If an Error Message Is Displayed Indicating that the Block Is Missing During HDFS File Migration?