Migrating HDFS Data from a Self-Built Cluster to an MRS Cluster with DistCp
Scenarios
Distributed Copy (DistCp) is a distributed data replication tool provided by Hadoop. It is designed to efficiently transfer large volumes of data between HDFS clusters, between HDFS and the local file system, or even within the same cluster. DistCp leverages Hadoop's distributed computing capabilities and the parallel processing framework of MapReduce to enable highly efficient migration of large-scale data.
You can run the distcp command to migrate data stored in HDFS of a self-built Hadoop cluster to an MRS cluster.
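In its simplest form, DistCp takes a source URI and a destination URI. The addresses below are placeholders, not the clusters used later in this guide:

```shell
# Minimal DistCp invocation: copy a directory tree between two clusters.
# source-nn and dest-nn are placeholder NameNode hosts.
SRC="hdfs://source-nn:8020/user/data"
DST="hdfs://dest-nn:8020/user/data"
hadoop distcp "$SRC" "$DST"
echo "requested copy: $SRC -> $DST"
```

DistCp submits a MapReduce job, so the copy runs in parallel across the cluster rather than through a single node.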
Solution Architecture
Figure 1 shows the process for migrating data from a Hadoop cluster to an MRS cluster.
- Full data migration
- Create a snapshot for the directory to be migrated in the source Hadoop cluster.
- On the MRS cluster client, run the DistCp command for full data migration and set the destination path to OBS or HDFS.
- Incremental data migration
- Create a snapshot for the directory that requires incremental migration in the source Hadoop cluster.
- On the MRS cluster client, run the DistCp command with the -update and -delete parameters to perform incremental migration. Set the destination path to HDFS.
Data migration supports different networking types, such as the public network, VPN, and Direct Connect. Select a networking type as required to ensure that the source and destination networks can communicate with each other.
Migration Network Type | Advantage | Disadvantage
---|---|---
Direct Connect | Dedicated, stable bandwidth with low latency; suitable for migrating large volumes of data | Higher cost; the connection must be provisioned in advance
VPN | Encrypted transmission over the public network at a relatively low cost | Bandwidth and stability are limited by the public network
Public IP address | Easy to set up; no extra network resources required | Data travels over the public Internet, so bandwidth, stability, and security are not guaranteed
Notes and Constraints
- If the migration destination is an OBS bucket, DistCp does not support incremental migration because OBS does not support snapshots.
- You need to learn about the customer's services and communicate with the customer to determine the migration time window. You are advised to perform the migration during off-peak hours.
Prerequisites
- You have created an MRS cluster that contains the Hadoop service.
- You have prepared an ECS execution node and installed the MRS cluster client on the node. The node and the MRS cluster are in the same VPC and security group.
- The source cluster, destination cluster, and the node where the MRS cluster client is installed can communicate with each other. (Direct Connect is recommended.)
Migrating All Data Using DistCp
- Install the MRS cluster client on the prepared ECS node.
For details about how to install the client, see Installing an MRS Cluster Client. For example, the client installation directory is /opt/client.
- Obtain the IP address and port number of the active HDFS NameNode in the source cluster.
The query method varies depending on the self-built clusters. You can contact the cluster administrator or log in to the cluster management page to query the information.
Figure 2 Checking the HDFS NameNode port
- Log in to MRS Manager of the destination cluster and check the IP address and port number of the active HDFS NameNode.
- Choose Cluster > Services > HDFS and click Instances. On the displayed page, view the IP address of the NameNode (hacluster,Active) instance in the instance list.
- Click Configurations, and search for and view the NameNode port parameter dfs.namenode.rpc.port.
- Obtain the HDFS file directory containing the data to be migrated from the source cluster.
In this example, the directory to be migrated is /benchmarks/TestDFSIO/io_data.
- Configure user access to the source HDFS on the ECS execution node.
vim /etc/profile
Configure environment variables.
export HADOOP_USER_NAME=hdfs
Load the environment variables.
source /etc/profile
- Create a snapshot for the specified HDFS directory in the source cluster. The snapshot is saved in the .snapshot subdirectory of the snapshot directory.
- Enable the snapshot function for the specified directory in the source cluster.
hdfs dfsadmin -allowSnapshot Snapshot directory
Source cluster directory format: hdfs://IP address of the active NameNode:Port/Directory
For example, run the following command:
hdfs dfsadmin -allowSnapshot hdfs://192.168.1.100:8020/benchmarks/TestDFSIO/io_data
- Create a snapshot.
hdfs dfs -createSnapshot Snapshot directory Snapshot file name
For example, run the following command:
hdfs dfs -createSnapshot hdfs://192.168.1.100:8020/benchmarks/TestDFSIO/io_data snapshot01
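The two snapshot steps above can be sketched as one script; the address, path, and snapshot name are the ones used in this example:

```shell
# Enable snapshots on the source directory, then create one.
# 192.168.1.100:8020 and the path below are this example's values.
SRC_DIR="hdfs://192.168.1.100:8020/benchmarks/TestDFSIO/io_data"

hdfs dfsadmin -allowSnapshot "$SRC_DIR"         # one-time: allow snapshots on this directory
hdfs dfs -createSnapshot "$SRC_DIR" snapshot01  # creates $SRC_DIR/.snapshot/snapshot01

SNAP_PATH="$SRC_DIR/.snapshot/snapshot01"
echo "Snapshot available at: $SNAP_PATH"
```

The .snapshot path printed at the end is the source path passed to the distcp command in the next step.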
- Go to the MRS cluster client directory.
cd /opt/client
source bigdata_env
User authentication is required for MRS clusters with Kerberos authentication enabled.
kinit HDFS service user
- Run the distcp command on the node where the MRS cluster client is installed to migrate data.
hadoop distcp -prbugpcxt Path for storing snapshots in the source cluster Data storage path in the destination cluster
For example, run the following command:
hadoop distcp -prbugpcxt hdfs://192.168.1.100:8020/benchmarks/TestDFSIO/io_data/.snapshot/snapshot01/* hdfs://192.168.1.200:8020/hdfs/
Wait until the MapReduce task is complete.
- After the full migration is complete, check whether the files are successfully migrated to the destination cluster.
hdfs dfs -ls HDFS file directory
For example, all files in the source cluster are as follows.
All files in the destination cluster are as follows.
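Beyond listing files, a quick consistency check is to compare file counts and total bytes with hdfs dfs -count. A minimal sketch using this example's paths:

```shell
# `hdfs dfs -count <path>` prints: DIR_COUNT FILE_COUNT CONTENT_SIZE PATH.
# extract_stats keeps only FILE_COUNT and CONTENT_SIZE for comparison.
extract_stats() {
  echo "$1" | awk '{print $2, $3}'
}

SRC="hdfs://192.168.1.100:8020/benchmarks/TestDFSIO/io_data"
DST="hdfs://192.168.1.200:8020/hdfs"

src_stats=$(extract_stats "$(hdfs dfs -count "$SRC")")
dst_stats=$(extract_stats "$(hdfs dfs -count "$DST")")

if [ "$src_stats" = "$dst_stats" ]; then
  echo "Consistent: $src_stats (files, bytes)"
else
  echo "Mismatch: source=[$src_stats] destination=[$dst_stats]"
fi
```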
Migrating Incremental Data Using DistCp
- Create a snapshot for the directory that requires incremental migration in the source cluster.
hdfs dfs -createSnapshot Snapshot directory Snapshot file name
For example, run the following command:
hdfs dfs -createSnapshot hdfs://192.168.1.100:8020/benchmarks/TestDFSIO/io_data snapshot02
- Run the distcp command on the node where the MRS cluster client is installed to migrate incremental data.
hadoop distcp -prbugpcxt Path for storing snapshots in the source cluster Data storage path in the destination cluster
For example, run the following command:
hadoop distcp -prbugpcxt -update -delete hdfs://192.168.1.100:8020/benchmarks/TestDFSIO/io_data/.snapshot/snapshot02/* hdfs://192.168.1.200:8020/hdfs
Wait until the MapReduce task is complete.
- After migration, check the number and size of files in the destination and compare them with the source to ensure consistency.
hdfs dfs -ls HDFS file directory
For example, incremental files in the source cluster are as follows.
Incremental files in the destination cluster are as follows.
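Individual files can also be spot-checked with hdfs dfs -checksum. Because the migration preserved block size and checksum type (the b and c flags within -prbugpcxt), checksums of identical files should match. The file name below is hypothetical; replace it with a real file from your directory:

```shell
# `hdfs dfs -checksum <file>` prints the path, algorithm, and checksum;
# the checksum is the last field. test_io_0 is a hypothetical file name.
last_field() {
  echo "$1" | awk '{print $NF}'
}

src_sum=$(last_field "$(hdfs dfs -checksum hdfs://192.168.1.100:8020/benchmarks/TestDFSIO/io_data/test_io_0)")
dst_sum=$(last_field "$(hdfs dfs -checksum hdfs://192.168.1.200:8020/hdfs/test_io_0)")

if [ "$src_sum" = "$dst_sum" ]; then
  echo "checksums match"
else
  echo "checksums differ"
fi
```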
FAQs About Data Migration Using DistCp
- Version differences
If the Hadoop version of the source cluster matches the major version of the MRS cluster (minor versions may differ), you can use the following command to synchronize service data from the source cluster to the MRS cluster:
hadoop distcp -prbugpcaxt hdfs://Active NameNode IP address:Port/Snapshot path Save path
If the Hadoop versions of the source cluster and the MRS cluster differ significantly, use WebHDFS or HFTP instead of HDFS and run one of the following commands:
hadoop distcp -prbugpcaxt webhdfs://Active NameNode IP address:Port/Snapshot path Save path
Or
hadoop distcp -prbugpcaxt hftp://Active NameNode IP address:Port/Snapshot path Save path
- Parameters of the DistCp commands
- -p[rbugpcaxt]: specifies which file attributes to preserve: replication (r), block size (b), user (u), group (g), permission (p), checksum type (c), ACL (a), XAttr (x), and timestamp (t).
- IP address: IP address of the active NameNode in the source cluster.
- Port: The default open-source HDFS RPC port is 8020. The default WebHDFS and HFTP port is 50070 in Hadoop 2.x; in Hadoop 3.x, the default WebHDFS port is 9870.
- If the source cluster provides a unified domain name for access, you can change Active NameNode IP address:Port to the domain name URL.
- DistCp permission control
When DistCp is used for migration, file or directory copy operations may fail due to insufficient permissions.
Ensure that the user executing the DistCp command has sufficient permissions to read from the source path and write to the destination path. You can use the -p option to preserve the permission information.
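As a rough pre-flight check, you can verify both sides before starting the copy. The paths are this example's; note that HADOOP_USER_NAME is honored only on clusters without Kerberos, so on Kerberos-enabled clusters authenticate with kinit instead:

```shell
# Verify the migration user can read the source and write the destination.
# HADOOP_USER_NAME applies only to clusters without Kerberos authentication.
export HADOOP_USER_NAME=hdfs
SRC_DIR="hdfs://192.168.1.100:8020/benchmarks/TestDFSIO/io_data"
DST_DIR="hdfs://192.168.1.200:8020/hdfs"

hdfs dfs -ls "$SRC_DIR" >/dev/null && echo "source readable"
hdfs dfs -touchz "$DST_DIR/.perm_check" && echo "destination writable"
hdfs dfs -rm -skipTrash "$DST_DIR/.perm_check" >/dev/null 2>&1

echo "checked as user: $HADOOP_USER_NAME"
```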
- Timeout interval change using the DistCp command
When you run the DistCp command, if some files being copied are particularly large, you are advised to increase the timeout settings for the MapReduce jobs that execute the copy tasks.
You can specify mapreduce.task.timeout in the DistCp command to increase the timeout interval.
For example, run the following command to change the timeout interval to 30 minutes:
hadoop distcp -Dmapreduce.task.timeout=1800000 hdfs://cluster1/source hdfs://cluster2/target