Help Center/ MapReduce Service/ Component Operation Guide (Normal)/ Using HDFS/ HDFS O&M Management/ Using DistCp to Copy HDFS Data Across Clusters

Updated on 2025-10-23 GMT+08:00

Using DistCp to Copy HDFS Data Across Clusters

Scenario

DistCp is a tool used to perform large-amount data replication between clusters or in a cluster. It uses MapReduce tasks to implement distributed copy of a large amount of data.

Prerequisites

The Yarn client or a client that contains Yarn has been installed. For example, the installation directory is /opt/client.
Service users of each component are created by the MRS cluster administrator based on service requirements. In security mode, machine-machine users need to download the keytab file. A human-machine user must change the password upon the first login. (Not involved in normal mode)
If data needs to be copied between clusters, the data copy function must be enabled for both the clusters.

Procedure

Log in to the node where the client is installed.
Run the following command to go to the client installation directory:
```
cd /opt/client
```
Run the following command to configure environment variables:
```
source bigdata_env
```
If the cluster is in security mode, the user group to which the user executing the DistCp command belongs must be supergroup and the user run the following command to perform user authentication. In normal mode, user authentication is not required.
```
kinit Component service user
```

Run the DistCp command. The following provides an example:

hadoop distcp hdfs://hacluster/source hdfs://hacluster/target

Common Usage of DistCp

The following provides an example of DistCp command. For details about related parameters, see Table 1.

hadoop distcp -numListstatusThreads 40 -update -delete -prbugpaxtq hdfs://cluster1/source hdfs://cluster2/target

In the preceding command:

-numListstatusThreads specifies the number of threads for creating the list of 40 copied files.

-update -delete synchronizes files from the source path to the target path and deletes unnecessary files from the target path. To copy incremental files, delete -delete from the command.

-prbugpaxtq, which works with -update, updates the status information of the copied files.

hdfs://cluster1/source indicates the source path, and hdfs://cluster2/target indicates the target path.

**Table 1** DistCp command parameters
Parameter	Description
-p[rbugpcaxtq]	When this parameter is used together with -update, the status information of a copied file is updated even if the content of the copied file is not updated. r: specifies the number of copies; b: specifies the size of a block; u: specifies the user to which the files belong; g: specifies the user group to which the user belongs; p: specifies the permission; c: specifies the check and the type; a: specifies the access control; t: specifies the timestamp; q: specifies the quota information.
-i	Ignores the failure during the copying.
-log <logdir>	Specifies the log path.
-v	Specifies additional information in the log.
-m <num_maps>	Specifies the maximum number of concurrent copy tasks.
-numListstatusThreads	Specifies the number of threads for creating the list of copied files. This option speeds up the DistCp process.
-overwrite	Overwrites the file at the target path.
-update	A file at the target path will be updated if it is different in size and checksum from the file at the source path. For details about the -update and -overwrite parameters, see update and overwrite Parameters.
-append	When this parameter is used together with the -update, the content of the file at the source path is added to the file at the target path.
-f <urilist_uri>	Copies the content of the <urilist_uri> files into the list of files to be copied.
-filters	Specifies a local file that contains multiple regular expressions. If a file to be copied matches a regular expression, the file will not be copied.
-async	Runs DistCp command asynchronously.
-atomic {-tmp <tmp_dir>}	Performs an atomic copy. You can add a temporary directory during copying.
-bandwidth	Specifies the transmission bandwidth of each copy task, in MB/s.
-delete	Deletes files that exist in the target path but do not exist in the source path. This option is usually used with -update, indicating that files in the source path are synchronized to the target path and redundant files in the target path are deleted.
-diff <oldSnapshot> <newSnapshot>	Copies the differences between the old and new versions to a file of the old version at the target location.
-skipcrccheck	Determines whether to skip the cyclic redundancy check (CRC) between the source file and the target file.
-strategy {dynamic\|uniformsize}	Specifies the copy policy. The default policy is uniformsize, that is, each task copies the same number of bytes.

Data copy between clusters
For example, to copy data in /foo/bar of cluster1 to /bar/foo of cluster2, run the following command:
```
hadoop distcp hdfs://cluster1/foo/bar hdfs://cluster2/bar/foo
```
The network between cluster1 and cluster2 must be reachable, and the two clusters must use the same HDFS version or compatible HDFS versions.
Data copy of multiple source directories
For example, to copy folders a and b in cluster1 to /bar/foo in cluster2, run the following command:
```
hadoop distcp hdfs://cluster1/foo/a \
hdfs://cluster1/foo/b \
hdfs://cluster2/bar/foo
```
Alternatively, you can run the following command:
```
hadoop distcp -f hdfs://cluster1/srclist \
hdfs://cluster2/bar/foo
```
The content of srclist is as follows. Before running the DistCp command, upload the srclist file to HDFS.
```
hdfs://cluster1/foo/a 
hdfs://cluster1/foo/b
```

update and overwrite Parameters

-update copies files that only exist in the source path to the target path or updates the copied files in the target path.
When update is used, if the file to be copied already exists in the target path but the file content is different, the file content in the target path is updated.
-overwrite overwrites the existing files in the target path.
When overwrite is used, if the file to be copied already exists in the target path, the file in the target path is still overwritten.

If files with the same name exist in multiple source paths, the DistCp command fails.

If neither update nor overwrite is used and the file to be copied already exists in the target path, the file will be skipped.

The following shows the differences between using no option and using either of the two options:

Assume that the structure of a file in the source path is as follows:

hdfs://cluster1/source/first/1 
hdfs://cluster1/source/first/2 
hdfs://cluster1/source/second/10 
hdfs://cluster1/source/second/20

Commands without options are as follows:

hadoop distcp hdfs://cluster1/source/first hdfs://cluster1/source/second  hdfs://cluster2/target

The preceding command creates the first and second folders in the target path by default. Therefore, the copy result is as follows:

hdfs://cluster2/target/first/1 
hdfs://cluster2/target/first/2 
hdfs://cluster2/target/second/10 
hdfs://cluster2/target/second/20

The command with any one of the two options (for example, update) is as follows:

hadoop distcp -update hdfs://cluster1/source/first hdfs://cluster1/source/second  hdfs://cluster2/target

The preceding command copies only the content in the source path to the target path. Therefore, the copy result is as follows:

hdfs://cluster2/target/1 
hdfs://cluster2/target/2 
hdfs://cluster2/target/10 
hdfs://cluster2/target/20

Helpful Links

If the DistCp command fails to be executed in a cluster in security mode and an exception occurs, see What Should I Do If an Error Is Reported When I Run DistCp Commands?.
The following lists common problems occurred during DistCp command execution:

How Do I Copy Large-Size Files Using the DistCp Command?

When you run the DistCp command to copy large-size files, you are advised to change the timeout period of MapReduce for executing copy tasks. It can be implemented by specifying the mapreduce.task.timeout in the DistCp command. For example, to change the timeout period to 30 minutes, run the following command:

hadoop distcp -Dmapreduce.task.timeout=1800000 hdfs://cluster1/source hdfs://cluster2/target

Or, you can also use filters to exclude the large files out of the copy process. The command example is as follows:

hadoop distcp -filters /opt/client/filterfile hdfs://cluster1/source hdfs://cluster2/target

In the preceding command, filterfile indicates a local file, which contains multiple expressions used to match the path of a file that is not copied. The following is an example:

.*excludeFile1.*
.*excludeFile2.*

"java.lang.OutOfMemoryError" Is Reported When the DistCp Command Is Executed

What can I do if the DistCp command unexpectedly quits, and the error message "java.lang.OutOfMemoryError" is displayed?

This error occurs because the memory required for running the copy tasks exceeds the preset memory upper limit (128 MB by default). You can change the memory upper limit of the client by modifying CLIENT_GC_OPTS in <Client installation path>/HDFS/component_env. For example, to set the memory upper limit to 1 GB, configure the following parameter:

CLIENT_GC_OPTS="-Xmx1G"

Then, run the following command to apply the configurations:

source <Client installation path>//bigdata_env

"Too many chunks created with splitRatio" Is Reported When the DistCp Command Is Executed Using the Dynamic Policy

What can I do when the dynamic policy is used to execute the DistCp command, the command exits abnormally, and "Too many chunks created with splitRatio" error message is displayed?

The error occurs because the value of distcp.dynamic.max.chunks.tolerable (default value: 20,000) is less than the value of distcp.dynamic.split.ratio (default value: 2) multiplied by the number of Maps. This error occurs when the number of Maps exceeds 10,000. You can use the -m parameter to reduce the number of Maps to less than 10,000.

hadoop distcp -strategy dynamic -m 9500 hdfs://cluster1/source hdfs://cluster2/target

Alternatively, you can use the -D parameter to set distcp.dynamic.max.chunks.tolerable to a larger value.

hadoop distcp -Ddistcp.dynamic.max.chunks.tolerable=30000 -strategy dynamic hdfs://cluster1/source hdfs://cluster2/target

Parent topic: HDFS O&M Management

Previous topic: Configuring the Maximum Lifetime of an HDFS Token

Next topic: Configuring the NFS Server to Store NameNode Metadata

Feedback

Was this page helpful?

Helpful Not helpful

Provide feedback

Thank you very much for your feedback. We will continue working to improve the documentation.

The system is busy. Please try again later.