Updated on 2025-10-11 GMT+08:00

Using DistCp to Copy HDFS Data Across Clusters

Scenario

DistCp is a tool used to replicate large amounts of data between clusters or within a cluster. It uses MapReduce tasks to copy large amounts of data in a distributed manner.

Prerequisites

  • The Yarn client or a client that contains Yarn has been installed. For example, the client installation directory is /opt/client.
  • Service users of each component are created by the MRS cluster administrator based on service requirements. In security mode, machine-machine users need to download the keytab file, and human-machine users must change the password upon the first login. This is not required in normal mode.
  • If data needs to be copied between clusters, the data copy function must be enabled for both clusters.

Procedure

  1. Log in to the node where the client is installed.
  2. Run the following command to go to the client installation directory:

    cd /opt/client

  3. Run the following command to configure environment variables:

    source bigdata_env

  4. If the cluster is in security mode, the user who runs the DistCp command must belong to the supergroup group, and the user must run the following command to perform user authentication. In normal mode, user authentication is not required.

    kinit Component service user

  5. Run the DistCp command. The following provides an example (a consolidated sketch of the whole procedure follows this list):

    hadoop distcp hdfs://hacluster/source hdfs://hacluster/target
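
For reference, the following is a consolidated sketch of the preceding steps on one client node. The user distcp_user is a hypothetical component service user, and the source and target paths are examples only; replace them with your own values.

    cd /opt/client
    source bigdata_env
    # Security mode only; skip this step in normal mode.
    kinit distcp_user
    hadoop distcp hdfs://hacluster/source hdfs://hacluster/target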

Common Usage of DistCp

  • The following provides an example of the DistCp command. For details about related parameters, see Table 1.
    hadoop distcp -numListstatusThreads 40 -update -delete -prbugpaxtq hdfs://cluster1/source hdfs://cluster2/target

    In the preceding command:

    • -numListstatusThreads 40 specifies that 40 threads are used to create the list of files to be copied.
    • -update -delete synchronizes files from the source path to the target path and deletes files from the target path that do not exist in the source path. To copy only incremental files, remove -delete from the command (see the incremental data copy sketch later in this section).
    • -prbugpaxtq, which works with -update, updates the status information (such as ownership, permissions, and timestamps) of the copied files.
    • hdfs://cluster1/source indicates the source path, and hdfs://cluster2/target indicates the target path.
    Table 1 DistCp command parameters

    -p[rbugpcaxtq]
      When this parameter is used together with -update, the status information of a copied file is updated even if the content of the file has not changed.
      r: replication factor; b: block size; u: owner; g: group; p: permissions; c: checksum type; a: ACLs; x: XAttrs; t: timestamp; q: quota information.

    -i
      Ignores failures during copying.

    -log <logdir>
      Specifies the log path.

    -v
      Logs additional information in the DistCp log.

    -m <num_maps>
      Specifies the maximum number of concurrent copy (map) tasks.

    -numListstatusThreads
      Specifies the number of threads used to create the list of files to be copied. This option speeds up the DistCp process.

    -overwrite
      Overwrites files at the target path.

    -update
      Updates a file at the target path if it differs in size or checksum from the file at the source path.
      For details about the -update and -overwrite parameters, see update and overwrite Parameters.

    -append
      When used together with -update, appends the incremental content of the source file to the file at the target path.

    -f <urilist_uri>
      Uses the paths listed in the <urilist_uri> file as the list of files to be copied.

    -filters
      Specifies a local file that contains a list of regular expressions. A file to be copied is skipped if it matches any of the regular expressions.

    -async
      Runs the DistCp command asynchronously.

    -atomic {-tmp <tmp_dir>}
      Performs an atomic copy. You can specify a temporary directory to be used during the copy.

    -bandwidth
      Specifies the transmission bandwidth of each copy task, in MB/s.

    -delete
      Deletes files that exist in the target path but do not exist in the source path. This option is usually used with -update, which synchronizes files from the source path to the target path and deletes redundant files from the target path.

    -diff <oldSnapshot> <newSnapshot>
      Copies the differences between the old and new snapshots from the source path to the target path, which must already contain the data of the old version.

    -skipcrccheck
      Skips the cyclic redundancy check (CRC) between the source file and the target file.

    -strategy {dynamic|uniformsize}
      Specifies the copy strategy. The default strategy is uniformsize, that is, each map task copies approximately the same number of bytes.

  • Data copy between clusters

    For example, to copy data in /foo/bar of cluster1 to /bar/foo of cluster2, run the following command:

    hadoop distcp hdfs://cluster1/foo/bar hdfs://cluster2/bar/foo

    The network between cluster1 and cluster2 must be reachable, and the two clusters must use the same HDFS version or compatible HDFS versions.

  • Data copy of multiple source directories

    For example, to copy folders a and b in cluster1 to /bar/foo in cluster2, run the following command:

    hadoop distcp hdfs://cluster1/foo/a \
    hdfs://cluster1/foo/b \
    hdfs://cluster2/bar/foo

    Alternatively, you can run the following command:

    hadoop distcp -f hdfs://cluster1/srclist \
    hdfs://cluster2/bar/foo

    Before running the DistCp command, upload the srclist file to HDFS. The content of srclist is as follows:

    hdfs://cluster1/foo/a 
    hdfs://cluster1/foo/b
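
    A minimal sketch of preparing and using the srclist file from the client node, reusing the example paths above, is as follows:

    # Create the source list locally, then upload it to HDFS before running DistCp.
    echo "hdfs://cluster1/foo/a" > srclist
    echo "hdfs://cluster1/foo/b" >> srclist
    hdfs dfs -put srclist hdfs://cluster1/srclist
    hadoop distcp -f hdfs://cluster1/srclist hdfs://cluster2/bar/foo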
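
  • Incremental data copy

    As noted in the first example, removing -delete from the -update command copies only new or changed files and leaves extra files at the target untouched. The following is a sketch reusing the example paths from this section; the -bandwidth value is only an illustration:

    # Copy new or changed files only; do not delete extra files at the target.
    hadoop distcp -numListstatusThreads 40 -update -prbugpaxtq hdfs://cluster1/source hdfs://cluster2/target
    # Optionally limit each copy task to about 10 MB/s.
    hadoop distcp -update -bandwidth 10 hdfs://cluster1/source hdfs://cluster2/target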

update and overwrite Parameters

  • -update copies files that exist only in the source path to the target path and updates files that have already been copied to the target path.

    When -update is used, if a file to be copied already exists in the target path but its content is different, the content of the file in the target path is updated.

  • -overwrite overwrites the existing files in the target path.

    When -overwrite is used, if a file to be copied already exists in the target path, the file in the target path is overwritten even if its content is the same.

If files with the same name exist in multiple source paths, the DistCp command fails.

If neither update nor overwrite is used and the file to be copied already exists in the target path, the file will be skipped.

The following shows the differences between using no option and using either of the two options:

Assume that the directory structure in the source path is as follows:

hdfs://cluster1/source/first/1 
hdfs://cluster1/source/first/2 
hdfs://cluster1/source/second/10 
hdfs://cluster1/source/second/20
  • The command without either of the two options is as follows:
    hadoop distcp hdfs://cluster1/source/first hdfs://cluster1/source/second hdfs://cluster2/target
    The preceding command creates the first and second folders in the target path by default. Therefore, the copy result is as follows:
    hdfs://cluster2/target/first/1 
    hdfs://cluster2/target/first/2 
    hdfs://cluster2/target/second/10 
    hdfs://cluster2/target/second/20
  • The command with either of the two options (for example, -update) is as follows:
    hadoop distcp -update hdfs://cluster1/source/first hdfs://cluster1/source/second hdfs://cluster2/target
    The preceding command copies only the content in the source path to the target path. Therefore, the copy result is as follows:
    hdfs://cluster2/target/1 
    hdfs://cluster2/target/2 
    hdfs://cluster2/target/10 
    hdfs://cluster2/target/20
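
To check which of the two directory layouts was produced, you can list the target path recursively. This is only a verification sketch using the example paths above:

    hdfs dfs -ls -R hdfs://cluster2/target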

Helpful Links