Using DistCp to Copy HDFS Data Across Clusters
Scenario
DistCp is a tool for replicating large volumes of data within a cluster or between clusters. It uses MapReduce tasks to copy data in a distributed manner.
Prerequisites
- The Yarn client or a client that contains Yarn has been installed. For example, the installation directory is /opt/client.
- The MRS cluster administrator has created the service users of each component based on service requirements. In security mode, a machine-machine user needs to have the keytab file downloaded, and a human-machine user must change the password upon the first login. This is not required in normal mode.
- If data needs to be copied between clusters, the data copy function must be enabled for both the clusters.
Procedure
- Log in to the node where the client is installed.
- Run the following command to go to the client installation directory:
cd /opt/client
- Run the following command to configure environment variables:
source bigdata_env
- If the cluster is in security mode, the user executing the DistCp command must belong to the supergroup group and must run the following command to perform user authentication. In normal mode, user authentication is not required.
kinit <component service user>
- Run the DistCp command. The following provides an example:
hadoop distcp hdfs://hacluster/source hdfs://hacluster/target
Common Usage of DistCp
- The following provides an example of the DistCp command. For details about related parameters, see Table 1.
hadoop distcp -numListstatusThreads 40 -update -delete -prbugpaxtq hdfs://cluster1/source hdfs://cluster2/target
In the preceding command:
- -numListstatusThreads 40 specifies that 40 threads are used to build the listing of files to be copied.
- -update -delete synchronizes files from the source path to the target path and deletes files from the target path that do not exist in the source path. For an incremental copy that keeps such files, remove -delete from the command.
- -prbugpaxtq, used together with -update, updates the status information of the copied files.
- hdfs://cluster1/source indicates the source path, and hdfs://cluster2/target indicates the target path.
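The incremental case described above can be sketched as follows. This is a minimal sketch that composes the command as a string so it can be inspected before execution; the paths are the document's placeholder examples, and running the command requires a configured client on a live cluster:

```shell
# Sketch: incremental sync that keeps files already deleted from the source.
# The source and target paths below are placeholders from the example above.
SRC="hdfs://cluster1/source"
DST="hdfs://cluster2/target"
# -delete is omitted, so files that exist only at the target are preserved.
CMD="hadoop distcp -numListstatusThreads 40 -update -prbugpaxtq $SRC $DST"
echo "$CMD"
# To execute on a live cluster: eval "$CMD"
```

Printing the command first is a convenient way to verify the option set before launching a potentially long-running MapReduce copy job.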
Table 1 DistCp command parameters
Parameter
Description
-p[rbugpcaxtq]
When this parameter is used together with -update, the status information of a copied file is updated even if the content of the copied file is not updated.
r: the replication factor; b: the block size; u: the owner; g: the group; p: the permissions; c: the checksum type; a: the access control lists (ACLs); x: the extended attributes (XAttrs); t: the timestamp; q: the quota information.
-i
Ignores the failure during the copying.
-log <logdir>
Specifies the log path.
-v
Specifies additional information in the log.
-m <num_maps>
Specifies the maximum number of concurrent copy tasks.
-numListstatusThreads
Specifies the number of threads for creating the list of copied files. This option speeds up the DistCp process.
-overwrite
Overwrites the file at the target path.
-update
A file at the target path will be updated if it is different in size and checksum from the file at the source path.
For details about the -update and -overwrite parameters, see update and overwrite Parameters.
-append
When this parameter is used together with -update, the content of the file at the source path is appended to the file at the target path.
-f <urilist_uri>
Uses the list of paths in the file <urilist_uri> as the source paths to copy.
-filters
Specifies a local file that contains multiple regular expressions. If a file to be copied matches a regular expression, the file will not be copied.
-async
Runs the DistCp command asynchronously.
-atomic {-tmp <tmp_dir>}
Performs an atomic copy. You can add a temporary directory during copying.
-bandwidth
Specifies the transmission bandwidth of each copy task, in MB/s.
-delete
Deletes files that exist in the target path but do not exist in the source path. This option is usually used with -update, indicating that files in the source path are synchronized to the target path and redundant files in the target path are deleted.
-diff <oldSnapshot> <newSnapshot>
Applies the differences between the source snapshots <oldSnapshot> and <newSnapshot> to the target path, which must already contain <oldSnapshot>.
-skipcrccheck
Skips the cyclic redundancy check (CRC) between the source file and the target file.
-strategy {dynamic|uniformsize}
Specifies the copy policy. The default policy is uniformsize, that is, each task copies the same number of bytes.
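As one illustration of combining the parameters in Table 1, the sketch below throttles a copy; the map count, bandwidth value, and paths are arbitrary placeholders, not tuning recommendations:

```shell
# Sketch: throttled copy using parameters from Table 1.
# 20 concurrent map tasks, 10 MB/s per task, dynamic load-balancing strategy.
SRC="hdfs://cluster1/source"   # placeholder source path
DST="hdfs://cluster2/target"   # placeholder target path
CMD="hadoop distcp -m 20 -bandwidth 10 -strategy dynamic -update $SRC $DST"
echo "$CMD"
# To execute on a live cluster: eval "$CMD"
```

The dynamic strategy lets faster map tasks pick up more file chunks, which can help when file sizes are skewed; uniformsize (the default) splits purely by byte count.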
- Data copy between clusters
For example, to copy data in /foo/bar of cluster1 to /bar/foo of cluster2, run the following command:
hadoop distcp hdfs://cluster1/foo/bar hdfs://cluster2/bar/foo
The network between cluster1 and cluster2 must be reachable, and the two clusters must use the same HDFS version or compatible HDFS versions.
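If the two clusters run incompatible HDFS versions, a common workaround described in the upstream Hadoop documentation (verify that it applies to your cluster versions) is to read the source over WebHDFS, which is compatible across HDFS versions. The NameNode host and HTTP port below are hypothetical placeholders:

```shell
# Assumption: the source cluster's NameNode exposes WebHDFS over HTTP.
# Host name and port are hypothetical; substitute your cluster's values.
SRC="webhdfs://nn1.cluster1.example:9870/foo/bar"
DST="hdfs://cluster2/bar/foo"
echo "hadoop distcp $SRC $DST"
# To execute on a live cluster: hadoop distcp "$SRC" "$DST"
```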
- Data copy of multiple source directories
For example, to copy folders a and b in cluster1 to /bar/foo in cluster2, run the following command:
hadoop distcp hdfs://cluster1/foo/a \
hdfs://cluster1/foo/b \
hdfs://cluster2/bar/foo
Alternatively, you can run the following command:
hadoop distcp -f hdfs://cluster1/srclist \
hdfs://cluster2/bar/foo
The content of srclist is as follows. Before running the DistCp command, upload the srclist file to HDFS.
hdfs://cluster1/foo/a
hdfs://cluster1/foo/b
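The srclist preparation can be sketched as follows; the file is built locally and then uploaded, using the same placeholder paths as the example above:

```shell
# Sketch: build the srclist file locally, one source URI per line.
printf '%s\n' 'hdfs://cluster1/foo/a' 'hdfs://cluster1/foo/b' > /tmp/srclist
cat /tmp/srclist
# Upload it to HDFS and run the copy (requires a live cluster):
# hdfs dfs -put -f /tmp/srclist hdfs://cluster1/srclist
# hadoop distcp -f hdfs://cluster1/srclist hdfs://cluster2/bar/foo
```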
update and overwrite Parameters
- -update copies files that exist only in the source path to the target path, and updates files that already exist in the target path.
When update is used, if the file to be copied already exists in the target path but the file content is different, the file content in the target path is updated.
- -overwrite overwrites the existing files in the target path.
When overwrite is used, if the file to be copied already exists in the target path, the file in the target path is overwritten even if its content is the same.
If files with the same name exist in multiple source paths, the DistCp command fails.
If neither update nor overwrite is used and the file to be copied already exists in the target path, the file will be skipped.
The following shows the differences between using no option and using either of the two options:
Assume that the structure of a file in the source path is as follows:
hdfs://cluster1/source/first/1
hdfs://cluster1/source/first/2
hdfs://cluster1/source/second/10
hdfs://cluster1/source/second/20
- Commands without options are as follows:
hadoop distcp hdfs://cluster1/source/first hdfs://cluster1/source/second hdfs://cluster2/target
The preceding command creates the first and second folders in the target path by default. Therefore, the copy result is as follows:
hdfs://cluster2/target/first/1
hdfs://cluster2/target/first/2
hdfs://cluster2/target/second/10
hdfs://cluster2/target/second/20
- The command with any one of the two options (for example, update) is as follows:
hadoop distcp -update hdfs://cluster1/source/first hdfs://cluster1/source/second hdfs://cluster2/target
The preceding command copies only the content in the source path to the target path. Therefore, the copy result is as follows:
hdfs://cluster2/target/1
hdfs://cluster2/target/2
hdfs://cluster2/target/10
hdfs://cluster2/target/20
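If you want to keep the first and second directory names while still using -update, one option (standard DistCp behavior, but verify on your version) is to pass their common parent directory as the single source, since -update copies the contents of a source directory under the target:

```shell
# Sketch: preserve subdirectory names under -update by copying the parent.
SRC="hdfs://cluster1/source"   # common parent of first/ and second/
DST="hdfs://cluster2/target"
CMD="hadoop distcp -update $SRC $DST"
echo "$CMD"
# With -update, the contents of $SRC are copied under $DST, so the target
# should end up as target/first/1, target/first/2, target/second/10, target/second/20.
```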
Helpful Links
- If an error is reported when you use the DistCp command to copy an empty folder, see An Error Is Reported When DistCP Is Used to Copy an Empty Folder.
- If the DistCp command fails to be executed in a cluster in security mode and an exception occurs, see What Should I Do If an Error Is Reported When I Run DistCp Commands?.
- If some files fail to be copied and the error message "Source and target differ in block-size. Use -pb to preserve block-sizes during copy." is displayed, see Error Message "Source and target differ in block-size" Is Displayed When DistCp Is Used to Copy Files Across Clusters.