Updated on 2023-12-20 GMT+08:00

Migrating HDFS Data to OBS

Scenarios

In the Huawei Cloud big data solution with decoupled storage and compute, OBS serves as a unified data lake to provide storage. If your data is still stored in local HDFS, migrate HDFS data to OBS first.

You can migrate data using either of the following methods: DistCp or CDM.

Migration Using DistCp

Hadoop DistCp (short for distributed copy) is a tool for large-scale inter- and intra-cluster copying in Hadoop. It uses MapReduce to implement file distribution, error handling and recovery, and reporting. It takes a list of files and directories as the input of map tasks, and each task copies a subset of the files in the source list.

Configuration

Configure OBS access by following the hadoop-huaweicloud installation and configuration instructions in Connecting Hadoop to OBS.
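As an illustration, a minimal core-site.xml for the hadoop-huaweicloud connector might look like the sketch below. The property names follow the connector's conventions; the endpoint and credential values are placeholders you must replace with your own.

```xml
<!-- Minimal core-site.xml sketch for the hadoop-huaweicloud connector.
     Endpoint and credential values below are placeholders. -->
<configuration>
  <property>
    <name>fs.obs.impl</name>
    <value>org.apache.hadoop.fs.obs.OBSFileSystem</value>
  </property>
  <property>
    <name>fs.obs.endpoint</name>
    <value>obs.your-region.myhuaweicloud.com</value>
  </property>
  <property>
    <name>fs.obs.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.obs.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>
```

With these properties in place, Hadoop clients can address the bucket directly through obs:// URIs, which is what the DistCp commands below rely on.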

Example

  1. View the files and directories in the HDFS directory to migrate (/data/sample is used as an example):

    hadoop fs -ls hdfs:///data/sample

  2. Migrate all files and directories inside /data/sample to the data/sample directory in the OBS bucket obs-bigdata-posix-bucket:

    hadoop distcp hdfs:///data/sample obs://obs-bigdata-posix-bucket/data/sample

  3. View the file copies:

    hadoop fs -ls obs://obs-bigdata-posix-bucket/data/sample
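If the source directory continues to receive data, the copy can be re-run incrementally. As a sketch reusing the paths from the example above (-update and -m are standard DistCp options; the map-task count of 10 is an arbitrary illustration):

```shell
# Re-run the copy, transferring only files that are new or changed
# since the previous run; -m caps the number of parallel map tasks.
hadoop distcp -update -m 10 \
  hdfs:///data/sample \
  obs://obs-bigdata-posix-bucket/data/sample
```

Note that with -update, DistCp copies the contents of the source directory into the target directory rather than the directory itself, so verify the resulting layout with hadoop fs -ls after the first incremental run.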

Migration Using CDM

Cloud Data Migration (CDM) enables batch data migration between homogeneous and heterogeneous data sources, providing flexible data flows. Supported data sources include relational databases, data warehouses, NoSQL stores, and big data cloud services.

For details, see What Is CDM?