Updated on 2022-12-13 GMT+08:00

Migrating Data from Hadoop to MRS

Scenario

This section describes how to migrate data from on-premises IDCs or public cloud Hadoop clusters to Huawei Cloud MRS. The data volume can be tens of terabytes or less.

This section uses Huawei Cloud CDM 2.9.1.200 as an example to describe how to migrate data.

For details about the data sources supported by CDM, see Supported Data Sources. If the data source is Apache HDFS, the recommended version is 2.8.X or 3.1.X. Before performing the migration, ensure that the data source supports migration.

Figure 1 Hadoop data migration

Solution Advantages

  • Easy to use: The wizard-based development interface requires no programming. You can configure a migration job in minutes.
  • High migration efficiency: Data migration and transmission performance is enhanced by a distributed computing framework, and data write performance for specific data sources is optimized to further improve migration efficiency.
  • Real-time monitoring: During the migration, the job is monitored in real time, and alarms and notifications are generated automatically.

Impact on the System

Migrating large volumes of data has high requirements on network communication. When a migration task is executed, other services may be affected. You are advised to migrate data during off-peak hours.

Procedure

  1. Log in to the CDM management console.
  2. Create a CDM cluster. The security group, VPC, and subnet of the CDM cluster must be the same as those of the destination cluster to ensure that the CDM cluster can communicate with the MRS cluster.
  3. On the Cluster Management page, locate the row containing the desired cluster and click Job Management in the Operation column.
  4. On the Links tab page, click Create Link.
  5. Add two HDFS links to the source cluster and destination cluster, respectively. For details, see Creating Links.

    Select a link type based on the actual cluster. For an MRS cluster, select MRS HDFS. For a self-built cluster, select Apache HDFS.

    Figure 2 HDFS link
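    If link creation fails with a connectivity error, a quick TCP reachability check from a host in the CDM cluster's subnet can help narrow down the cause. This is a sketch: the host name is a placeholder, and 8020 is only a common HDFS NameNode RPC default; substitute your cluster's actual NameNode address and port.

```shell
# Reachability sketch: can this host open a TCP connection to the NameNode?
# namenode.example.com and port 8020 are placeholders; substitute your own.
host=namenode.example.com
port=8020
if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
  echo "reachable: $host:$port"
else
  echo "unreachable: $host:$port"
fi
```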

  6. On the Table/File Migration tab page, click Create Job.
  7. Select the source and destination links.

    • Job Name: Enter a custom job name of 1 to 256 characters, consisting of letters, digits, and underscores (_).
    • Source Link Name: Select the HDFS link of the source cluster. Data is exported from this link when the job is running.
    • Destination Link Name: Select the HDFS link of the destination cluster. Data is imported to this link when the job is running.

  8. Configure source job parameters by referring to From HDFS. You can set Directory Filter and File Filter to specify the directories and files to be migrated. For example, if Directory Filter is set to test*, files in the /user/test* directories will be migrated. In this scenario, File Format is fixed to Binary.

    Figure 3 Configuring job parameters
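    The wildcard semantics of these filters can be illustrated with shell glob matching. This is only an approximation of CDM's filter behavior, and the directory names under /user are made up for the example:

```shell
# Illustration: a filter of test* selects only directories whose names match
# the pattern. The directory names under /user are hypothetical.
filter="test*"
for dir in test1 test_2021 testdata prod archive; do
  case "$dir" in
    $filter) echo "migrate /user/$dir" ;;
    *)       echo "skip    /user/$dir" ;;
  esac
done
```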

  9. Configure destination job parameters by referring to To HDFS.
  10. Click Next. The task configuration page is displayed.

    • If you need to periodically migrate new data to the destination cluster, configure a scheduled task on this page. Alternatively, you can configure a scheduled task later by referring to step 14.
    • If no new data needs to be migrated periodically, skip the configurations on this page and click Save.
      Figure 4 Task configuration

  11. Choose Job Management and click the Table/File Migration tab. Click Run in the Operation column of the job to be executed to start migrating HDFS data. Wait until the job execution is complete.
  12. Log in to the active management node of the destination cluster.
  13. Run the hdfs dfs -ls -h /user/ command to view the migrated files in the destination cluster.
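    Beyond listing the directory, you can compare file counts and total bytes between the source and destination clusters. A sketch, assuming the migrated data lives under a hypothetical /user/test path; run it on a node with an HDFS client, then run the same commands on the source cluster and compare:

```shell
# Post-migration spot check (the /user/test path is an assumption).
# hdfs dfs -count prints directory count, file count, total bytes, and path;
# hdfs dfs -du -s -h prints the aggregate size in human-readable form.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -count /user/test
  hdfs dfs -du -s -h /user/test
else
  echo "hdfs client not found; run this on a cluster node"
fi
```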
  14. (Optional) If new data in the source cluster needs to be periodically migrated to the destination cluster, configure a scheduled task for incremental data migration until all services are migrated to the destination cluster.

    1. On the Cluster Management page of the CDM console, choose Job Management and click the Table/File Migration tab.
    2. In the Operation column of the migration job, click More and select Configure Scheduled Execution.
    3. Enable scheduled job execution, set the execution cycle based on service requirements, and set the end time of the validity period to a time after all services have been migrated to the new cluster.
      Figure 5 Scheduling job execution
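    The incremental idea behind the scheduled job, picking up only files that appeared after the previous run, can be sketched locally with find -newer. CDM's scheduler performs this selection for you; the marker file and temporary directory below are hypothetical and serve only to illustrate the concept:

```shell
# Local illustration of incremental selection. A marker file records the time
# of the previous run; only files strictly newer than it are picked up.
workdir=$(mktemp -d)
touch "$workdir/old.dat"            # existed before the previous run
touch "$workdir/.last_run"          # marker: when the previous run happened
sleep 1
touch "$workdir/new.dat"            # arrived after the previous run
find "$workdir" -type f -newer "$workdir/.last_run"
rm -rf "$workdir"
```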