Scenario-based Migration

Scenario-based migration speeds up migration by copying snapshots from the source and then restoring table data from them at the destination.

Prerequisites

  • The CDM cluster can communicate with the data source.
  • You have obtained the URL and the account for accessing the data source. The account must have read and write permissions on the data source.

Link to Hadoop

CDM supports the following Hadoop data sources:

MRS

When connecting CDM to the Hadoop component of MRS, configure the parameters as described in Table 1.

Table 1 MRS Hadoop link parameters

  • Name: Link name. Define it based on the data source type so that the purpose of the link is easy to identify. Example: mrs_scen_link
  • Manager IP: IP address of MRS Manager. Click Select next to the Manager IP text box to select an MRS cluster; CDM automatically fills in the authentication information. Example: 127.0.0.1
  • Authentication Method: Authentication method used for accessing MRS.
      • SIMPLE: for non-security mode
      • KERBEROS: for security mode
    Example: SIMPLE
  • HBase Version: Set it to the HBase version on the server. Example: HBASE_2_X
  • Hive Version: Set it to the Hive version on the server. Example: HIVE_3_X
  • Username: If Authentication Method is set to KERBEROS, provide the username and password used to log in to MRS Manager. If you need to create a snapshot when exporting a directory from HDFS, the user configured here must have administrator permission on HDFS. Example: cdm
  • Password: Password used to log in to MRS Manager.
  • Run Mode: Run mode of the HDFS link. The options are as follows:
      • EMBEDDED: The link instance runs with CDM. This mode delivers better performance.
      • STANDALONE: The link instance runs in an independent process. If CDM needs to connect to multiple Hadoop data sources (MRS, Hadoop, or CloudTable) using both Kerberos and Simple authentication modes, select STANDALONE or configure different agents. If STANDALONE is selected, CDM can migrate data between the HDFS services of multiple MRS clusters.
      • Agent: The link instance runs on an agent.
    Example: STANDALONE

FusionInsight Hadoop

When connecting CDM to the Hadoop component of FusionInsight HD, configure the parameters as described in Table 2.

Table 2 FusionInsight Hadoop link parameters

  • Name: Link name. Define it based on the data source type so that the purpose of the link is easy to identify. Example: FI_hdfs_link
  • Manager IP: IP address of FusionInsight Manager. Example: 127.0.0.1
  • Manager Port: Port number of FusionInsight Manager. Example: 28443
  • CAS Server Port: Port number of the CAS server used to connect to FusionInsight. Example: 20009
  • Username: Username used to log in to FusionInsight Manager. If you need to create a snapshot when exporting a directory from HDFS, the user configured here must have administrator permission on HDFS. Example: cdm
  • Password: Password used to log in to FusionInsight Manager.
  • Authentication Method: Authentication method used for accessing FusionInsight HD.
      • SIMPLE: for non-security mode
      • KERBEROS: for security mode
    Example: KERBEROS
  • HBase Version: Set it to the HBase version on the server. Example: HBASE_2_X
  • Hive Version: Set it to the Hive version on the server. Example: HIVE_3_X
  • Run Mode: Run mode of the HDFS link. The options are as follows:
      • EMBEDDED: The link instance runs with CDM. This mode delivers better performance.
      • STANDALONE: The link instance runs in an independent process. If CDM needs to connect to multiple Hadoop data sources (MRS, Hadoop, or CloudTable) using both Kerberos and Simple authentication modes, select STANDALONE or configure different agents. Note: STANDALONE mode also resolves version conflicts. If the connector versions of the source and destination ends of a link differ, a JAR file conflict occurs; running the source or destination end in the STANDALONE process prevents the migration failure that the conflict would cause.
      • Agent: The link instance runs on an agent.
    Example: STANDALONE
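Before creating the link, it can help to confirm that the Manager and CAS server ports are reachable over the network from the host running the check. The sketch below is illustrative only: the host and ports are the example values from Table 2, and `port_reachable` is a hypothetical helper, not part of CDM.

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example values from Table 2: FusionInsight Manager (28443) and the CAS server (20009).
for port in (28443, 20009):
    state = "reachable" if port_reachable("127.0.0.1", port) else "unreachable"
    print(f"127.0.0.1:{port} is {state}")
```

A failed check usually points to a security group, ACL, or firewall issue rather than a CDM link misconfiguration.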

Apache Hadoop

When connecting CDM to Apache Hadoop, configure the parameters as described in Table 3.

Table 3 Apache Hadoop link parameters

  • Name: Link name. Define it based on the data source type so that the purpose of the link is easy to identify. Example: hadoop_hdfs_link
  • URI: NameNode URI. Example: hdfs://nn1.example.com/
  • ZooKeeper Address: ZooKeeper address, which must be configured for HBase scenario-based migration. Example: hbase-node-1:2181
  • Hive Metastore: Hive metastore address. For details, see the hive.metastore.uris configuration item. Example: thrift://host-192-168-1-212:9083
  • Authentication Method: Authentication method used for accessing Hadoop.
      • SIMPLE: Select this if Hadoop is in non-security mode.
      • KERBEROS: Select this if Hadoop is in security mode. Obtain the principal account and keytab file of the client for authentication.
    Example: KERBEROS
  • Principal: When Authentication Method is set to KERBEROS, the principal account used for authentication. Contact the Hadoop administrator to obtain it. Example: USER@YOUR-REALM.COM
  • Keytab File: When Authentication Method is set to KERBEROS, the keytab file used for authentication. Contact the Hadoop administrator to obtain it. Example: /opt/user.keytab
  • IP and Host Name Mapping: If the HDFS configuration files use host names, configure the mapping between IP addresses and host names. Separate an IP address from its host name with a space, and separate mappings with semicolons (;), carriage returns, or line feeds. Example: 10.1.6.9 hostname01;10.2.7.9 hostname02
  • HBase Version: Set it to the HBase version on the server. Example: HBASE_2_X
  • Hive Version: Set it to the Hive version on the server. Example: HIVE_3_X
  • Run Mode: Run mode of the HDFS link. The options are as follows:
      • EMBEDDED: The link instance runs with CDM. This mode delivers better performance.
      • STANDALONE: The link instance runs in an independent process. If CDM needs to connect to multiple Hadoop data sources (MRS, Hadoop, or CloudTable) using both Kerberos and Simple authentication modes, select STANDALONE or configure different agents. Note: STANDALONE mode also resolves version conflicts. If the connector versions of the source and destination ends of a link differ, a JAR file conflict occurs; running the source or destination end in the STANDALONE process prevents the migration failure that the conflict would cause.
      • Agent: The link instance runs on an agent.
    Example: STANDALONE
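The format of the IP and Host Name Mapping value can be easy to get wrong. The hedged sketch below (the `mapping_to_hosts_lines` helper is illustrative, not part of CDM) shows how a mapping string using the documented separators expands into /etc/hosts-style lines:

```python
import re

def mapping_to_hosts_lines(mapping: str) -> list[str]:
    """Expand a CDM-style mapping string into one 'IP hostname' line per entry."""
    # Entries are separated by semicolons, carriage returns, or line feeds.
    entries = [e.strip() for e in re.split(r"[;\r\n]+", mapping) if e.strip()]
    lines = []
    for entry in entries:
        ip, hostname = entry.split(None, 1)  # IP and host name are space-separated
        lines.append(f"{ip} {hostname}")
    return lines

print(mapping_to_hosts_lines("10.1.6.9 hostname01;10.2.7.9 hostname02"))
# → ['10.1.6.9 hostname01', '10.2.7.9 hostname02']
```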

Procedure

  1. Log in to the CDM management console.
  2. In the left navigation pane, click Cluster Management. Locate the target cluster and click Job Management.
  3. Choose Job Management > Link Management > Create Link and set the connector type to Hadoop release version.
  4. Click Next. Set link parameters by referring to Link to Hadoop.
  5. Click Test to check whether the link is available. Alternatively, click Save. The system will automatically check whether the link is available.

    If the network is poor or the data source is too large, the link test may take 30 to 60 seconds.

  6. Choose Scenario Migration > Create Job. The page for configuring the job is displayed. Select a migration scenario (Hadoop migration, Hive migration, or HBase migration) and configure the job name.

    Figure 1 Configuring a scenario-based migration job

  7. Configure the source and destination job parameters, and select the link name and name of the database to be migrated.

    Figure 2 Configuring job parameters

  8. Click Next to go to the table selection page, and select the tables to be migrated based on your requirements.
  9. Click Next and set job parameters.

    Table 4 describes related parameters.
    Table 4 Task configuration parameters

    • Write Dirty Data: Whether to record dirty data. The default is No. Example: Yes
    • Write Dirty Data Link: Displayed only when Write Dirty Data is set to Yes. Only links to OBS support dirty data writes. Example: obs_link
    • OBS Bucket: Displayed only when Write Dirty Data Link is a link to OBS. Name of the OBS bucket to which dirty data will be written. Example: dirtydata
    • Dirty Data Directory: Displayed only when Write Dirty Data is set to Yes. Directory for storing dirty data on OBS; dirty data is saved only when this parameter is configured. You can check this directory for data that failed to be processed or was filtered out during job execution, and inspect the source data that did not meet conversion or cleaning rules. Example: /user/dirtydir
    • Max. Error Records in a Single Shard: Displayed only when Write Dirty Data is set to Yes. When the number of error records of a single map exceeds this limit, the job terminates automatically, and the data already imported cannot be rolled back. You are advised to use a temporary table as the destination table; after the data is imported, rename the table or merge it into the final table. Example: 0

  10. Click Save or Save and Run.

    When the job starts running, a sub-job will be generated for each table. You can click the job name to view the sub-job list.
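The dirty-data options in Table 4 interact with one another: bad records are diverted to the dirty data directory until the per-shard error limit is exceeded, at which point the job terminates without rolling back rows already written. The sketch below models only these documented semantics; it is not CDM's implementation, and all names in it are illustrative.

```python
# Illustrative model of the "Max. Error Records in a Single Shard" behavior.
def migrate_shard(records, max_errors=0, write_dirty_data=False):
    """Copy records, diverting failed ones to a dirty-data list up to max_errors."""
    migrated, dirty = [], []
    errors = 0
    for rec in records:
        if rec is None:  # stand-in for any record that fails conversion or cleaning
            errors += 1
            if write_dirty_data:
                dirty.append(rec)
            if errors > max_errors:
                # The job terminates; rows already migrated are NOT rolled back.
                raise RuntimeError("error limit exceeded; shard aborted")
            continue
        migrated.append(rec)
    return migrated, dirty

# With max_errors=1, one bad record is tolerated and captured as dirty data.
ok, dirty = migrate_shard(["a", None, "b"], max_errors=1, write_dirty_data=True)
print(ok, dirty)  # → ['a', 'b'] [None]
```

This is why the documentation recommends importing into a temporary destination table first: a shard abort leaves partially imported data behind, and a rename or merge afterwards keeps the final table consistent.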