Help Center/ DataArts Studio/ Best Practices/ Creating Table Migration Jobs in Batches Using CDM Nodes
Updated on 2023-03-03 GMT+08:00

Creating Table Migration Jobs in Batches Using CDM Nodes

Scenario

In a service system, data sources are usually stored in different tables to reduce the size of a single table in complex application scenarios.

In this case, you need to create a data migration job for each table when using CDM to integrate data. This tutorial describes how to use the For Each and CDM nodes provided by the DataArts Factory module to create table migration jobs in batches.

In this tutorial, the source MySQL database has three tables, mail01, mail02, and mail03. The tables have the same structure but different data content. The destination is MRS Hive.

Prerequisites

  • You have created a CDM cluster.
  • MRS Hive has been enabled.
  • Databases and tables have been created in MRS Hive.

Creating a Link

  1. Log in to the DataArts Studio console, locate the target DataArts Studio instance, and click Access on the instance card.
  2. Locate a workspace and click DataArts Migration.
  3. In the Operation column, click Job Management.
  4. Click the Links tab and then Driver Management. Upload the MySQL database driver by following the instructions in Managing Drivers.
  5. Click the Links tab and then Create Link. Select MySQL and click Next to configure parameters for the link. After the configuration is complete, click Save to return to the Links page.

    Figure 1 Configuring the MySQL link
    Table 1 MySQL link parameters

    Parameter

    Description

    Example

    Name

    Link name, which should be defined based on the data source type, so it is easier to remember what the link is for

    mysql

    Database Server

    IP address or domain name of the database to connect

    192.168.0.1

    Port Number

    Port of the database to connect

    3306

    Database

    Name of the database to connect

    mysql

    Username

    Username used for accessing the database This account must have the permissions required to read and write data tables and metadata.

    root

    Password

    Password of the username

    -

    Use Agent

    Whether to extract data from the data source through an agent

    Disabled

  6. Click the Links tab and then Create Link. Select MRS Hive and click Next to configure parameters for the link. After the configuration is complete, click Save to return to the Links page.

    Figure 2 Configuring the MRS Hive link
    Table 2 MRS Hive link parameters

    Parameter

    Remarks

    Example

    Metric Name

    Link name, which should be defined based on the data source type, so it is easier to remember what the link is for

    hive

    Manager IP

    Floating IP address of MRS Manager. Click Select next to the Manager IP text box to select an MRS cluster. CDM automatically fills in the authentication information.

    127.0.0.1

    Authentication Method

    Authentication method used for accessing MRS
    • SIMPLE: Select this for non-security mode.
    • KERBEROS: Select this for security mode.

    KERBEROS

    Hive Version

    Hive version. Set it to the Hive version on the server.

    HIVE_3_X

    Username

    If Authentication Method is set to KERBEROS, you must provide the username and password used for logging in to MRS Manager. If you need to create a snapshot when exporting a directory from HDFS, the user configured here must have the administrator permission on HDFS.

    To create a data connection for an MRS security cluster, do not use user admin. The admin user is the default management page user and cannot be used as the authentication user of the security cluster. You can create an MRS user and set Username and Password to the username and password of the created MRS user when creating an MRS data connection.
    NOTE:
    • If the CDM cluster version is 2.9.0 or later and the MRS cluster version is 3.1.0 or later, the created user must have the permissions of the Manager_viewer role to create links on CDM. To perform operations on databases, tables, and data of a component, you also need to add the user group permissions of the component to the user.
    • If the CDM cluster version is earlier than 2.9.0 or the MRS cluster version is earlier than 3.1.0, the created user must have the permissions of Manager_administrator, Manager_tenant, or System_administrator to create links on CDM.

    cdm

    Password

    Password for logging in to MRS Manager

    -

    OBS storage support

    The server must support OBS storage. When creating a Hive table, you can store the table in OBS.

    Disabled

    Run Mode

    This parameter is used only when the Hive version is HIVE_3_X. Possible values are:
    • EMBEDDED: The link instance runs with CDM. This mode delivers better performance.
    • STANDALONE: The link instance runs in an independent process. If CDM needs to connect to multiple Hadoop data sources (MRS, Hadoop, or CloudTable) with both Kerberos and Simple authentication modes, select STANDALONE or configure different agents.

      Note: The STANDALONE mode is used to solve the version conflict problem. If the connector versions of the source and destination ends of the same link are different, a JAR file conflict occurs. In this case, you need to place the source or destination end in the STANDALONE process to prevent the migration failure caused by the conflict.

    EMBEDDED

    Use Cluster Config

    You can use the cluster configuration to simplify parameter settings for the Hadoop connection.

    Disabled

Creating a Sample Job

  1. Create a job for migrating the first MySQL table mail001 to the MRS Hive table mail.

    Note: Select Do not Delete for the Delete Job After Completion parameter.

  2. After the sample job is created, view and copy the job JSON for subsequent configuration of data development jobs.

Creating a Data Development Job

  1. Locate a workspace and click DataArts Factory.
  2. Create a subjob named table, select the CDM node, select New jobs for Job Type in Properties, and copy and paste the JSON file in Step 2 to the CDM Job Message Body.

  3. Edit the CDM job message body.

    1. Since there are three source tables mail001, mail002, and mail003, you need to set fromJobConfig.tableName to mail${num} in the JSON file of the job. The following figure shows the parameters for creating a main job.

    2. The name of each data migration job must be unique. Therefore, you need to change the value of name in the JSON file to mail${num} to create multiple CDM jobs. The following figure shows the parameters for creating a main job.

      To create a sharding job, you can change the source link in the job JSON file to a variable that can be easily replaced.

  4. Add the num parameter, which is invoked in the job JSON file. The following figure shows the parameters for creating a main job.

    Click Save and Submit to save the subjobs.

  5. Create the main job integration_management. Select the For Each node that executes the subjobs in a loop and transfers parameters 001, 002, and 003 to the subjobs to generate different table extraction tasks.

    The key configurations are as follows:
    • Subjob in a Loop: Select table.
    • Dataset: Enter [['001'],['002'],['003']].
    • Job Running Parameter: Enter @@#{Loop.current[0]}@@.

      Add @@ to the EL expression of the job running parameter. If @@ is not added, dataset 001 will be identified as 1. As a result, the source table name does not exist.

    The following figure shows the parameters for creating a main job.

    Click Save and Submit to save the main job.

  6. After the main job and subjobs are created, test and run the main job to check whether it is successfully created. If the job is successfully executed, the CDM subjobs are successfully created and executed.

Important Notes

  • Some attributes, such as fromJobConfig.BatchJob, may not be supported in some CDM versions. If an error is reported during task creation, you need to delete the attribute from the request body. The following figure shows the parameters for creating a main job.

  • If a CDM node is configured to create a job, the node checks whether a CDM job with the same name is running.
    • If the CDM job is not running, update the job with the same name based on the request body.
    • If a CDM job with the same name is running, wait until the job is run. During this period, the CDM job may be started by other tasks. As a result, the extracted data may not be the same as expected (for example, the job configuration is not updated, or the macro of the running time is not correctly replaced). Therefore, do not start or create multiple jobs with the same name.