Updated on 2024-10-23 GMT+08:00

Configuring an Offline Processing Migration Job

When creating an offline processing data migration job, you can select the source and destination data and configure parameters to periodically synchronize all or incremental data of a table, database/table partition, or entire database from the source to the destination.

This section describes the common configurations of an offline processing migration job. The configuration varies depending on the data source. For details, see Configuring Source Job Parameters and Configuring Destination Job Parameters.

Notes and Constraints

The field type and precision of the source must be the same as those of the destination. Otherwise, the job may fail to run.

Pay attention to the precision of the field types at the source and destination. If the destination field type has a smaller maximum value, a larger minimum value, or a lower precision than the source field type, the write may fail or the precision may be truncated. For example, if a source BIGINT column is written to a destination INT column, values that exceed the INT range cannot be written.

Prerequisites

  • A data connection has been created, and DataArts Migration has been selected for the connection. For details, see Creating a DataArts Studio Data Connection.
  • A CDM cluster is running. For details, see Creating a CDM Cluster.

    If the CDM cluster provided by the DataArts Studio instance (except the trial version) meets your requirements, you do not need to buy a DataArts Migration incremental package. If you need to create another CDM cluster, buy a CDM incremental package by referring to Buying a CDM Incremental Package.

  • The CDM cluster can communicate with the data source.
    • If the CDM cluster and a cloud service are in the same region, VPC, subnet, and security group, they can communicate with each other through an intranet.
    • If the CDM cluster and the cloud service are in the same region and VPC but in different subnets or security groups, you must configure routing rules and security group rules. For details about how to configure routing rules, see Configuring Routing Rules. For details about how to configure security group rules, see Configuring Security Group Rules.
    • If the CDM cluster and a cloud service are in different VPCs of the same region, you can create a VPC peering connection to enable them to communicate with each other. For details about how to configure a VPC peering connection, see VPC Peering Connection.

      Note: If a VPC peering connection is created, the peer VPC subnet may overlap with the CDM management network. As a result, data sources in the peer VPC cannot be accessed. You are advised to use the Internet for cross-VPC data migration, or contact the administrator to add specific routes for the VPC peering connection in the CDM background.

    • If the CDM cluster and a cloud service are located in different regions, you need to use the Internet or Direct Connect to enable them to communicate with each other. When using the Internet, ensure that an EIP has been bound to the CDM cluster, the security group of CDM allows outbound traffic from the host where the off-cloud data source is located, the host where the data source is located can access the Internet, and the connection port has been enabled in the firewall rules.
    • In addition, an enterprise project may also affect the communication between the CDM cluster and other cloud services. The CDM cluster can communicate with a cloud service only if they belong to the same enterprise project.

Procedure

  1. Create an offline processing migration job by referring to Creating an Offline Processing Migration Job.
  2. Configure types.
    Figure 1 Configuring types
    1. Set the source connection type and destination connection type. For details about supported data sources, see Supported Data Sources.
    2. Set Migration Job Type.
      1. Migration Type: The default value is Offline and cannot be changed.
      2. Migration Scenario: Select one from Single table, Database/Table partition, and Entire DB Migration. For details about the supported data sources, see Supported Data Sources.
    3. Configure Network Resource Configuration.
      1. Select a created source data connection for which DataArts Migration was selected. If no connection is available, create one by referring to Creating a DataArts Studio Data Connection.

        Check whether the source and the resource group can communicate with each other. If they cannot, modify the network settings as prompted.

      2. Select a resource group. For details about how to create a cluster, see Creating a CDM Cluster.

        If multiple clusters are selected, the system randomly delivers tasks. Therefore, you are advised to select clusters of the same version. Otherwise, the job may fail due to inconsistent cluster versions.

      3. Select a created destination data connection for which DataArts Migration was selected. If no connection is available, create one by referring to Creating a DataArts Studio Data Connection.

        Check whether the data connection is available. If it is unavailable, switch to another one as prompted.

  3. Configure source parameters.
    The configuration varies depending on the data source and synchronization scenario. After selecting a source connection, configure job parameters by referring to Configuring Source Job Parameters.
    Table 1 Required source job parameters

    Scenario | Required Source Parameters | Field Mapping
    Single table | Basic parameters; advanced attributes | Supported
    Database/Table partition | Database/Table mode (exact match or regular expression match); advanced attributes | Supported
    Entire DB migration | Database tables to be migrated; advanced attributes | Not supported

  4. Configure destination parameters.
    The configuration varies depending on the data source and synchronization scenario. After selecting a destination connection, configure job parameters by referring to Configuring Destination Job Parameters.
    Table 2 Required destination job parameters

    Scenario | Required Destination Parameters | Field Mapping
    Single table | Basic parameters; advanced attributes | Supported
    Database/Table partition | Basic parameters; advanced attributes | Supported
    Entire DB migration | Database and table matching policy | Not supported

  5. Configure field mapping.

    After configuring source and destination parameters, you need to configure the mapping between source and destination columns. After the field mapping is configured, the job writes source fields to fields of the corresponding types at the destination based on the field mapping.

    1. Field mapping configuration: Set the field mapping mode and batch field mapping rule.
      • Field Mapping Mode
        • Same name: Source and destination fields with the same column name are automatically mapped.
        • Same row: Source and destination fields in the same row are automatically mapped, even if their names differ.
      • Batch Field Mapping: This parameter is not displayed when Use SQL Statement in the source configuration is set to Yes.

        Enter field mappings, with one field mapping in each row. Place fields from the source table to the left of the equal sign (=) and fields from the destination table to the right of the equal sign (=), for example, reader_column=writer_column.

        Click View and Edit to set the batch field mapping.
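
        For example, a minimal batch field mapping, assuming hypothetical source columns id, name, and create_time and destination columns user_id, user_name, and created_at:

          id=user_id
          name=user_name
          create_time=created_at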

    2. Field mapping: batch conversion, field adding, and row moving

      • Set Converter: Convert source fields in batches.

        Select the target fields and click Set Converter. In the displayed dialog box, create a converter as prompted.

      • Delete Field: This function is unavailable when Use SQL Statement in the source configuration is set to Yes. Select the target fields and click Delete Field.

        You can view the deleted fields in Removed Fields in the Add Field dialog box.

      • Add Field: This function is unavailable when Use SQL Statement in the source configuration is set to Yes. You can add new fields or removed fields to the source and destination configurations.

        The following types of fields are supported:

        • Functions, for example, now() or curdate() for MySQL, and now() or transaction_timestamp() for PostgreSQL
        • Functions with keywords, for example, to_char(current_date,'yyyy-MM-dd') for PostgreSQL
        • Fixed values, such as 123 and '123' (both indicate the string 123)
        • Variable values, for example, ${workDate} (workDate must be defined as a job variable)
        • Fixed variables for JDBC, such as DB_NAME_SRC (source database name), TABLE_NAME_SRC (source table name), and DATASOURCE_NAME_SRC (data source name)
        • as statements, such as '123' as test and now() as curTime
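
        For example, based on the field types listed above and assuming the job variable workDate has been defined, any of the following could be entered as an added field:

          '123' as test
          now() as curTime
          ${workDate}
          DB_NAME_SRC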

      • Move rows: This function is unavailable when Use SQL Statement in the source configuration is set to Yes. Drag a field's row to move it up or down.
      • View converters: (Optional) CDM can convert fields. Click the corresponding icon and then click Create Converter. For details about how to use converters, see Configuring Field Converters.
      • Search for and match destination fields: Click the corresponding icon in the Operation column. In the displayed dialog box, search for a field by keyword or click a field directly.
      • Delete fields: You can delete the default fields of the table. To delete a field, click the corresponding icon in the Operation column. Removed fields can be found in Removed Fields in the Add Field dialog box.
      • Example Field Mapping: This parameter is not displayed when Use SQL Statement in the source configuration is set to Yes. You can view the example mapping of source and destination fields.
      • If files are migrated between FTP, SFTP, HDFS, and OBS and the migration source's File Format is set to Binary, files are transferred directly without field mapping.
      • Field mapping configuration is not required for entire DB migration.

      • During the migration, the field types at the source and destination may not match. As a result, dirty data is generated, and data cannot be written to the destination. For details about the number of dirty data records allowed during the migration, see the next step.
      • If a field at the source is not mapped to a field at the destination, the field at the source will not be synchronized to the destination.
      • In other scenarios, CDM automatically maps fields of the source table and the destination table. You need to check whether the mapping and time format are correct. For example, check whether the source field type can be converted into the destination field type.
      • If the field mapping is incorrect, you can drag fields to adjust the mapping. (This function is supported when Use SQL Statement in the source configuration is set to No.)
      • If you cannot obtain all columns by obtaining sample values on the field mapping page, you can add a custom field or create a field converter in the Operation column to ensure that all required data can be imported to the destination.
      • In the Field Mapping area, you can also add custom constants, variables, and expressions.
      • The column name is available only when the Extract first row as columns parameter is set to Yes during the migration of a CSV file from OBS.
      • When SQLServer is the destination, fields of the timestamp type cannot be written. You must change their type (for example, to datetime) so that they can be written.
      • If the data is imported to DWS, you need to select the distribution columns in the destination fields. You are advised to select the distribution columns according to the following principles:
        1. Use the primary key as the distribution column.
        2. If the primary key consists of multiple columns, specify all of them as distribution columns.
        3. If the table has no primary key and no distribution column is selected, DWS uses the first column as the distribution column by default, which poses a risk of data skew.

  6. Configure task properties.

    You can configure the parameters listed in Table 3 for data synchronization.

    Table 3 Task parameters

    Parameter: Concurrent Extractors
    Description: Number of tasks into which a CDM job is split.
      CDM migrates data through data migration jobs. It works in the following way:
      1. When data migration jobs are submitted, CDM splits each job into multiple tasks based on the Concurrent Extractors parameter in the job configuration.
         NOTE: Jobs for different data sources may be split based on different dimensions. Some jobs may not be split based on the Concurrent Extractors parameter.
      2. CDM submits the tasks to the running pool in sequence. A maximum of Maximum Concurrent Extractors tasks run concurrently; excess tasks are queued.
      By setting appropriate values for Concurrent Extractors and Maximum Concurrent Extractors, you can accelerate migration.
      Configure the number of concurrent extractors based on the following rules:
      1. When data is migrated to files, CDM does not support multiple concurrent tasks. In this case, extract data in a single process.
      2. If each row of the table contains 1 MB of data or less, data can be extracted concurrently. If a row contains more than 1 MB of data, you are advised to extract data in a single thread.
      3. Set Concurrent Extractors for a job based on Maximum Concurrent Extractors for the cluster. It is recommended that Concurrent Extractors be less than Maximum Concurrent Extractors.
      4. If the destination is DLI, you are advised to set the number of concurrent extractors to 1. Otherwise, data may fail to be written.
      The maximum number of concurrent extractors for a cluster depends on the CDM cluster flavor. You are advised to set it to twice the number of vCPUs of the CDM cluster. For example, the maximum number of concurrent extractors for a cluster with 8 vCPUs and 16 GB of memory is 16.
    Example value: 3

    Parameter: Number of split retries
    Description: Number of retries when a split fails to be executed. The value 0 indicates that no retry will be performed.
      NOTE: This parameter takes effect only when the destination is Hudi or DWS and the import mode is UPSERT.
    Example value: 0

    Parameter: Write Dirty Data
    Description: Whether to record dirty data. By default, dirty data is not recorded. A large amount of dirty data affects the synchronization speed of the task.
      • No (default): Dirty data is not recorded and is not allowed. If dirty data is generated during the synchronization, the task fails and exits.
      • Yes: Dirty data is allowed, that is, dirty data does not affect task execution. When dirty data is allowed and its threshold is set:
        • If the amount of generated dirty data is within the threshold, the synchronization task ignores the dirty data (that is, the dirty data is not written to the destination) and runs normally.
        • If the amount of generated dirty data exceeds the threshold, the synchronization task fails and exits.
      NOTE: Criteria for determining dirty data: dirty data is data that is meaningless to services, is in an invalid format, or is generated when the synchronization task encounters an error. If an exception occurs when a piece of data is written to the destination, that piece of data is dirty data. Therefore, data that fails to be written is classified as dirty data.
      For example, if data of the VARCHAR type at the source is written to a destination column of the INT type, the data cannot be written to the destination because the conversion is invalid. When configuring a synchronization task, you can configure whether to write dirty data during the synchronization and set the maximum number of error records allowed in a single partition to keep the task running. When the number of dirty data records exceeds the threshold, the task fails and exits.
    Example value: No

    Parameter: Write Dirty Data Link
    Description: This parameter is displayed only when Write Dirty Data is set to Yes.
      Only links to OBS support dirty data writes.
    Example value: obslink

    Parameter: OBS Bucket
    Description: This parameter is displayed only when Write Dirty Data Link is a link to OBS.
      Name of the OBS bucket to which the dirty data will be written.
    Example value: dirtydata

    Parameter: Dirty Data Directory
    Description: This parameter is displayed only when Write Dirty Data is set to Yes.
      Directory for storing dirty data on OBS. Dirty data is saved only when this parameter is configured.
      You can go to this directory to query data that failed to be processed or was filtered out during job execution, and check the source data that does not meet conversion or cleaning rules.
    Example value: /user/dirtydir

    Parameter: Max. Error Records in a Single Shard
    Description: This parameter is displayed only when Write Dirty Data is set to Yes.
      When the number of error records in a single partition exceeds this upper limit, the job automatically terminates and the imported data cannot be rolled back.
      You are advised to use a temporary table as the destination table. After the data is imported, rename the table or merge it into the final data table.
    Example value: 0

    Parameter: Throttling
    Description: Whether to enable throttling for the synchronization. The rate indicates the CDM transmission rate, not the NIC traffic.
      • Yes: By limiting the synchronization rate, you can prevent the source database from being overloaded due to a high extraction speed. The minimum rate allowed is 1 MB/s.
      • No: The task provides the highest transmission performance that the existing hardware allows, based on the configured maximum number of concurrent tasks.
      NOTE:
      • Throttling can control the rate of a job that migrates data to MRS Hive, DLI, relational databases, OBS, or Apache HDFS.
      • To configure throttling for multiple jobs, multiply the rate by the number of concurrent jobs.
    Example value: Yes

    Parameter: byteRate(MB/s)
    Description: Maximum rate for a job. To configure throttling for multiple jobs, multiply the rate by the number of concurrent jobs.
      NOTE: The rate is an integer greater than 1.
    Example value: 10

  7. Save the job.

    After configuring the job, click Save in the upper left corner to save the job configuration.

    After the job is saved, a version is automatically generated and displayed in Versions. The version can be rolled back. If you save a job multiple times within a minute, only one version is recorded. If the intermediate data is important, you can click Save new version to save and add a version.

  8. Test the job.
    After configuring the job, click Test in the upper left corner to test the job. If the test fails, view the logs of the job node and locate and rectify the fault.
    • A test execution is similar to a single execution and migrates data.
    • You can view the test run logs of the job by clicking View Log.
    • If you test the job before submitting a version, the version of the generated job instance is 0 on the Job Monitoring page.
  9. Submit a job version.

    If you want the job to be scheduled periodically, you need to release the job to the production environment. For how to release a job, see Releasing a Job Task.

  10. Schedule the job.

    Set the scheduling mode for the job. For details about how to schedule the job, see Setting Up Scheduling for a Job.