Table/File Migration Jobs

Scenario

CDM can migrate tables or files between homogeneous and heterogeneous data sources. For details about data sources that support table/file migration, see Data Sources Supported by CDM.

CDM is applicable to data migration to the cloud, data exchange on the cloud, and data migration to on-premises service systems.

Prerequisites

  • You have created links based on the instructions in Creating Links.
  • The CDM cluster can communicate with the data source.

Procedure

  1. Log in to the CDM management console.
  2. In the left navigation pane, click Cluster Management. Locate the target cluster and click Job Management.
  3. Choose Table/File Migration > Create Job. The page for configuring the job is displayed.

    Figure 1 Creating a migration job

  4. Select the source and destination links.

    • Job Name: Enter a custom job name. The name contains 1 to 256 characters, consisting of letters, digits, and underscores (_), for example, oracle2obs_t. (A quick validation sketch follows the note below.)
    • Source Link Name: Select the data source from which data will be exported.
    • Destination Link Name: Select the data source to which data will be imported.

    If no link is available, click + or go to the Link Management page to create one. For details about how to create a link, see Creating Links.
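    The console performs the real validation; as an illustration only, the following minimal Python sketch checks a candidate name against the rule as stated above (the rule itself is taken from the text, not from CDM source code):

    ```python
    import re

    # Mirrors the job name rule stated above (an assumption of this sketch):
    # 1 to 256 characters, drawn from letters, digits, and underscores.
    JOB_NAME_RE = re.compile(r"^[A-Za-z0-9_]{1,256}$")

    print(bool(JOB_NAME_RE.match("oracle2obs_t")))   # True: the example name from the text
    print(bool(JOB_NAME_RE.match("bad name!")))      # False: spaces and '!' are not allowed
    ```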

  5. Configure the source link parameters. Figure 2 shows the job configurations for migrating MySQL to DWS.

    Figure 2 Creating a job

    The parameters vary by data source. For details about the job parameters for other types of data sources, see Table 1 and Table 2.

    Table 1 Source link parameters

    • OBS, Alibaba Cloud OSS, KODO, or COS
      Data can be extracted in CSV, JSON, or binary format. Binary extraction skips file parsing, which delivers the highest performance and is the better fit for file migration.
      Currently, data cannot be imported to Alibaba Cloud OSS, KODO, or COS.
      For details, see From OBS/OSS/KODO/COS/S3.

    • MRS HDFS, FusionInsight HDFS, or Apache HDFS
      HDFS data can be exported in CSV, Parquet, or binary format and can be compressed in multiple formats.
      For details, see From HDFS.

    • MRS HBase, FusionInsight HBase, Apache HBase, or CloudTable Service
      Data can be exported from MRS, FusionInsight HD, open-source Apache HBase, or CloudTable. You need to know all column families and field names of the HBase tables.
      For details, see From HBase/CloudTable.

    • MRS Hive
      Data can be exported from Hive through the JDBC API. If the data source is Hive, CDM automatically partitions data using the Hive data partitioning file.
      For details, see From Hive.

    • FTP, SFTP, Network Attached Storage (NAS), or SFS Turbo
      FTP, SFTP, NAS, or SFS data can be exported in CSV, JSON, or binary format.
      For details, see From FTP/SFTP/NAS/SFS.

    • HTTP or HTTPS
      These connectors read files from an HTTP/HTTPS URL, such as public files on third-party object storage systems or web disks. Currently, data can only be exported from HTTP/HTTPS URLs.
      For details, see From HTTP/HTTPS.

    • Data Warehouse Service, RDS for MySQL, RDS for SQL Server, RDS for PostgreSQL, or Dameng database
      Data can be exported from these cloud database services. CDM uses the JDBC API to extract the data, and the source job parameters are the same for all of them.
      For details, see From a Relational Database.

    • FusionInsight LibrA or Derecho (GaussDB)
      Data can be exported from FusionInsight LibrA and Derecho.

    • MySQL, PostgreSQL, Oracle, IBM Db2, or Microsoft SQL Server
      These non-cloud databases can be created in an on-premises data center, deployed on ECSs, or provided as database services on third-party clouds.

    • MongoDB or Document Database Service
      Data can be exported from MongoDB or DDS.
      For details, see From MongoDB/DDS.

    • Redis
      Data can be exported from open-source Redis.
      For details, see From Redis.

    • Data Ingestion Service
      Currently, data can only be exported from DIS to CSS, Apache Kafka, or DMS Kafka.
      For details, see From DIS.

    • Apache Kafka or DMS Kafka
      Currently, data can only be exported from Kafka to CSS, DIS, or DMS Kafka.
      For details, see From Apache Kafka/DMS Kafka.

    • Cloud Search Service or Elasticsearch
      Data can be exported from CSS or Elasticsearch.
      For details, see From Elasticsearch or CSS.

  6. Configure job parameters for the migration destination based on Table 2.

    Table 2 Destination link parameters

    • OBS
      Files, even in large volumes, can be batch migrated to OBS in CSV or binary format.
      For details, see To OBS.

    • MRS HDFS, FusionInsight HDFS, or Apache HDFS
      You can select a compression format when importing data to HDFS.
      For details, see To HDFS.

    • MRS HBase, FusionInsight HBase, Apache HBase, or CloudTable Service
      Data can be imported to HBase. The compression algorithm can be set when a new HBase table is created.
      For details, see To HBase/CloudTable.

    • MRS Hive
      Data can be rapidly imported to MRS Hive.
      For details, see To Hive.

    • FTP, SFTP, Network Attached Storage (NAS), or SFS Turbo
      When FTP, SFTP, or NAS servers serve as the migration destination, CDM typically migrates cloud data analysis results back to local file systems.
      For details, see To FTP/SFTP/NAS/SFS.

    • Data Warehouse Service, RDS for MySQL, RDS for SQL Server, or RDS for PostgreSQL
      Data can be imported to these cloud database services. For details about how CDM uses the JDBC API to import the data, see To a Relational Database.
      • When importing data to DWS, specify the COPY or GDS import mode to improve performance. You can set the Import Mode parameter when creating the DWS link.
      • When importing data to RDS for MySQL, enable the MySQL LOAD DATA function to accelerate the import. You can set Use Local API when creating the MySQL link.

    • FusionInsight LibrA
      Data can be imported to FusionInsight LibrA but not to Derecho (GaussDB).

    • MySQL
      The destination database can be MySQL created in an on-premises data center, deployed on ECSs, or provided as a database service on a third-party cloud.

    • Document Database Service
      Data can be imported to DDS but not to on-premises MongoDB.
      For details, see To DDS.

    • Distributed Cache Service
      Data can be imported to DCS as values of the String or Hashmap type. Data cannot be imported to on-premises Redis.
      For details, see To DCS.

    • Cloud Search Service or Elasticsearch
      Data can be imported to CSS or Elasticsearch.
      For details, see To Elasticsearch or CSS.

    • Data Lake Insight
      Data can be imported to DLI.
      For details, see To DLI.

  7. After the parameters are configured, click Next. The Map Field tab page is displayed.

    If files are migrated between FTP, SFTP, NAS, HDFS, and OBS and the migration source's File Format is set to Binary, the files are transferred directly and no field mapping is involved.

    In other scenarios, CDM automatically maps the fields of the source table to those of the destination table. You need to check whether the mapping and the time format are correct, for example, whether each source field type can be converted into the destination field type; a small sketch after the following list illustrates such a check.

    Figure 3 Field mapping
    • If the fields from the source and destination do not match, you can drag the fields to adjust the mapping.
    • On the Map Field tab page, if CDM fails to obtain all columns by sampling values (which is likely when data is exported from HBase, CloudTable, or MongoDB), you can click the icon and select Add a new field to add the missing fields so that the data imported to the migration destination is complete.
    • If the data is imported to DWS, you need to select the distribution columns among the destination fields. You are advised to select them according to the following principles:
      1. Use the primary key as the distribution column.
      2. If the primary key is composite, specify all primary key columns as distribution columns.
      3. If no primary key is available and no distribution column is selected, DWS uses the first column as the distribution column by default, which creates a risk of data skew.
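    The type check mentioned above can be pictured with a short sketch. The type names and conversion rules below are purely illustrative assumptions, not CDM's actual conversion matrix; consult the destination database's documentation for the conversions it really supports:

    ```python
    # Illustrative source-to-destination type compatibility rules
    # (assumptions of this sketch, not CDM's actual matrix).
    COMPATIBLE = {
        "int":     {"int", "bigint", "varchar"},
        "bigint":  {"bigint", "varchar"},
        "varchar": {"varchar", "text"},
        "date":    {"date", "timestamp", "varchar"},
    }

    def find_bad_mappings(mapping):
        """Return fields whose source type cannot be converted into the
        mapped destination type. mapping: {field: (source_type, dest_type)}."""
        return [field for field, (src, dst) in mapping.items()
                if dst not in COMPATIBLE.get(src, set())]

    print(find_bad_mappings({
        "id":         ("int", "bigint"),     # OK
        "name":       ("varchar", "text"),   # OK
        "created_at": ("date", "int"),       # not convertible under these rules
    }))  # -> ['created_at']
    ```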

  8. CDM supports field conversion. Click the icon and then click Create Converter.

    Figure 4 Creating a converter

    CDM supports the following converters (a runnable sketch of their behavior follows this list):

    • Anonymization: hides key parts of a character string.
      For example, to convert 12345678910 to 123****8910, configure the parameters as follows:
      • Set Reserve Start Length to 3.
      • Set Reserve End Length to 4.
      • Set Replace Character to *.
    • Trim: deletes the spaces before and after a character string.
    • Reverse string: reverses a character string, for example, ABC becomes CBA.
    • Replace string: replaces a specified character string.
    • Expression conversion: uses the JSP Expression Language (EL) to convert the current field or an entire row of data. For details, see Field Conversion.
    • Remove line break: deletes newline characters, such as \n, \r, and \r\n, from the field.
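    The converters are configured in the console, but their effects are easy to model. The following minimal Python sketch (not CDM code; the function names are hypothetical) reproduces the behavior described above, including the anonymization example:

    ```python
    # Hypothetical models of the converters described above (illustration
    # only; CDM applies these conversions in its own runtime).

    def anonymize(s: str, reserve_start: int, reserve_end: int,
                  replace_char: str = "*") -> str:
        """Keep the first reserve_start and last reserve_end characters and
        mask everything in between with replace_char."""
        masked = max(len(s) - reserve_start - reserve_end, 0)
        return s[:reserve_start] + replace_char * masked + s[len(s) - reserve_end:]

    def remove_line_breaks(s: str) -> str:
        """Delete \\r\\n, \\r, and \\n from the field."""
        return s.replace("\r\n", "").replace("\r", "").replace("\n", "")

    # The anonymization example from the text: 12345678910 -> 123****8910
    assert anonymize("12345678910", reserve_start=3, reserve_end=4) == "123****8910"
    assert "  a b  ".strip() == "a b"               # Trim
    assert "ABC"[::-1] == "CBA"                     # Reverse string
    assert "a-b".replace("-", "_") == "a_b"         # Replace string
    assert remove_line_breaks("x\r\ny") == "xy"     # Remove line break
    ```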

  9. Click Next, set job parameters, and click Show Advanced Attributes to display and configure optional parameters.

    Figure 5 Task parameters
    Table 3 describes related parameters.
    Table 3 Job parameters

    • Retry upon Failure
      You can select Retry 3 times or Never. You are advised to enable automatic retry only for file migration jobs, or for database migration jobs with Import to Staging Table enabled, to avoid data inconsistency caused by repeated writes.
      Example value: Never

    • Group
      Group to which the job belongs. The default group is DEFAULT. On the Job Management page, jobs can be displayed, started, or exported by group.
      Example value: DEFAULT

    • Schedule Execution
      If you select Yes, you can set the start time, cycle, and validity period of the job. For details, see Scheduling Job Execution.
      Example value: No

    • Concurrent Extractors
      Number of extractors that run concurrently. Generally, retain the default value.
      Example value: 1

    • Concurrent Loaders
      Number of loaders that run concurrently. This parameter is displayed only when HBase or Hive serves as the destination data source.
      Example value: 3

    • Write Dirty Data
      Whether to record dirty data. This parameter is set to No by default.
      Example value: Yes

    • Write Dirty Data Link
      Displayed only when Write Dirty Data is set to Yes. Only links to OBS support dirty data writes.
      Example value: obs_link

    • OBS Bucket
      Displayed only when Write Dirty Data Link is a link to OBS. Name of the OBS bucket to which dirty data will be written.
      Example value: dirtydata

    • Dirty Data Directory
      Displayed only when Write Dirty Data is set to Yes. OBS directory in which dirty data is stored; dirty data is saved only when this parameter is configured. You can check this directory for data that failed to be processed or was filtered out during job execution, and review the source data that did not meet the conversion or cleaning rules.
      Example value: /user/dirtydir

    • Max. Error Records in a Single Shard
      Displayed only when Write Dirty Data is set to Yes. When the number of error records in a single map exceeds this limit, the job terminates automatically, and the data that has already been imported cannot be rolled back. You are advised to use a temporary table as the destination; after the import succeeds, rename the table or merge it into the final data table.
      Example value: 0

    • Delete Job After Completion
      What happens to the job after it is executed:
      • Do not delete: The job is retained after execution.
      • Delete after success: The job is deleted only if it executes successfully. This option suits one-time bulk jobs.
      • Delete: The job is deleted regardless of whether it succeeds or fails.
      Example value: Do not delete

  10. Click Save or Save and Run. On the page displayed, you can view the job status.

    The job status can be New, Pending, Booting, Running, Failed, or Succeeded.

    Pending indicates that the job is waiting to be scheduled by the system, and Booting indicates that the data to be migrated is being analyzed.