Updated on 2026-05-20 GMT+08:00

Apache HDFS

DataArts Migration can efficiently migrate data from and to Apache HDFS.

Preparation and Constraints

  • Network requirements

    The Apache HDFS data source must be able to communicate with CDM so that data can be transmitted. For details, see Enabling Network Connectivity.

  • Enabling access ports: The default port numbers listed in Table 1 may differ from those used by your Hadoop deployment. If the ports have been changed, enable the changed ports instead.
    Table 1 Service ports

    Service  Port Type  Port Number  Usage
    HDFS     TCP        8020         HDFS 2.x NameNode service port
    HDFS     TCP        9820         HDFS 3.x NameNode service port
    HDFS     TCP        9866         HDFS DataNode service port
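Before creating a job, it can help to confirm that the ports in Table 1 accept TCP connections. The following is a minimal sketch using only the Python standard library. The helper names `port_is_reachable` and `check_hdfs_ports` are illustrative and not part of DataArts Migration, and the probe only reflects reachability from the machine that runs the script, which is not necessarily the network path the migration resources use.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_hdfs_ports(host: str, ports: dict[int, str]) -> dict[int, bool]:
    """Probe every port in the mapping and report whether it accepted a connection."""
    return {port: port_is_reachable(host, port) for port in ports}


# Ports taken from Table 1; replace them if your cluster uses changed values.
HDFS_PORTS = {
    8020: "HDFS 2.x NameNode service port",
    9820: "HDFS 3.x NameNode service port",
    9866: "HDFS DataNode service port",
}
```

For example, `check_hdfs_ports("namenode.example.com", HDFS_PORTS)` returns a mapping from each port number to `True` or `False`, where `namenode.example.com` is a placeholder host name. A cluster normally exposes only one of the two NameNode ports, so one `False` entry among 8020 and 9820 is expected.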

Supported Migration Scenarios

DataArts Migration supports the following modes for synchronizing on-premises data:

  • Single table synchronization

    A single table or file can be synchronized when data is ingested into a data lake or migrated to the cloud.

  • Database and table shard synchronization

    Data from multiple databases and tables can be synchronized when data is ingested into a data lake or migrated to the cloud.

  • Entire DB migration

    Data from an entire on-premises database can be synchronized when data is ingested into a data lake or migrated to the cloud.

Database and table shard synchronization and entire DB migration are available only in some regions. The following table lists the supported Apache HDFS migration scenarios.

Supported Migration Scenario

Single Table Read

Single Table Write

Database/Table Shard Read

Database/Table Shard Write

Entire DB Read

Entire DB Write

Supported

x

x

x

Core Capabilities

  • Connection configuration

    Configuration Item   Supported         Description
    Authentication Mode  SIMPLE, KERBEROS  Apache HDFS clusters can be accessed through SIMPLE or KERBEROS authentication.

  • Read capabilities

    Configuration Item      Supported  Description
    Incremental read        √          You can configure the variable path and scheduling to trigger incremental synchronization based on time or file changes.
    Supported file formats  Binary     Raw binary files can be read. This is applicable to migration between file systems.
                            CSV        The standard CSV format is supported. Delimiters and encoding modes can be identified.
                            PARQUET    The columnar storage format Parquet is supported, and native Parquet files can be read.
    Shard concurrency       √          Multiple threads can run concurrently to read data from files, improving the throughput.
    Dirty data processing   √          Abnormal data can be written to the dirty data bucket to prevent job failures caused by a small amount of abnormal data.
    Custom fields           √          You can add computed columns, constant columns, or masking functions for tasks to meet personalized service requirements.

  • Write capabilities

    Configuration Item      Supported  Description
    Supported file formats  Binary     Raw binary files can be written. This is applicable to migration between file systems.
                            CSV        The standard CSV format is supported. Delimiters and encoding modes can be identified.
    Concurrent write        √          Data can be written concurrently to improve efficiency.
    Dirty data processing   x          Abnormal data cannot be written to the dirty data bucket.
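The incremental read capability listed above combines a variable path with a schedule, so that each run reads only the directory for its own time window. The sketch below illustrates the idea in plain Python. It is not the variable syntax used by DataArts Migration. The `resolve_path` helper, the `strftime`-style template, and the `/data/orders/dt=...` directory layout are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def resolve_path(template: str, run_time: datetime, offset_days: int = 0) -> str:
    """Expand strftime placeholders in template for the run time shifted by offset_days."""
    return (run_time + timedelta(days=offset_days)).strftime(template)


# Hypothetical layout: one HDFS directory per day.
TEMPLATE = "/data/orders/dt=%Y-%m-%d/"

# A daily run scheduled at 2024-03-02 01:00 reads the previous day's directory.
run = datetime(2024, 3, 2, 1, 0)
print(resolve_path(TEMPLATE, run, offset_days=-1))  # /data/orders/dt=2024-03-01/
```

Because the path is derived from the scheduled run time rather than the wall clock, rerunning a failed instance for the same scheduled time resolves to the same directory.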

Creating a Data Source

Create a data source in Management Center. For details, see Configuring Data Connection Parameters.

Creating an Offline Data Migration Job

Create an Apache HDFS migration job in DataArts Factory. For details, see Creating an Offline Processing Migration Job.