Apache HDFS

DataArts Migration can efficiently migrate data from and to Apache HDFS.

Preparation and Constraints

Network requirements
The Apache HDFS data source must be able to communicate with CDM so that data can be transmitted. For details, see Enabling Network Connectivity.
Enabling access ports: Open the ports listed in Table 1. The default port numbers may differ between Hadoop versions and deployments. If the defaults have been changed in your cluster, open the ports that are actually configured.

Table 1 Service ports

Service	Port Type	Port Number	Usage
HDFS	TCP	8020	HDFS 2.x NameNode service port
HDFS	TCP	9820	HDFS 3.x NameNode service port
HDFS	TCP	9866	HDFS DataNode service port
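
Before creating a connection, it can help to confirm that the ports in Table 1 accept TCP connections. The following is a minimal sketch, not part of DataArts Migration: the function names are illustrative, and the check only shows reachability from the machine that runs the script, which is not necessarily the network path that CDM uses.

```python
import socket
from typing import Dict, Mapping

# Default ports from Table 1. Adjust them if your cluster overrides the defaults.
HDFS_PORTS: Dict[str, int] = {
    "NameNode (HDFS 2.x)": 8020,
    "NameNode (HDFS 3.x)": 9820,
    "DataNode": 9866,
}


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_hdfs_ports(host: str, ports: Mapping[str, int] = HDFS_PORTS) -> Dict[str, bool]:
    """Map each service label to whether its port accepted a connection."""
    return {label: is_port_open(host, port) for label, port in ports.items()}
```

Calling `check_hdfs_ports("namenode.example.com")`, where the host name is a placeholder, returns a dictionary with one True or False value per service. A cluster normally exposes only one of the two NameNode ports, so one False entry for the NameNode is expected.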

Supported Migration Scenarios

DataArts Migration supports the following modes for synchronizing on-premises data:

Single table synchronization
A single table or file is synchronized when data is ingested into a data lake or migrated to the cloud.
Database and table shard synchronization
Data from multiple databases and tables is synchronized when data is ingested into a data lake or migrated to the cloud.
Entire DB migration
An entire on-premises database is synchronized when data is ingested into a data lake or migrated to the cloud.

Database and table shard synchronization and entire DB migration are not supported in all regions. The following table lists the supported Apache HDFS migration scenarios.

Supported Migration Scenario	Single Table Read	Single Table Write	Database/Table Shard Read	Database/Table Shard Write	Entire DB Read	Entire DB Write
Supported	√	√	x	√	x	x

Core Capabilities

Connection configuration

Configuration Item	Supported	Description
Authentication Mode	SIMPLE, KERBEROS	Apache HDFS clusters can be accessed through SIMPLE or KERBEROS authentication.
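
As a rough illustration of how the two authentication modes differ in what they require, the sketch below validates a mode and builds a small settings dictionary. `hadoop.security.authentication` is the standard Hadoop client property. The function name and the `principal` and `keytab_path` keys are hypothetical, not DataArts Migration parameter names, and the sketch assumes keytab-based Kerberos login.

```python
def hadoop_auth_properties(mode: str, principal: str = "", keytab_path: str = "") -> dict:
    """Build client-side settings for SIMPLE or KERBEROS authentication."""
    normalized = mode.strip().upper()
    if normalized == "SIMPLE":
        # SIMPLE mode needs no credentials beyond the user name of the caller.
        return {"hadoop.security.authentication": "simple"}
    if normalized == "KERBEROS":
        # Assumption: login uses a principal plus a keytab file.
        if not principal or not keytab_path:
            raise ValueError("KERBEROS mode needs a principal and a keytab path")
        return {
            "hadoop.security.authentication": "kerberos",
            "principal": principal,      # hypothetical key name
            "keytab_path": keytab_path,  # hypothetical key name
        }
    raise ValueError(f"unsupported authentication mode: {mode!r}")
```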

Read capabilities

Configuration Item	Supported	Description
Incremental read	√	You can configure the variable path and scheduling to trigger incremental synchronization based on time or file changes.
Supported file formats	Binary, CSV, Parquet	Binary: Raw binary files can be read, which suits migration between file systems. CSV: The standard CSV format is supported, and delimiters and encodings can be identified. Parquet: Native files in the columnar Parquet format can be read.
Shard concurrency	√	Multiple threads can run concurrently to read data from files, significantly improving the throughput.
Dirty data processing	√	Abnormal data can be written to the dirty data bucket to prevent job failures caused by a small amount of abnormal data.
Custom fields	√	You can add computed columns, constant columns, or masking functions for tasks to meet personalized service requirements.
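
The idea behind dirty data processing can be sketched as follows: rows that do not match the expected shape are set aside instead of failing the whole job, and the job stops only if too many rows are abnormal. This is an illustrative sketch, not the DataArts Migration implementation. The function name, the column-count check, and the `max_dirty` threshold are assumptions, and the "dirty data bucket" is represented by an in-memory list.

```python
import csv
import io


def split_clean_and_dirty(csv_text, expected_columns, delimiter=",", max_dirty=10):
    """Separate well-formed CSV rows from abnormal ones.

    A row counts as dirty here when its column count differs from
    expected_columns. Raises RuntimeError once more than max_dirty
    dirty rows have been seen.
    """
    clean, dirty = [], []
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) == expected_columns:
            clean.append(row)
        else:
            # Keep the source line number so the record can be traced later.
            dirty.append((reader.line_num, row))
            if len(dirty) > max_dirty:
                raise RuntimeError(f"more than {max_dirty} dirty rows, aborting")
    return clean, dirty
```

For example, `split_clean_and_dirty("1,a\n2,b,extra\n3,c\n", expected_columns=2)` keeps rows 1 and 3 as clean data and reports line 2 as dirty.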

Write capabilities

Configuration Item	Supported	Description
Supported file formats	Binary, CSV	Binary: Raw binary files can be written, which suits migration between file systems. CSV: The standard CSV format is supported, and delimiters and encodings can be identified.
Concurrent write	√	Concurrent write improves efficiency.
Dirty data processing	x	Not supported. Abnormal data cannot be written to a dirty data bucket when data is written to Apache HDFS.

Creating a Data Source

Create a data source in Management Center. For details, see Configuring Data Connection Parameters.

Creating an Offline Data Migration Job

Create an Apache HDFS migration job in DataArts Factory. For details, see Creating an Offline Processing Migration Job.
