Updated on 2026-05-20 GMT+08:00

Apache Hive

DataArts Migration supports the mainstream versions of open-source Apache Hive, meeting your data synchronization requirements in various deployment environments.

Preparation and Constraints

  • Network requirements

    Ensure that the Apache Hive data source can communicate with CDM so that data can be transmitted smoothly. For details, see Enabling Network Connectivity.

  • Enabling access ports: When configuring the Apache Hive data source, ensure that the following ports have been enabled in the security group or network so that DataArts Migration can access MRS.
    Table 1 Service ports

    | Service | Port Type | Port Number | Usage |
    |---|---|---|---|
    | Hive | TCP | 10000 | JDBC/ODBC interface of HiveServer, which is used by DataArts Migration to submit SQL statements and obtain results |
    | Hive | TCP | 9083 | Hive Metastore interface, which is used by HiveServer to obtain metadata such as databases, tables, and partitions |
    | HDFS | TCP | 8020 | Remote Procedure Call (RPC) port for NameNode, which is used by the client to establish the file system context and obtain block locations |
    | HDFS | TCP | 9866 | Streaming port for DataNode, which is used to read and write table files (ORC/Parquet) |
    | ZooKeeper | TCP | 2181 | ZooKeeper quorum on which HiveServer/Metastore HA and HDFS NameNode HA depend |
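
Before configuring the data source, the ports in Table 1 can be probed from a host that uses the same network path as the migration service. The following Python sketch is only an illustration and is not part of DataArts Migration. The port numbers are the values from Table 1 (a cluster may be configured differently), and the host name in the usage comment is a placeholder.

```python
import socket

# Port numbers taken from Table 1; a cluster may use different values.
DEFAULT_PORTS = {
    "HiveServer (JDBC/ODBC)": 10000,
    "Hive Metastore": 9083,
    "HDFS NameNode RPC": 8020,
    "HDFS DataNode streaming": 9866,
    "ZooKeeper": 2181,
}


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def check_ports(host: str, ports: dict = DEFAULT_PORTS) -> dict:
    """Probe every port on one host and return {service name: reachable}."""
    return {name: is_port_open(host, port) for name, port in ports.items()}


# Usage (placeholder host name, replace with a real node address):
# for name, ok in check_ports("hive-node.example.com").items():
#     print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

In a real cluster the services usually run on different nodes, so each port should be checked against the node that hosts the corresponding service. A successful TCP connection only shows that the port is reachable, not that authentication or the service itself works.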

Supported Data Types

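The following table lists the Hive data types and whether DataArts Migration can read them (√: supported; x: not supported).

| Category | Hive Data Type | Read |
|---|---|---|
| String | CHAR | √ |
| String | VARCHAR | √ |
| String | STRING | √ |
| Integer | TINYINT | √ |
| Integer | SMALLINT | √ |
| Integer | INT | √ |
| Integer | INTEGER | √ |
| Integer | BIGINT | √ |
| Floating point | FLOAT | √ |
| Floating point | DOUBLE | √ |
| Floating point | DECIMAL | √ |
| Date/Time | TIMESTAMP | √ |
| Date/Time | DATE | √ |
| Boolean | BOOLEAN | √ |
| Binary | BINARY | √ |
| Complex type | ARRAY | √ |
| Complex type | MAP | √ |
| Complex type | STRUCT | x |
| Complex type | UNIONTYPE | x |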
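
The type list above can be turned into a simple pre-check of a source table schema. The following Python sketch is an illustration only. It assumes that the types listed without an x marker are readable, it inspects only the outermost type of each column (so a STRUCT nested inside an ARRAY is not detected), and the function names and the sample schema are hypothetical.

```python
import re

# Types marked "x" in the Read column of the table above.
UNSUPPORTED_TYPES = {"STRUCT", "UNIONTYPE"}

# Remaining types listed in the table; this sketch assumes they are readable.
LISTED_TYPES = {
    "CHAR", "VARCHAR", "STRING",
    "TINYINT", "SMALLINT", "INT", "INTEGER", "BIGINT",
    "FLOAT", "DOUBLE", "DECIMAL",
    "TIMESTAMP", "DATE",
    "BOOLEAN", "BINARY",
    "ARRAY", "MAP",
}


def base_type(hive_type: str) -> str:
    """Drop parameters such as (10,2) or <int> and return the upper-case type name."""
    match = re.match(r"\s*([A-Za-z_]+)", hive_type)
    if match is None:
        raise ValueError(f"cannot parse Hive type: {hive_type!r}")
    return match.group(1).upper()


def classify(hive_type: str) -> str:
    """Return 'readable', 'unsupported', or 'unlisted' for one column type."""
    name = base_type(hive_type)
    if name in UNSUPPORTED_TYPES:
        return "unsupported"
    if name in LISTED_TYPES:
        return "readable"
    return "unlisted"


def columns_needing_attention(schema: dict) -> dict:
    """Return {column: status} for every column that is not 'readable'."""
    statuses = {col: classify(typ) for col, typ in schema.items()}
    return {col: status for col, status in statuses.items() if status != "readable"}


# Usage with a hypothetical schema:
# columns_needing_attention({"id": "bigint", "addr": "struct<city:string>"})
```

Columns reported as unsupported or unlisted would need to be excluded or converted (for example, cast to STRING in the source query) before the table is synchronized.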

Supported Migration Scenarios

DataArts Migration supports the following modes for synchronizing on-premises data:

  • Single table synchronization

    DataArts Migration can synchronize a single table or file when data is ingested into a data lake or migrated to the cloud.

  • Database and table shard synchronization

    DataArts Migration can synchronize data from multiple databases and tables when data is ingested into a data lake or migrated to the cloud.

  • Entire DB migration

    DataArts Migration can synchronize data from an entire on-premises database when data is ingested into a data lake or migrated to the cloud.

Database and table shard synchronization and entire DB migration are not supported in all regions. The following table lists the supported Apache Hive migration scenarios.

Supported Migration Scenario

Single Table Read

Single Table Write

Database/Table Shard Read

Database/Table Shard Write

Entire DB Read

Entire DB Write

Supported

x

x

√ (supported in some regions)

Core Capabilities

  • Connection configuration

    | Configuration Item | Supported | Description |
    |---|---|---|
    | Kerberos authentication | √ | Kerberos authentication is used to access MRS clusters. |
    | Storage-compute decoupling | √ | The storage-compute decoupling architecture is supported, and data can be read from different Hive storage file systems, such as OBS and HDFS. |

  • Read capabilities

    | Configuration Item | Supported | Description |
    |---|---|---|
    | Read mode | JDBC/HDFS | Data can be read through JDBC or directly from HDFS files. JDBC is suitable for interactive queries and can read data flexibly using SQL syntax. For large data volumes, reading the files directly is more efficient because SQL parsing is skipped. |
    | Shard concurrency | √ | Horizontal sharding and multi-thread concurrent extraction significantly improve throughput and efficiency. Currently, concurrent reading is supported only for HDFS files. |
    | Custom fields | x | Custom fields add computed columns, constant columns, or masking functions to a task to meet specific service requirements. Currently, this function is not supported. |
    | Dirty data processing | √ | Abnormal data can be written to the dirty data bucket to prevent job failures caused by a small amount of abnormal data. |
    | Incremental read | √ | Incremental data can be read through partition filtering or SQL statements. |

  • Write capabilities

    | Configuration Item | Supported | Description |
    |---|---|---|
    | Write mode | INSERT INTO/INSERT OVERWRITE | Two write modes are supported: INSERT INTO and INSERT OVERWRITE. INSERT INTO appends data to the target table and is suitable for incremental data writing. INSERT OVERWRITE overwrites data in the target table or partition and is suitable for full data updates. |
    | Pre- and post-import processing | √ | Partitions can be cleared in truncate mode. |
    | Dirty data processing | x | Writing abnormal data to the dirty data bucket, which prevents job failures caused by a small amount of abnormal data, is currently not supported. |
    | Concurrent write | √ | Concurrent write fully utilizes cluster resources to improve the data write speed. |
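
Incremental read through partition filtering and the two write modes described above correspond to ordinary HiveQL statements. The following Python sketch only builds the statement text to make the difference concrete. It is an illustration, not the SQL that DataArts Migration generates internally. The table names, the partition column dt, and the helper function names are hypothetical.

```python
from typing import Optional, Sequence


def quote_literal(value: str) -> str:
    """Quote a HiveQL string literal, escaping backslashes and single quotes."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def incremental_select(table: str, columns: Sequence[str],
                       partition_col: str, partition_value: str) -> str:
    """Build a query that reads one partition instead of the whole table."""
    cols = ", ".join(columns)
    return (f"SELECT {cols} FROM {table} "
            f"WHERE {partition_col} = {quote_literal(partition_value)}")


def insert_statement(target: str, select_sql: str, overwrite: bool = False,
                     partition_col: Optional[str] = None,
                     partition_value: Optional[str] = None) -> str:
    """Build INSERT INTO (append) or INSERT OVERWRITE (replace) for a table or partition."""
    mode = "INSERT OVERWRITE TABLE" if overwrite else "INSERT INTO TABLE"
    partition = ""
    if partition_col is not None and partition_value is not None:
        # Static partition: the SELECT list must not repeat the partition column.
        partition = f" PARTITION ({partition_col}={quote_literal(partition_value)})"
    return f"{mode} {target}{partition} {select_sql}"


# Identifiers (table and column names) are inserted as-is and are not escaped,
# so they must come from trusted configuration, not from user input.
daily = incremental_select("sales.orders", ["id", "amount"], "dt", "2024-01-01")
append_sql = insert_statement("dw.orders", daily)
replace_sql = insert_statement("dw.orders", daily, overwrite=True,
                               partition_col="dt", partition_value="2024-01-01")
```

Here append_sql adds the selected rows to dw.orders, which matches incremental writing, while replace_sql replaces only the dt='2024-01-01' partition, so rerunning the same job for that day does not create duplicates.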

Creating a Data Source

Create a data source in Management Center. For details, see Configuring Data Connection Parameters.

Creating an Offline Data Migration Job

Create an Apache Hive migration job in DataArts Factory. For details, see Creating an Offline Processing Migration Job.