
Optimizing the Parameters of a Job for Migrating Data from MySQL to MRS Hudi

Updated on 2025-02-18 GMT+08:00

Optimizing Source Parameters

Optimization of data extraction from MySQL

You can click Add Custom Attribute in the Configure Task area and add MySQL synchronization parameters.

Figure 1 Adding custom attributes

The following tuning parameters are available.

Table 1 Tuning parameters for full data synchronization

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| scan.incremental.snapshot.backfill.skip | boolean | true | Whether to skip reading binlogs during the snapshot phase. The default value is true. Skipping binlog reads effectively reduces memory usage, but provides only an at-least-once guarantee. |
| scan.incremental.snapshot.chunk.size | int | 50000 | Shard size: the maximum number of records in a single shard, which determines the number of shards in the full migration phase. The larger the shard, the more records it holds and the fewer shards there are. If a table has a very large number of records, it is split into many shards, and the shard metadata can occupy too much memory; in that case, increase the shard size to reduce the shard count. If scan.incremental.snapshot.backfill.skip is false, the job caches the data of an entire shard, so a larger shard occupies more memory and can cause memory overflow; in that case, reduce the shard size. |
| scan.snapshot.fetch.size | int | 1024 | Maximum number of records fetched from the MySQL database in a single request during full data extraction. Increasing this value reduces the number of requests sent to MySQL and improves performance. |
| debezium.max.queue.size | int | 8192 | Maximum number of records in the data cache queue. If individual records in the source table are large (for example, 1 MB), caching too many records causes memory overflow; reduce this value. |
| debezium.max.queue.size.in.bytes | int | 0 | Maximum size, in bytes, of the data cache queue. The default value 0 means the queue is bounded by record count rather than by data size. If debezium.max.queue.size cannot effectively limit memory usage, set this parameter explicitly to cap the cached data by size. |
| jdbc.properties.socketTimeout | int | 300000 | Socket timeout, in milliseconds, for connections to the MySQL database in the full migration phase. The default value is 5 minutes. If MySQL is overloaded and the job fails with a SocketTimeout exception, increase this value. |
| jdbc.properties.connectTimeout | int | 60000 | Connection timeout, in milliseconds, for connections to the MySQL database in the full migration phase. The default value is 1 minute. If MySQL is overloaded and the job fails with a ConnectTimeout exception, increase this value. |
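As an illustration, a full-load job that overflows memory on a table with large rows might combine the attributes above as follows. The values are examples, not recommendations; tune them against the job's memory monitoring:

```properties
# Skip binlog backfill during snapshot reads (at-least-once guarantee only)
scan.incremental.snapshot.backfill.skip=true
# Smaller shards so a cached shard fits in memory
scan.incremental.snapshot.chunk.size=20000
# Cap the cache queue by count and by bytes (~64 MB) since rows are large
debezium.max.queue.size=1024
debezium.max.queue.size.in.bytes=67108864
# Loosen JDBC timeouts for an overloaded source (values in milliseconds)
jdbc.properties.socketTimeout=600000
jdbc.properties.connectTimeout=120000
```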

Table 2 Tuning parameters for incremental data synchronization

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| debezium.max.queue.size | int | 8192 | Maximum number of records in the data cache queue. If individual records in the source table are large (for example, 1 MB), caching too many records causes memory overflow; reduce this value. |
| debezium.max.queue.size.in.bytes | int | 0 | Maximum size, in bytes, of the data cache queue. The default value 0 means the queue is bounded by record count rather than by data size. If debezium.max.queue.size cannot effectively limit memory usage, set this parameter explicitly to cap the cached data by size. |

Optimizing Destination Parameters

Optimization of data writing to Hudi

If data is written to the Hudi table slowly, first check whether the table is properly designed. You are advised to use an MOR table with Hudi bucket indexes and to configure an appropriate number of buckets to achieve optimal migration performance.

NOTE:
  • Using bucket indexes: You can configure the index.type and hoodie.bucket.index.num.buckets attributes in Global Configuration of Hudi Table Attributes or in Edit Table Attribute of the mapped table.
  • Determine whether to use a partitioned or non-partitioned table.

    There are two types of tables: fact tables and dimension tables.

    • Fact tables generally hold a large amount of data, most of which is new and only a small proportion of which is updates to data from a recent period (years, months, or days). Downstream systems that read a fact table for ETL calculation typically filter by data creation time (for example, the last day, month, or year), so splitting the table by time into partitions ensures optimal read and write performance.
    • Dimension tables generally hold a small amount of data, most of which is updated data and only a small proportion of which is new. The data volume of a dimension table is stable, and ETL calculations such as joins read all of its data, so a non-partitioned table is more suitable and performs better.
  • Determine the number of buckets in the table.

    If you use a Hudi bucket table, you need to set the number of buckets, which affects table performance.

    • Number of buckets for a non-partitioned table = MAX(CEIL(Table data volume (GB)/2 GB) x 2, 4)
    • Number of buckets for a partitioned table = MAX(CEIL(Data volume of a single partition (GB)/2 GB) x 2, 1)

      Notes:

    • Use the total (uncompressed) data volume of the table, not the size of its compressed files.
    • An even number is preferred for the number of buckets. The minimum is 4 for a non-partitioned table and 1 for a partitioned table.
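The bucket-count formulas above can be sketched as a small helper. Reading "data volume/2 GB x 2, rounded up" as CEIL(volume/2) x 2 is an assumption, chosen because it always yields the preferred even number:

```python
import math

def hudi_bucket_count(data_volume_gb: float, partitioned: bool = False) -> int:
    """Suggested Hudi bucket count per the formulas above (illustrative helper).

    data_volume_gb: total table data volume for a non-partitioned table, or
    the data volume of a single partition for a partitioned table. Use the
    uncompressed data volume, not the compressed file size.
    """
    # CEIL(volume / 2 GB) * 2 always yields an even number of buckets
    buckets = math.ceil(data_volume_gb / 2) * 2
    # Minimum: 4 for non-partitioned tables, 1 for partitioned tables
    return max(buckets, 1 if partitioned else 4)
```

For example, a 100 GB non-partitioned table yields 100 buckets, while a 3 GB partition yields 4; the result would be set as hoodie.bucket.index.num.buckets.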

In addition, you can click Global Configuration of Hudi Table Attributes in the Hudi destination configuration or click Edit Table Attribute in the mapped table to add optimization parameters.

Figure 2 Adding custom attributes
Table 3 Parameters for optimizing Hudi writing

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| hoodie.sink.flush.tasks | int | 1 | Number of concurrent Hudi flush tasks. The default value 1 means sequential writing. If a commit involves a large number of FileGroups (for example, when a large amount of historical data in the source table is updated), increase this value. FileGroups flushed by a single thread = FileGroups committed at a time / job parallelism. Recommended values: 2 if a single thread flushes at most 5 FileGroups; 5 for at most 10; 10 for at most 25; 20 for at most 50; 30 for more than 50. More concurrent flush tasks consume more memory during flushing, so adjust the value based on the memory monitoring of the real-time processing migration job. |
| hoodie.context.flatmap.parallelism | int | 1 | Parallelism of the partition scans Hudi performs during commit operations. By default, one scan is performed at a time. If a commit involves many partitions, increase this value to accelerate the commit. Recommended values: 5 if at most 10 partitions are committed at a time; 10 for at most 25; 20 for at most 50; 30 for more than 50. |
| compaction.async.enabled | boolean | true | Whether to enable compaction. The default value is true, meaning compaction is enabled for Hudi. Compaction affects the write performance of a real-time migration job, so to keep the migration job stable you can set this parameter to false and run Hudi compaction as separate Spark jobs on MRS. For details, see How Do I Configure a Periodic Spark Task for Hudi Compaction? |
| compaction.delta_commits | int | 5 | Number of commits between the compaction requests generated by a real-time processing migration job. The default value 5 generates a compaction request every five commits. Generating compaction requests less often reduces compaction frequency and improves job performance, so if the incremental Hudi data volume is small, you can increase this value. |
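The threshold recommendations for hoodie.sink.flush.tasks and hoodie.context.flatmap.parallelism can be sketched as lookup helpers. The function names are illustrative, not part of the product:

```python
def recommended_flush_tasks(filegroups_per_commit: int, parallelism: int) -> int:
    """Suggested hoodie.sink.flush.tasks per the thresholds above."""
    # FileGroups flushed by a single thread = FileGroups per commit / parallelism
    per_thread = filegroups_per_commit / parallelism
    for limit, value in ((5, 2), (10, 5), (25, 10), (50, 20)):
        if per_thread <= limit:
            return value
    return 30  # more than 50 FileGroups per thread

def recommended_flatmap_parallelism(partitions_per_commit: int) -> int:
    """Suggested hoodie.context.flatmap.parallelism per the thresholds above."""
    for limit, value in ((10, 5), (25, 10), (50, 20)):
        if partitions_per_commit <= limit:
            return value
    return 30  # more than 50 partitions per commit
```

For example, a commit of 200 FileGroups on a job with parallelism 4 flushes 50 FileGroups per thread, giving a recommended flush-task count of 20; remember that higher values increase memory usage during flushing.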
