Hudi Table Initialization
Updated on 2025-04-15 GMT+08:00
- The initial import of existing data is usually done by a Spark job. Because the initial data volume is typically large, you are advised to use the API-based write and allocate sufficient resources to the job (see the first sketch after this list).
- In scenarios where a Flink or Spark streaming job must write in real time after the batch initialization, you are advised to limit the amount of duplicate data ingested by filtering messages to a specified time range (for example, after the Spark initialization is complete, the Flink job filters out data older than 2 hours when consuming from Kafka; see the second sketch after this list). If the Kafka messages cannot be filtered, consider first starting real-time ingestion to establish a consumer offset, then truncating the table, performing the historical import, and finally restarting real-time ingestion.
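The following is a minimal sketch of such an initial import using Spark with the bulk_insert operation. The table name, paths, key columns, and resource settings are hypothetical placeholders and must be sized to the actual data volume.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiInitialLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-initial-load")
      // One-off initialization: size executors to the historical data volume.
      .config("spark.executor.instances", "20")
      .config("spark.executor.memory", "8g")
      .getOrCreate()

    // Existing (historical) data to be imported.
    val historicalDf = spark.read.parquet("/data/source/history")

    historicalDf.write.format("hudi")
      .option("hoodie.table.name", "demo_table")
      .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "update_time")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      // bulk_insert skips the index lookup done by upsert, which suits a first load.
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .mode(SaveMode.Overwrite)
      .save("/warehouse/hudi/demo_table")

    spark.stop()
  }
}
```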
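The second sketch shows the time-range filter on the streaming side, here with Spark Structured Streaming (the same idea applies to a Flink job). The broker addresses, topic, record schema, and the 2-hour cutoff are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object HudiStreamingIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hudi-streaming-ingest").getOrCreate()

    // Hypothetical record schema carried in the Kafka message value.
    val schema = new StructType()
      .add("id", StringType)
      .add("update_time", TimestampType)
      .add("dt", StringType)
      .add("amount", DoubleType)

    val parsed = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder brokers
      .option("subscribe", "demo_topic")                 // placeholder topic
      .load()
      .select(from_json(col("value").cast("string"), schema).as("r"), col("timestamp"))
      // Drop messages older than 2 hours: those records were already loaded by
      // the batch initialization, so re-ingesting them would only create
      // duplicate updates and extra log files.
      .where(expr("timestamp > current_timestamp() - INTERVAL 2 HOURS"))
      .select("r.*")

    parsed.writeStream
      .format("hudi")
      .option("hoodie.table.name", "demo_table")
      .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "update_time")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .option("hoodie.datasource.write.operation", "upsert")
      .option("checkpointLocation", "/warehouse/hudi/demo_table/_ckpt")
      .start("/warehouse/hudi/demo_table")
      .awaitTermination()
  }
}
```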

- If the table already contains data and is not truncated before the batch initialization, the initialization will produce very large log files, which puts significant pressure on subsequent compaction and requires more resources to complete.
- In the Hive metadata, a Hudi table should have one internal table (created manually) and two external tables (created automatically after data is written).
- The two external tables are suffixed _ro and _rt: the _ro table (read-optimized view) reads only the compacted Parquet files, and the _rt table (real-time view) reads the latest version of the data written in real time (base files merged with log files). A query sketch follows this list.
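As an illustration, the sketch below queries both views through Spark SQL, assuming the table from the earlier sketches has been synced to Hive as demo_table (the database and table names are placeholders).

```scala
import org.apache.spark.sql.SparkSession

object HudiViewQuery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-view-query")
      .enableHiveSupport()
      .getOrCreate()

    // Read-optimized view: only compacted Parquet files; faster, but may not
    // include the most recent writes that are still in log files.
    spark.sql("SELECT COUNT(*) FROM default.demo_table_ro").show()

    // Real-time view: merges Parquet base files with log files on read, so it
    // returns the latest written data at a higher query cost.
    spark.sql("SELECT COUNT(*) FROM default.demo_table_rt").show()

    spark.stop()
  }
}
```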
Parent topic: Bucket Tuning