
Hudi table initialization

  1. Usually, a Spark job is used to initialize and import the inventory data. Because the initialization data volume is large, you are advised to use APIs so that sufficient resources can be provided for the initialization (see the Spark sketch after this list).
  2. When a Flink or Spark streaming job needs to write data in real time after the batch initialization, you are advised to filter the messages and consume them from a specified time range to control the amount of duplicate data that is ingested. (For example, after the Spark initialization is complete, Flink filters out data generated more than two hours earlier when consuming Kafka; see the Flink sketch after this list.) If the Kafka messages cannot be filtered, you can first start real-time ingestion to generate consumption offsets, truncate the table, import the historical data, and then re-enable real-time ingestion.
  3. If the table already contains data and is not truncated before the batch initialization, the batch data is written into large log files, which puts heavy pressure on subsequent compaction and requires more resources.
  4. Hudi tables are registered in the Hive metastore. There should be one internal table (manually created) and two external tables (automatically created after data is written).
  5. The two external tables are suffixed with _ro (read-optimized view: queries read only the merged Parquet base files) and _rt (real-time view: queries read the latest version of the data, including data written in real time). See the query sketch after this list.
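The following is a minimal sketch of the batch initialization described in item 1, using the Spark datasource API with a Hudi bulk_insert. The source path, table name, and key/partition/precombine fields are placeholders, not values from this document; executor resources are normally sized at submission time.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HudiBulkInit {
    public static void main(String[] args) {
        // Resources are usually granted when the job is submitted, for example:
        // spark-submit --num-executors 20 --executor-cores 4 --executor-memory 8g ...
        SparkSession spark = SparkSession.builder()
                .appName("hudi-bulk-init")
                .getOrCreate();

        // Hypothetical inventory data set used only for illustration.
        Dataset<Row> inventory = spark.read().parquet("hdfs:///data/inventory/");

        inventory.write().format("hudi")
                // bulk_insert is the usual choice for a one-off import of large inventory data.
                .option("hoodie.datasource.write.operation", "bulk_insert")
                .option("hoodie.datasource.write.recordkey.field", "id")
                .option("hoodie.datasource.write.precombine.field", "update_time")
                .option("hoodie.datasource.write.partitionpath.field", "dt")
                .option("hoodie.table.name", "demo_table")
                .mode(SaveMode.Append)
                .save("hdfs:///warehouse/hudi/demo_table");
    }
}
```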
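For the time-range filtering in item 2, one way to skip data that the batch initialization already covered is to let the Flink Kafka connector start from a timestamp. The sketch below starts consumption two hours before the current time; the broker address, topic, and consumer group are assumed placeholders.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaFromTimestamp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Consume only messages produced within the last two hours, so records that
        // were already imported by the Spark initialization are not ingested again.
        long twoHoursAgo = System.currentTimeMillis() - 2 * 60 * 60 * 1000L;

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka-broker:9092")
                .setTopics("cdc_topic")
                .setGroupId("hudi_streaming_writer")
                .setStartingOffsets(OffsetsInitializer.timestamp(twoHoursAgo))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> stream =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");

        // In a real job the stream would be transformed and written to the Hudi table.
        stream.print();
        env.execute("consume-from-timestamp");
    }
}
```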
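To illustrate item 5, the snippet below queries the two automatically created views of a hypothetical table named demo_table through Spark SQL with Hive support; the database and table names are placeholders.

```java
import org.apache.spark.sql.SparkSession;

public class HudiViewQueries {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hudi-view-queries")
                .enableHiveSupport()
                .getOrCreate();

        // Read-optimized view (_ro): reads only the compacted Parquet base files,
        // so updates still sitting in log files are not yet visible.
        spark.sql("SELECT count(*) FROM default.demo_table_ro").show();

        // Real-time view (_rt): merges base files with log files at query time,
        // so the latest written data is visible at a higher query cost.
        spark.sql("SELECT count(*) FROM default.demo_table_rt").show();
    }
}
```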