
Hudi table initialization

  1. Usually, a Spark job is used to initialize and import the inventory data. Because the initialization data volume is large, you are advised to use APIs so that sufficient resources can be provided for the initialization (see the Spark sketch after this list).
  2. When a Flink or Spark streaming job needs to write data in real time after the batch initialization, you are advised to filter the messages and consume them from a specified time range to control the amount of duplicate data that is ingested. (For example, after the Spark initialization is complete, Flink filters out data generated more than two hours earlier when consuming Kafka; see the Flink sketch after this list.) If the Kafka messages cannot be filtered, you can first start real-time ingestion to generate consumption offsets, truncate the table, import the historical data, and then re-enable real-time ingestion.
  3. If the table already contains data and is not truncated before the batch initialization, the batch data is written into large log files, which puts heavy pressure on subsequent compaction and requires more resources.
  4. Hudi tables are registered in the Hive metastore. There should be one internal table (manually created) and two external tables (automatically created after data is written).
  5. The two external tables are suffixed with _ro (read-optimized view: queries read only the merged Parquet base files) and _rt (real-time view: queries read the latest version of the data, including data written in real time). See the query sketch after this list.
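The following is a minimal sketch of the batch initialization described in item 1, using the Spark datasource API with a Hudi bulk_insert. The source path, table name, and key/partition/precombine fields are placeholders, not values from this document; executor resources are normally sized at submission time.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class HudiBulkInit {
    public static void main(String[] args) {
        // Resources are usually granted when the job is submitted, for example:
        // spark-submit --num-executors 20 --executor-cores 4 --executor-memory 8g ...
        SparkSession spark = SparkSession.builder()
                .appName("hudi-bulk-init")
                .getOrCreate();

        // Hypothetical inventory data set used only for illustration.
        Dataset<Row> inventory = spark.read().parquet("hdfs:///data/inventory/");

        inventory.write().format("hudi")
                // bulk_insert is the usual choice for a one-off import of large inventory data.
                .option("hoodie.datasource.write.operation", "bulk_insert")
                .option("hoodie.datasource.write.recordkey.field", "id")
                .option("hoodie.datasource.write.precombine.field", "update_time")
                .option("hoodie.datasource.write.partitionpath.field", "dt")
                .option("hoodie.table.name", "demo_table")
                .mode(SaveMode.Append)
                .save("hdfs:///warehouse/hudi/demo_table");
    }
}
```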
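For the time-range filtering in item 2, one way to skip data that the batch initialization already covered is to let the Flink Kafka connector start from a timestamp. The sketch below starts consumption two hours before the current time; the broker address, topic, and consumer group are assumed placeholders.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaFromTimestamp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Consume only messages produced within the last two hours, so records that
        // were already imported by the Spark initialization are not ingested again.
        long twoHoursAgo = System.currentTimeMillis() - 2 * 60 * 60 * 1000L;

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka-broker:9092")
                .setTopics("cdc_topic")
                .setGroupId("hudi_streaming_writer")
                .setStartingOffsets(OffsetsInitializer.timestamp(twoHoursAgo))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> stream =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");

        // In a real job the stream would be transformed and written to the Hudi table.
        stream.print();
        env.execute("consume-from-timestamp");
    }
}
```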
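To illustrate item 5, the snippet below queries the two automatically created views of a hypothetical table named demo_table through Spark SQL with Hive support; the database and table names are placeholders.

```java
import org.apache.spark.sql.SparkSession;

public class HudiViewQueries {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("hudi-view-queries")
                .enableHiveSupport()
                .getOrCreate();

        // Read-optimized view (_ro): reads only the compacted Parquet base files,
        // so updates still sitting in log files are not yet visible.
        spark.sql("SELECT count(*) FROM default.demo_table_ro").show();

        // Real-time view (_rt): merges base files with log files at query time,
        // so the latest written data is visible at a higher query cost.
        spark.sql("SELECT count(*) FROM default.demo_table_rt").show();
    }
}
```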