Updated on 2024-12-11 GMT+08:00

Hudi Table Overview

Table Type

  • Copy On Write

    Copy-on-write (COW) tables store data in Parquet files. Updates are applied by rewriting the affected Parquet files.

    • Advantage: Reads are efficient because only one data file in the corresponding partition needs to be read.
    • Disadvantage: Each write copies the previous version of the data file and then generates a new data file from that copy. This process is time-consuming, so the data returned to read requests lags behind the latest writes.
  • Merge On Read

    Merge-on-read (MOR) tables store data in a hybrid format that combines columnar Parquet files with row-based Avro files. Parquet files store the base data, and Avro files (also called log files) store incremental data.

    • Advantage: Data is written to the delta log first, and the delta log size is small. Therefore, the write cost is low.
    • Disadvantage: Files must be compacted periodically; otherwise, a large number of fragment files accumulate. Read performance is lower because delta logs and old data files must be merged at read time.
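
The trade-off between the two table types can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions, not how Hudi is implemented: the class names `CowFileGroup` and `MorFileGroup` and the record counters are hypothetical, a dictionary stands in for a Parquet base file, and a list of dictionaries stands in for Avro log files.

```python
# Toy model of the two table types (illustrative only, not the Hudi implementation).

class CowFileGroup:
    """Copy-on-write: every update rewrites the whole base file."""

    def __init__(self, records):
        self.base = dict(records)       # stands in for one Parquet data file
        self.rewritten_records = 0      # rough measure of write cost

    def upsert(self, updates):
        new_base = dict(self.base)      # copy the previous version
        new_base.update(updates)        # apply the changes to the copy
        self.rewritten_records += len(new_base)
        self.base = new_base            # the new file replaces the old one

    def read(self):
        return dict(self.base)          # a read touches only the base file


class MorFileGroup:
    """Merge-on-read: updates go to a log; reads merge base and log."""

    def __init__(self, records):
        self.base = dict(records)       # stands in for the Parquet base file
        self.log = []                   # stands in for Avro log files
        self.written_records = 0        # rough measure of write cost

    def upsert(self, updates):
        self.log.append(dict(updates))  # append only the delta
        self.written_records += len(updates)

    def read(self):
        merged = dict(self.base)
        for delta in self.log:          # merge work grows with the log
            merged.update(delta)
        return merged

    def compact(self):
        self.base = self.read()         # fold the log into a new base file
        self.log = []


initial = {i: "v0" for i in range(1000)}
cow, mor = CowFileGroup(initial), MorFileGroup(initial)
cow.upsert({1: "v1"})
mor.upsert({1: "v1"})
print(cow.rewritten_records, mor.written_records)
```

In this model a single-record update makes the COW group rewrite all 1000 records, while the MOR group writes one record but must merge the log on every read until `compact()` is called.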

Hudi Table Storage

When writing data, Hudi generates a Hudi table based on attributes such as the storage path, table name, and partition structure.
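
As a rough illustration of how these attributes are supplied, the following sketch builds the option map that a Spark job might pass to the Hudi data source. The helper `hudi_write_options` and the sample table and field names are hypothetical; the option keys are the commonly documented Hudi write configurations, but verify them against the Hudi version in your cluster.

```python
# Sketch: collect the attributes that define a Hudi table into write options.
# The helper name and the sample values below are assumptions for illustration.

VALID_TABLE_TYPES = ("COPY_ON_WRITE", "MERGE_ON_READ")


def hudi_write_options(table_name, table_type, record_key,
                       partition_field, precombine_field):
    """Return a dict of Hudi data source write options."""
    if table_type not in VALID_TABLE_TYPES:
        raise ValueError(f"table_type must be one of {VALID_TABLE_TYPES}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.table.type": table_type,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.partitionpath.field": partition_field,
        "hoodie.datasource.write.precombine.field": precombine_field,
    }


opts = hudi_write_options(
    table_name="demo_table",          # hypothetical table name
    table_type="MERGE_ON_READ",
    record_key="id",                  # hypothetical column names
    partition_field="dt",
    precombine_field="ts",
)

# With a Spark DataFrame `df` and a storage path `base_path`, the options
# would typically be used like this (not executed here):
#   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

The storage path is passed separately to `save()`, whereas the table name and partition structure travel in the option map.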

Hudi table data files can be stored in the local file system of the OS or in a distributed file system such as HDFS. To ensure analysis performance and data reliability, HDFS is generally used for storage. Using HDFS as an example, Hudi table storage files are classified into two types:

  • The .hoodie folder stores the log files related to file merging.

  • The path containing _partition_key stores actual data files and metadata by partition.

    Hudi data is stored in Parquet base files and Avro log files.

    To view a Hudi table, log in to FusionInsight Manager and choose Cluster > Services > HDFS. On the Dashboard tab page, click the link next to NameNode WebUI. On the HDFS web UI that is displayed, choose Utilities > Browse the file system.
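
When browsing the table directory, the layout described above can be recognized from file paths alone. The following is a minimal sketch that assumes the usual Hudi naming conventions (base files end in `.parquet`, log file names contain `.log.`, and each partition directory holds a `.hoodie_partition_metadata` file); the function `classify` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def classify(path, table_root):
    """Classify a file under a Hudi table root by its path (heuristic)."""
    rel = PurePosixPath(path).relative_to(table_root)
    name = rel.name
    if rel.parts and rel.parts[0] == ".hoodie":
        return "table metadata"          # contents of the .hoodie folder
    if name.startswith(".hoodie_partition_metadata"):
        return "partition metadata"      # per-partition metadata file
    if name.endswith(".parquet"):
        return "base file"               # columnar base data
    if ".log." in name:
        return "log file"                # row-based incremental data
    return "other"


root = "/user/hive/warehouse/demo_table"   # hypothetical table path
examples = [
    root + "/.hoodie/hoodie.properties",
    root + "/dt=2024-12-11/.hoodie_partition_metadata",
    root + "/dt=2024-12-11/f1_0-1-0_20241211120000.parquet",
    root + "/dt=2024-12-11/.f1_20241211120000.log.1_0-1-0",
]
for p in examples:
    print(classify(p, root), p)
```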