
Hudi Table Model Design Specifications

Rules

  • A proper primary key must be set for the Hudi table.

    Hudi tables provide data update and idempotent write capabilities, which require that a primary key be set for the table. An improper primary key causes duplicate data. The primary key can be a single field or a composite of multiple fields, and its value cannot be null or empty. To set the primary key, see the following examples:

    SparkSQL:

    -- Use primaryKey to specify the primary key. For a composite primary key, separate the fields with commas (,).
    create table hudi_table (
      id1 int,
      id2 int,
      name string,
      price double
    ) using hudi
    options (
      primaryKey = 'id1,id2',
      preCombineField = 'price'
    );

    SparkDatasource:

    //Use hoodie.datasource.write.recordkey.field to specify the primary key.
    df.write.format("hudi").
      option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
      option("hoodie.datasource.write.precombine.field", "price").
      option("hoodie.datasource.write.recordkey.field", "id1,id2").
      option("hoodie.table.name", "hudi_table").
      mode("append").
      save(basePath)  // basePath: storage path of the Hudi table

    FlinkSQL:

    -- Use hoodie.datasource.write.recordkey.field to specify the primary key.
    create table hudi_table(
      id1 int,
      id2 int,
      name string,
      price double
    ) partitioned by (name) with (
      'connector' = 'hudi',
      'hoodie.datasource.write.recordkey.field' = 'id1,id2',
      'write.precombine.field' = 'price'
    );
  • The precombine field must be set in the Hudi table.

    During data synchronization, data may be written repeatedly or out of order, for example, during abnormal data recovery or an abnormal restart of the writer program. Setting the precombine field to a proper value ensures data accuracy: old data will not overwrite new data, which is the idempotent write capability. This field can be, for example, the update timestamp of the service table or the commit timestamp of the database. The value of the precombine field cannot be null or empty. You can set the precombine field by referring to the following examples:

    SparkSQL:

    -- Specify the precombine field by using the preCombineField option.
    create table hudi_table (
      id1 int,
      id2 int,
      name string,
      price double
    ) using hudi
    options (
      primaryKey = 'id1,id2',
      preCombineField = 'price'
    );

    SparkDatasource:

    //Specify the precombine field by using hoodie.datasource.write.precombine.field.
    df.write.format("hudi").
      option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
      option("hoodie.datasource.write.precombine.field", "price").
      option("hoodie.datasource.write.recordkey.field", "id1,id2").
      option("hoodie.table.name", "hudi_table").
      mode("append").
      save(basePath)  // basePath: storage path of the Hudi table

    Flink:

    -- Specify the precombine field by using write.precombine.field.
    create table hudi_table(
      id1 int,
      id2 int,
      name string,
      price double
    ) partitioned by (name) with (
      'connector' = 'hudi',
      'hoodie.datasource.write.recordkey.field' = 'id1,id2',
      'write.precombine.field' = 'price'
    );
  • Use the MOR table for streaming computing.

    Streaming computing is real-time computing with low latency and requires high-performance streaming read and write capabilities. Of the two Hudi table models, MOR and COW, the MOR table has better streaming read and write performance, so the MOR table model should be used in streaming computing scenarios. The following table compares the read and write performance of MOR and COW tables.

    Comparison Dimension    MOR Table    COW Table
    Streaming write         High         Low
    Streaming read          High         Low
    Batch write             High         Low
    Batch read              Low          High

  • Use the MOR table model for real-time ingestion into the data lake.

    Real-time ingestion into the data lake generally has latency requirements at the minute level. Based on the comparison of the Hudi table models above, the MOR table model should be selected for real-time ingestion scenarios, as illustrated in the sketch below.
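
    For reference, the following sketch uses the Spark DataSource API with a hypothetical DataFrame df and target path basePath; it shows how the MOR table type is selected through the hoodie.datasource.write.table.type option, with the other options following the examples above:

    // Minimal sketch, assuming df and basePath exist: write into a MERGE_ON_READ (MOR) table.
    df.write.format("hudi").
      option("hoodie.datasource.write.table.type", "MERGE_ON_READ").
      option("hoodie.datasource.write.recordkey.field", "id1,id2").
      option("hoodie.datasource.write.precombine.field", "price").
      option("hoodie.table.name", "hudi_table").
      mode("append").
      save(basePath)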

  • Use lowercase letters for Hudi table names and column names.

    When multiple engines read and write the same Hudi table, using lowercase table and column names avoids problems caused by differences in case sensitivity between engines.

Suggestions

  • In Spark batch processing scenarios where write latency requirements are not strict, use the COW table.

    The COW table model writes more slowly because of write amplification, but it provides very good read performance. Because batch computing is not sensitive to write latency, COW tables can be used, as shown in the sketch below.
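
    For reference, the following sketch uses the Spark DataSource API with a hypothetical DataFrame df and target path basePath; it shows a batch write into a COW table, where the bulk_insert operation is commonly used for initial batch loads:

    // Minimal sketch, assuming df and basePath exist: batch write into a COPY_ON_WRITE (COW) table.
    df.write.format("hudi").
      option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
      option("hoodie.datasource.write.operation", "bulk_insert").  // use "upsert" for incremental batch updates
      option("hoodie.datasource.write.recordkey.field", "id1,id2").
      option("hoodie.datasource.write.precombine.field", "price").
      option("hoodie.table.name", "hudi_table").
      mode("append").
      save(basePath)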

  • Enable the Hive metadata synchronization function for Hudi table write tasks.

    SparkSQL is natively integrated with Hive, so no additional metadata configuration is required. This suggestion applies to scenarios where the Hudi table is written through the Spark DataSource API or Flink. In those cases, add the configuration items for synchronizing metadata to Hive. This configuration centrally manages the metadata of Hudi tables in the Hive metadata service, which facilitates cross-engine data operations and data management. The sketch below shows the common synchronization options.
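
    For reference, the following sketch uses the Spark DataSource API with hypothetical database, table, and path names; it shows the common Hive synchronization options added to a Hudi write:

    // Minimal sketch, assuming df and basePath exist: enable Hive metadata synchronization for the write.
    df.write.format("hudi").
      option("hoodie.datasource.write.recordkey.field", "id1,id2").
      option("hoodie.datasource.write.precombine.field", "price").
      option("hoodie.table.name", "hudi_table").
      option("hoodie.datasource.hive_sync.enable", "true").
      option("hoodie.datasource.hive_sync.mode", "hms").           // synchronize through the Hive metastore service
      option("hoodie.datasource.hive_sync.database", "default").   // target Hive database (example)
      option("hoodie.datasource.hive_sync.table", "hudi_table").   // target Hive table name (example)
      mode("append").
      save(basePath)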