Updated on 2023-11-03 GMT+08:00

Hive Supporting Reading Hudi Tables

Hive External Tables Corresponding to Hudi Tables

A Hudi source table corresponds to a copy of HDFS data. The Hudi table data can be mapped to a Hive external table through the Spark component, Flink component, or Hudi client. Based on the external table, Hive can easily perform real-time query, read-optimized view query, and incremental view query.

  • Different view queries are provided for different types of Hudi source tables:
    • When the Hudi source table is a Copy-On-Write (COW) table, it can be mapped to a Hive external table. The table supports real-time query and incremental view query.
    • When the Hudi source table is a Merge-On-Read (MOR) table, it can be mapped to two Hive external tables (RO table and RT table). The RO table supports read-optimized view query, and the RT table supports real-time view query and incremental view query.
  • Hive external tables cannot be added, deleted, or modified (including insert, update, delete, load, merge, alter and msck). Only the query operation (select) is supported.
  • Granting table permissions: The update, alter, write, and all permissions cannot be modified.
  • Backup and restoration: The RO and RT tables are mapped from the same Hudi source table. When one table is backed up, the other table is also backed up. The same applies to restoration. Therefore, only one table needs to be backed up.
  • Component versions:
    • Hive: FusionInsight_HD_xxx; Hive kernel version 3.1.0.
    • Spark2x: FusionInsight_Spark2x_xxx; Hudi kernel version 0.11.0

Creating Hive External Tables Corresponding to Hudi Tables

Generally, Hudi table data is synchronized to Hive external tables when the data is imported to the lake. In this case, you can directly query the corresponding Hive external tables in Beeline. If the data is not synchronized to the Hive external tables, you can use the Hudi client tool run_hive_sync_tool.sh to synchronize data manually.

Querying Hive External Tables Corresponding to Hudi Tables

Prerequisites

Before using Hive to perform incremental query on Hudi tables, you need to set another three parameters in Table 1. The three parameters are table-level parameters. Each Hudi source table corresponds to three parameters, where hudisourcetablename indicates the name of the Hudi source table (not the name of the Hive external table).

Table 1 Parameter description

Parameter

Default Value

Description

hoodie. hudisourcetablename.consume.mode

None

Query mode of the Hudi table.

  • Incremental query: Set it to INCREMENTAL.
  • Non-incremental query: Do not set this parameter or set it to SNAPSHOT.

hoodie. hudisourcetablename.consume.start.timestamp

None

Start time of the incremental query on the Hudi table.

  • Incremental query: start time of the incremental query.
  • Non-incremental query: Do not set this parameter.

hoodie. hudisourcetablename.consume.max.commits

None

The incremental query on the Hudi table is based on the number of commits after hoodie.hudisourcetablename.consume.start.timestamp.

  • Incremental query: number of commits. For example, if this parameter is set to 3, data after three commits from the specified start time is queried. If this parameter is set to -1, all data committed after the specified start time is queried.
  • Non-incremental query: Do not set this parameter.

Querying a Hudi COW Table

For example, the name of a Hudi source table of the COW type is hudicow, and the name of the mapped Hive external table is hudicow.

  • Real-time view query on the COW table:

    Select * from hudicow;

  • Incremental query on the COW table: Set three incremental query parameters based on the name of the Hudi source table. The where clause of the incremental query statements must contain `_hoodie_commit_time`>'xxx', where xxx indicates the value of hoodie.hudisourcetablename.consume.start.timestamp.

    set hoodie.hudicow.consume.mode= INCREMENTAL;

    set hoodie.hudicow.consume.max.commits=3;

    set hoodie.hudicow.consume.start.timestamp= 20200427114546;

    select count(*) from hudicow where `_hoodie_commit_time`>'20200427114546';

Querying a Hudi MOR Table

For example, the name of a Hudi source table of the MOR type is hudimor, and the two mapped Hive external tables are hudimor_ro (RO table) and hudimor_rt (RT table).

  • Read-optimized view query on the RO table:

    Select * from hudicow_ro;

  • Real-time view query on the RT table:

    Select * from hudicow_rt;

  • Incremental query on the RT table: Set three incremental query parameters based on the name of the Hudi source table. The where clause of the incremental query statements must contain `_hoodie_commit_time`>'xxx', where xxx indicates the value of hoodie.hudisourcetablename.consume.start.timestamp.

    set hoodie.hudimor.consume.mode=INCREMENTAL;

    set hoodie.hudimor.consume.max.commits=-1;

    set hoodie.hudimor.consume.start.timestamp=20210207144611;

    select * from hudimor_rt where `_hoodie_commit_time`>'20210207144611';

    set hoodie.hudisourcetablename.consume.mode=INCREMENTAL; is used only for the incremental query on the table. To switch to another query mode, run set hoodie.hudisourcetablename.consume.mode=SNAPSHOT;.

Querying Hive External Tables Corresponding to Hudi Schema Evolution Tables

If the Hudi table is a schema evolution table (some fields in the table have been modified), you need to set set hive.exec.schema.evolution to true when Hive queries the table.

The following uses the query of the real-time view of a COW table as an example. To query other views, you need to add this parameter.

  • Real-time view query on the COW table:

    set hive.exec.schema.evolution=true;

    select * from hudicow;