Configuring Hive to Read a Hudi Table
Hive External Tables Corresponding to Hudi Tables
A Hudi source table corresponds to a copy of HDFS data. The Hudi table data can be mapped to a Hive external table through the Spark component, Flink component, or Hudi client. Based on the external table, Hive can perform real-time view query, read-optimized view query, and incremental view query.
- When the Hudi source table is a Copy-On-Write (COW) table, it can be mapped to a Hive external table. The table supports real-time query and incremental view query.
- When the Hudi source table is a Merge-On-Read (MOR) table, it can be mapped to two Hive external tables (RO table and RT table). The RO table supports read-optimized view query, and the RT table supports real-time view query and incremental view query.
Notes and Constraints
- You cannot perform insert, update, delete, load, merge, alter, or msck operations on the Hive external table. Only the select operation is supported.
- Granting Table Permissions
- Backup and Restoration
You only need to back up one of the RT and RO tables if necessary. This is because the RO and RT tables are mapped by the same Hudi source table. If one table is backed up, the other table is also backed up.
- Component Version
- Hive: FusionInsight_HD_xxx; Hive kernel version 3.1.0.
- Spark2x: FusionInsight_Spark2x_xxx; Hudi kernel version 0.11.0.
Creating Hive External Tables Corresponding to Hudi Tables
You can create a Hudi table in spark-shell or spark-sql and insert data into the table. For details, see Creating a Hudi Table Using Spark Shell or Performing Operations on a Hudi Table Using Spark SQL.
Generally, Hudi table data is synchronized to Hive external tables when the data is imported to the lake. In this case, you can directly query the corresponding Hive external tables in Beeline. If the data is not synchronized to the Hive external tables, you can manually synchronize the data using the Hudi client tool. For details, see Synchronizing Hudi Table Data to Hive.
Querying Hive External Tables Corresponding to Hudi Tables
Before performing incremental query on Hudi tables using Hive, you need to set three table-level parameters in Table 1. Each Hudi source table corresponds to three parameters. hudisourcetablename indicates the name of the Hudi source table (not the name of the Hive external table). You can change the name as required.
Parameter |
Description |
Default Value |
---|---|---|
hoodie.hudisourcetablename.consume.mode |
Query mode of the Hudi table.
|
None |
hoodie.hudisourcetablename.consume.start.timestamp |
Start time of the incremental query on the Hudi table.
|
None |
hoodie.hudisourcetablename.consume.max.commits |
The incremental query on the Hudi table is based on the number of commits after hoodie.hudisourcetablename.consume.start.timestamp.
|
None |
To query a Hudi table, perform the following steps:
- Log in to the node where the client is installed as the Hive client installation user.
- Go to the client installation directory, configure environment variables, and authenticate the user.
Go to the client installation directory.
cd Client installation directory
Load the environment variables.
source bigdata_env
Authenticate the user. If Kerberos authentication is disabled for the cluster (in normal mode), skip this step.
kinit Hive service user
- Log in to the Hive client.
beeline
- Query the Hudi table.
- Querying a Hudi COW Table
For example, if the name of a Hudi COW source table is hudicow, the name of the mapped Hive external table is hudicow.
- Real-time view query on the COW table:
select * from hudicow;
- Incremental query on the COW table: Set three incremental query parameters based on the name of the Hudi source table. The where clause of the incremental query statements must contain `_hoodie_commit_time`>'xxx', where xxx indicates the value of hoodie.hudisourcetablename.consume.start.timestamp.
Set the query mode of the Hudi table to INCREMENTAL.
set hoodie.hudicow.consume.mode= INCREMENTAL;
Set the number of commits for incremental queries on the Hudi table.
set hoodie.hudicow.consume.max.commits=3;
Set the start time of incremental query on the Hudi table.
set hoodie.hudicow.consume.start.timestamp= 20200427114546;
Perform an incremental query on the COW table.
select count(*) from hudicow where `_hoodie_commit_time`>'20200427114546';
- Real-time view query on the COW table:
- Querying a Hudi MOR Table
For example, if the name of the Hudi MOR source table is hudimor, the table is mapped to two Hive external tables hudimor_ro (RO table) and hudimor_rt (RT table).
- Read-optimized view query on the RO table:
select * from hudicow_ro;
- Real-time view query on the RT table:
select * from hudicow_rt;
- Incremental query on the RT table: Set three incremental query parameters based on the name of the Hudi source table. The where clause of the incremental query statements must contain `_hoodie_commit_time`>'xxx', where xxx indicates the value of hoodie.hudisourcetablename.consume.start.timestamp.
Set the query mode of the Hudi table.
- Set incremental query for the Hudi table.
set hoodie.hudimor.consume.mode=INCREMENTAL;
- Setting other query modes for the Hudi table.
set hoodie.hudisourcetablename.consume.mode=SNAPSHOT;
Set the number of commits for incremental queries on the Hudi table.
set hoodie.hudimor.consume.max.commits=-1;
Set the start time of incremental query on the Hudi table.
set hoodie.hudimor.consume.start.timestamp=20210207144611;
Query the RT table.
select * from hudimor_rt where `_hoodie_commit_time`>'20210207144611';
- Set incremental query for the Hudi table.
- Read-optimized view query on the RO table:
- Querying a Hudi COW Table
Querying Hive External Tables Corresponding to Hudi Schema Evolution Tables
If the Hudi table is a schema evolution table (the table fields have been modified), you need to set the following parameter when querying the Hudi table on the Hive client.
set hive.exec.schema.evolution=true;
If this parameter is set to true, Hive automatically adapts to schema changes of Hudi tables. This ensures to provide accurate and consistent query results based on the latest table structure.
The following illustrates the parameters for querying views. Take the real-time view of the COW table as an example.
Set hive.exec.schema.evolution to true.
set hive.exec.schema.evolution=true;
Query the Hive external table.
select * from hudicow;
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot