Updated on 2024-11-29 GMT+08:00

Reading MOR Table Views

After an MOR table is synchronized to Hive, two additional tables appear in Hive: <table name>_rt and <table name>_ro. The table with the _rt suffix is the real-time view, and the table with the _ro suffix is the read-optimized view. For example, if a Hudi table named test is synchronized to Hive, two tables, test_rt and test_ro, are generated in Hive.

  • Reading the real-time view (using Hive and Spark SQL as an example): Directly query the table with the _rt suffix that is stored in Hive.
    select count(*) from test_rt;
  • Reading the real-time view (using the Spark DataSource API as an example): The operations are the same as those for a COW table.
  • Reading the incremental view (using Hive as an example; test in the parameter names below is the Hudi table name):
    set hive.input.format=org.apache.hudi.hadoop.hive.HoodieCombineHiveInputFormat; // This parameter is not required for Spark SQL.
    set hoodie.test.consume.mode=INCREMENTAL; // Set the query mode to incremental.
    set hoodie.test.consume.max.commits=3; // Maximum number of commits to consume.
    set hoodie.test.consume.start.timestamp=20201227153030; // Specify the start commit of the incremental pull.
    select count(*) from default.test_rt where `_hoodie_commit_time`>'20201227153030'; // Results must be filtered by start.timestamp, and also by end.timestamp if one is specified.
  • Reading the incremental view (using Spark SQL as an example):
    set hoodie.test.consume.mode=INCREMENTAL;
    set hoodie.test.consume.start.timestamp=20201227153030;  // Specify the initial incremental pull commit.
    set hoodie.test.consume.end.timestamp=20210308212318;  // Specify the end commit of the incremental pull. If this parameter is not specified, the latest commit is used.
    select count(*) from test_rt where `_hoodie_commit_time`>'20201227153030' and `_hoodie_commit_time`<='20210308212318'; // Results must be filtered by start.timestamp and end.timestamp. If end.timestamp is not specified, only start.timestamp is required for filtering.
  • Reading the incremental view (using the Spark DataSource API as an example): The operations are the same as those for a COW table.
  • Reading the read-optimized view (using Hive and Spark SQL as an example): Directly query the table with the _ro suffix that is stored in Hive.
    select count(*) from test_ro;
  • Reading the read-optimized view (using the Spark DataSource API as an example): This is similar to reading an ordinary DataSource table, except that QUERY_TYPE_OPT_KEY must be set to QUERY_TYPE_READ_OPTIMIZED_OPT_VAL.

    spark.read.format("hudi")
      .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_READ_OPTIMIZED_OPT_VAL) // Set the query type to the read-optimized view.
      .load("/tmp/default/mor_bugx/") // Specify the path of the Hudi table to read.
      .createTempView("mycall")
    spark.sql("select * from mycall").show(100)
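
    For the real-time (snapshot) read via the Spark DataSource API mentioned above, a minimal sketch follows, assuming the QUERY_TYPE_OPT_KEY and QUERY_TYPE_SNAPSHOT_OPT_VAL constants from org.apache.hudi.DataSourceReadOptions (as in the read-optimized example) and reusing the same sample table path; the view name rt_view is illustrative:

    ```scala
    spark.read.format("hudi")
      .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_SNAPSHOT_OPT_VAL) // Snapshot query, i.e. the real-time view of the MOR table.
      .load("/tmp/default/mor_bugx/") // Path of the Hudi table to read.
      .createTempView("rt_view")
    spark.sql("select count(*) from rt_view").show()
    ```

    Unlike the Hive case, no _rt/_ro table name suffix is involved here: the query type option alone selects which view of the MOR table is read.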
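
    Similarly, the incremental read via the Spark DataSource API can be sketched as follows, assuming the QUERY_TYPE_INCREMENTAL_OPT_VAL, BEGIN_INSTANTTIME_OPT_KEY, and END_INSTANTTIME_OPT_KEY constants from org.apache.hudi.DataSourceReadOptions and reusing the commit timestamps and table path from the examples above; the view name inc_view is illustrative:

    ```scala
    spark.read.format("hudi")
      .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL) // Incremental query.
      .option(BEGIN_INSTANTTIME_OPT_KEY, "20201227153030") // Start commit of the incremental pull (exclusive).
      .option(END_INSTANTTIME_OPT_KEY, "20210308212318") // End commit; if omitted, the latest commit is used.
      .load("/tmp/default/mor_bugx/") // Path of the Hudi table to read.
      .createTempView("inc_view")
    spark.sql("select count(*) from inc_view").show()
    ```

    With the DataSource API the begin/end instant options perform the commit-time filtering, so no `_hoodie_commit_time` predicate is needed in the query itself.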