Updated on 2024-12-11 GMT+08:00

Reading the Hudi COW Table View

  • Reading the real-time view (using Hive and SparkSQL as an example): Read the Hudi table stored in Hive directly, replacing ${table_name} with the table name.
    select count(*) from ${table_name};
  • Reading the real-time view (using the Spark DataSource API as an example): This is similar to reading a regular DataSource table.

    The query type QUERY_TYPE_OPT_KEY must be set to QUERY_TYPE_SNAPSHOT_OPT_VAL. Use ${table_name} to specify the table name.

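    import org.apache.hudi.DataSourceReadOptions._ // Provides QUERY_TYPE_OPT_KEY and the query type values.

    spark.read.format("hudi")
      .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_SNAPSHOT_OPT_VAL) // Set the query type to the real-time view.
      .load("/tmp/default/cow_bugx/") // Specify the path of the Hudi table to read.
      .createOrReplaceTempView("mycall") // Register as a Spark temporary view.
    spark.sql("select * from mycall").show(100)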
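
    The constants in the snippet above come from org.apache.hudi.DataSourceReadOptions. As a rough sketch, assuming this build uses the same option keys as open-source Apache Hudi, the same read can be expressed with plain string options. HudiQueryTypes is a hypothetical helper object introduced only for illustration.

    ```scala
    // Hypothetical helper: plain-string equivalents of the DataSourceReadOptions
    // constants, assuming the open-source Apache Hudi option keys apply.
    object HudiQueryTypes {
      val QueryTypeKey  = "hoodie.datasource.query.type"
      val Snapshot      = "snapshot"        // real-time view
      val Incremental   = "incremental"     // incremental view
      val ReadOptimized = "read_optimized"  // read-optimized view

      /** Options for a snapshot (real-time view) read. */
      def snapshotOptions: Map[String, String] = Map(QueryTypeKey -> Snapshot)
    }

    // Usage in spark-shell (needs the Hudi Spark bundle on the classpath):
    //   spark.read.format("hudi")
    //     .options(HudiQueryTypes.snapshotOptions)
    //     .load("/tmp/default/cow_bugx/")
    //     .show(100)
    ```

    Passing string keys avoids the import, at the cost of losing compile-time checking of the option names.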
  • Reading the incremental view (using Hive as an example): Replace ${table_name} with the table name.
    -- Set incremental read.
    set hoodie.${table_name}.consume.mode=INCREMENTAL;
    -- Specify the maximum number of commits to be consumed.
    set hoodie.${table_name}.consume.max.commits=3;
    -- Specify the start commit from which to pull incremental views.
    set hoodie.${table_name}.consume.start.timestamp=20201227153030;
    -- The filter on _hoodie_commit_time is required. Its value must be the start commit set above.
    select count(*) from default.${table_name} where `_hoodie_commit_time`>'20201227153030';
  • Reading the incremental view (using SparkSQL as an example): Replace ${table_name} with the table name.
    -- Set incremental read.
    set hoodie.${table_name}.consume.mode=INCREMENTAL;
    -- Specify the start commit from which to pull incremental views.
    set hoodie.${table_name}.consume.start.timestamp=20201227153030;
    -- Specify the end commit. If this parameter is not specified, the latest commit is used.
    set hoodie.${table_name}.consume.end.timestamp=20210308212318;
    -- The filter on _hoodie_commit_time is required. Its value must be the start commit set above.
    select count(*) from default.${table_name} where `_hoodie_commit_time`>'20201227153030';
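
    The session settings above follow the same pattern for every table, so they can be generated rather than typed by hand. The following is a minimal sketch in plain Scala. IncrementalSql and its statements method are hypothetical names introduced for illustration, and the timestamp check assumes commit times are 14-digit (or 17-digit, with milliseconds) numeric strings like the ones in the examples.

    ```scala
    // Hypothetical helper: renders the session settings and the query for an
    // incremental read of one table, following the pattern shown above.
    object IncrementalSql {
      // Assumption: commit times are 14 digits (yyyyMMddHHmmss), optionally
      // followed by 3 digits of milliseconds.
      private def isInstant(ts: String): Boolean = ts.matches("\\d{14}(\\d{3})?")

      def statements(
          table: String,
          startTs: String,
          endTs: Option[String] = None,
          database: String = "default"): Seq[String] = {
        require(isInstant(startTs), s"unexpected start timestamp: $startTs")
        endTs.foreach(ts => require(isInstant(ts), s"unexpected end timestamp: $ts"))

        val settings = Seq(
          s"set hoodie.$table.consume.mode=INCREMENTAL",
          s"set hoodie.$table.consume.start.timestamp=$startTs"
        ) ++ endTs.map(ts => s"set hoodie.$table.consume.end.timestamp=$ts")

        // The filter repeats the start timestamp, as the text above requires.
        val query =
          s"select count(*) from $database.$table where `_hoodie_commit_time`>'$startTs'"
        settings :+ query
      }
    }

    // Usage in spark-shell:
    //   IncrementalSql.statements("my_table", "20201227153030")
    //     .foreach(stmt => spark.sql(stmt).show())
    ```

    Generating the filter from the same value as consume.start.timestamp keeps the two from drifting apart.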
  • Reading the incremental view (using the Spark DataSource API as an example):

    QUERY_TYPE_OPT_KEY must be set to QUERY_TYPE_INCREMENTAL_OPT_VAL.

    import org.apache.hudi.DataSourceReadOptions._ // Provides the option keys and query type values.

    spark.read.format("hudi")
      .option(QUERY_TYPE_OPT_KEY, QUERY_TYPE_INCREMENTAL_OPT_VAL) // Set the query type to the incremental mode.
      .option(BEGIN_INSTANTTIME_OPT_KEY, "20210308212004") // Specify the start commit of the incremental pull.
      .option(END_INSTANTTIME_OPT_KEY, "20210308212318") // Specify the end commit of the incremental pull.
      .load("/tmp/default/cow_bugx/") // Specify the path of the Hudi table to read.
      .createOrReplaceTempView("mycall") // Register as a Spark temporary view.
    // Start the query. The statement is the same as the Hive incremental query statement.
    spark.sql("select * from mycall where `_hoodie_commit_time`>'20210308211131'")
      .show(100, false)
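
    When the end commit is optional, the read options can be assembled in one place before being passed to the reader. The following is a small sketch in plain Scala, assuming the open-source Apache Hudi string keys behind QUERY_TYPE_OPT_KEY, BEGIN_INSTANTTIME_OPT_KEY, and END_INSTANTTIME_OPT_KEY. IncrementalReadOptions is a hypothetical name used only for illustration.

    ```scala
    // Hypothetical helper: builds the options for an incremental DataSource read.
    // The string keys are assumed to match open-source Apache Hudi.
    object IncrementalReadOptions {
      def apply(
          beginInstant: String,
          endInstant: Option[String] = None): Map[String, String] = {
        val base = Map(
          "hoodie.datasource.query.type" -> "incremental",
          "hoodie.datasource.read.begin.instanttime" -> beginInstant
        )
        // In open-source Hudi, leaving out the end instant reads up to the
        // latest commit; the key is only added when a value is supplied.
        base ++ endInstant.map("hoodie.datasource.read.end.instanttime" -> _)
      }
    }

    // Usage in spark-shell (needs the Hudi Spark bundle on the classpath):
    //   spark.read.format("hudi")
    //     .options(IncrementalReadOptions("20210308212004", Some("20210308212318")))
    //     .load("/tmp/default/cow_bugx/")
    //     .show(100, false)
    ```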
  • Reading the read-optimized view: The read-optimized view of COW tables is equivalent to the real-time view.