Updated on 2024-12-13 GMT+08:00

Interconnecting Doris with the Hudi Data Source

When MRS cluster metadata is stored in Hive MetaStore, Doris 2.0.13 can connect to data sources of Hudi 0.15.0 using Catalog and supports all data field types of Hudi.

Hudi Table Types Can Be Queried by Doris

The supported Hudi table types and corresponding query types are as follows:

  • COW table: Snapshot Query and TimeTravel Query
  • MOR table: Snapshot Query, TimeTravel Query, and Read Optimized Query

The following uses TimeTravel Query as an example to describe how to query a Hudi COW table.

  1. Log in to the node where MySQL is installed and connect the Doris database.

    If Kerberos authentication is enabled for the cluster (the cluster is in security mode), run the following command to connect to the Doris database:

    export LIBMYSQL_ENABLE_CLEARTEXT_PLUGIN=1

    mysql -uDatabase login user -p -PConnection port for FE queries -hIP address of the Doris FE instance

    Enter the password for logging in to the database.

    • To obtain the query connection port of the Doris FE instance, you can log in to FusionInsight Manager, choose Cluster > Services > Doris > Configurations, and query the value of query_port of the Doris service.
    • To obtain the IP address of the Doris FE instance, log in to FusionInsight Manager of the MRS cluster and choose Cluster > Services > Doris > Instances to view the IP address of any FE instance.
    • You can also use the MySQL connection software or Doris web UI to connect the database.

  2. Create a Hudi Catalog, for example, hudi_catalog. For details about how to create, see 3.
  3. Connect the MySQL client to Doris by referring to 1 and run the following command to switch to the Hudi Catalog created in 2:

    switch hudi_catalog;

  4. Run the following command to view the created Hudi table:

    refresh catalog hudi_catalog;

    use Database name;

    show tables;

  5. Use TimeTravel Query to view the table data inserted before a certain time:

    select * from hudi_cow for time as of '20240409162842365' where id = 1;

Doris on Hudi Query Acceleration

Doris on Hudi supports the following query acceleration:

  • Bucket Shuffle Joins on Hudi Parquet Table

    Bucket Shuffle Join is available in Doris to accelerate the queries on Hudi data sources. This operation is supported for all Hudi field types and Hudi tables stored on OBS. The session-level variable enable_hudi_bucket_shuffle controls whether to enable this function, which is defaulted to disabled. You can connect to Doris and run the set enable_hudi_bucket_shuffle=true command to enable this function.

    • The inner join condition contains bucketing columns of two tables. When the bucketing columns of the left table are inner join conditions, there is a high probability that bucket shuffle join is planned.
    • The type of the bucketing column in the left table must be the same as that of the inner join column in the right table.
    • The bucket shuffle join function is available only in inner joins.
    • The bucket shuffle join function is available only when the left table has a single partition.
    • Two tables to be joined, one can be a Hudi table, and the other can be a Doris internal table.
  • Hudi Implicit Partitioning (Restricted Feature)

    Doris supports Hudi's implicit partitioning for query acceleration. You can use the enable_hudi_hidden_partition parameter to enable or disable this function (which is disabled by default). To enable it, run the ADMIN SET FRONTEND CONFIG ("enable_hudi_hidden_partition" = "true"); command after connecting to Doris.

  • Hudi Table Bucket Pruning (Restricted Feature)

    Doris supports Hudi bucket pruning for faster queries. You can use enable_hudi_bucket_prunning parameter to enable or disable this function (which is disabled by default). To enable it, run the ADMIN SET FRONTEND CONFIG ("enable_hudi_bucket_prunning" = "true"); command after connecting to Doris.

  1. Log in to the node where MySQL is installed and connect the Doris database.

    If Kerberos authentication is enabled for the cluster (the cluster is in security mode), run the following command to connect to the Doris database:

    export LIBMYSQL_ENABLE_CLEARTEXT_PLUGIN=1

    mysql -uDatabase login user -p -PConnection port for FE queries -hIP address of the Doris FE instance

    Enter the password for logging in to the database.

    • To obtain the query connection port of the Doris FE instance, you can log in to FusionInsight Manager, choose Cluster > Services > Doris > Configurations, and query the value of query_port of the Doris service.
    • To obtain the IP address of the Doris FE instance, log in to FusionInsight Manager of the MRS cluster and choose Cluster > Services > Doris > Instances to view the IP address of any FE instance.
    • You can also use the MySQL connection software or Doris web UI to connect the database.

  2. Create a Hudi Catalog, for example, hudi_catalog. For details about how to create, see 3.
  3. Connect to Doris on the MySQL client by referring to 1 and run the following command to switch to Hudi Catalog:

    switch hudi_catalog;

  4. Run the following commands to view the created Hudi table:

    refresh catalog hudi_catalog;

    use Database name;

    show tables;

  5. Enable Doris on Hudi query acceleration and perform a query.

    • Enable Bucket Shuffle Join and perform a query.

      Only Hudi bucket index tables are supported. That is, the Hudi table to be queried has the hoodie.index.type='BUCKET', hoodie.bucket.index.num.buckets='xxx' and hoodie.bucket.index.hash.field='xxx' properties.

      set enable_hudi_bucket_shuffle = true;

      select t1.name, t2.age from hudi_testt1 t1, hudi_testt2 t2 where t1.id=t2.id;

      Check whether the Bucket Shuffle Join is hit.

      explain select t1.name, t2.age from hudi_testt1 t1, hudi_testt2 t2 where t1.id=t2.id;

      Check whether the keyword BUCKET_SHUFFLE exists in the command output. If it exists, the query hits the Bucket Shuffle Join.

    • Enable implicit partitioning and query a Hudi table.

      Only implicitly partitioned Hudi tables are supported. That is, Hudi table to be queried has the hoodie.hidden.partitioning.rule = 'xxx' and hoodie.hidden.partitioning.enabled = 'true' properties.

      ADMIN SET FRONTEND CONFIG ("enable_hudi_hidden_partition" = "true");

      select * from hudi1 where col0 like '%2';

    • Enable the bucket pruning function and query a Hudi table.

      Only Hudi simple bucket index tables are supported. That is, the Hudi table to be queried has the hoodie.index.type = 'BUCKET', hoodie.bucket.index.num.buckets = 'xxx', hoodie.bucket.index.hash.field = 'xxx', hoodie.index.bucket.engine = 'SIMPLE', and hoodie.metadata.index.column.stats.enable = 'true' properties.

      ADMIN SET FRONTEND CONFIG ("enable_hudi_bucket_prunning" = "true");

      select * from hudi_cow_tbl where id > 1 and id < 3;