Updated on 2024-10-09 GMT+08:00

Tuning Hudi Data Source Performance

This section applies to MRS 3.3.1 or later.

HetuEngine can access data sources such as Hive and Hudi at a high speed. Hudi data source optimization includes Hudi table design optimization and cluster environment optimization.

Hudi Table Tuning

You can optimize table and data structures by referring to the following suggestions:

  • Partition tables by the fields frequently used as filter conditions.
  • If there mostly are equivalent queries with primary keys or primary key subsets, use bucket indexes to create tables and use query fields as bucketing keys.
  • When querying a MOR table, periodically compact data to improve query performance. For details, see COMPACTION.

Cluster Environment Optimization

You can adjust the YARN configuration, cluster node resource configurations, metadata cache, and dynamic filter policies to optimize the system.

Example

A user stores device order information in a Hudi MOR table. The user can query the order details by order number. The number of orders per day is stable, and there are small peak hours during holidays.

  • The order number is unique. In more than 80% queries, the order number is used for equivalent queries. The SQL statement is similar to select * from table where order_id = 'id1';.
  • You can use day as the partition key since the number of orders per day is stable.
  • Historical partition updates are not frequent, and main data is updated in new partitions.

Optimization suggestions

  1. Use bucket indexes to create tables with Spark-SQL. The index key is the order ID, and the partition key is the date.
  2. Compact logs periodically to improve query performance.

The following are example SQL statements:

set hoodie.compact.inline=true;
set hoodie.schedule.compact.only.inline=true;
set hoodie.run.compact.only.inline=false;
create table hudi_mor (order_id int, comb int, col1 string, col2 string, dt int)
using hudi
partitioned by(dt)
options(type='mor', primaryKey='order_id', preCombineField='comb',
hoodie.index.type = 'BUCKET',
hoodie.bucket.index.num.buckets=100,
hoodie.bucket.index.hash.field = 'order_id')