Tuning Hudi Data Source Performance
HetuEngine can access data sources such as Hive and Hudi at high speed. Optimizing the Hudi data source covers both Hudi table design and cluster environment optimization.
Hudi Table Tuning
You can optimize table and data structures by referring to the following suggestions:
- Partition tables by the fields frequently used as filter conditions.
- If most queries are equality queries on the primary key or a subset of the primary key, create tables with bucket indexes and use the query fields as the bucketing keys.
- When querying a MOR table, periodically compact data to improve query performance. For details, see Compaction.
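For reference, compaction can be scheduled and executed through Hudi's Spark SQL extensions. The following is a minimal sketch, assuming those extensions are enabled; the table name hudi_mor_table is a hypothetical example:
-- Generate a compaction plan for the MOR table (table name is an assumed example)
schedule compaction on hudi_mor_table;
-- List pending compaction plans
show compaction on hudi_mor_table;
-- Execute the pending compaction plan
run compaction on hudi_mor_table;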
Cluster Environment Optimization
You can adjust the YARN configuration, cluster node resource configurations, metadata cache, and dynamic filter policies to optimize the system.
- For details about how to adjust the YARN configuration, see Adjusting the YARN Service Configuration.
- To adjust cluster node resource configurations, see Adjusting Cluster Node Resource Configurations.
- To adjust the metadata cache configuration, see Adjusting Metadata Cache.
- To adjust the dynamic filter policy, see Enabling Dynamic Filtering.
Example
A user stores device order information in a Hudi MOR table and queries order details by order number. The number of orders per day is stable, with small peaks during holidays.
- The order number is unique, and more than 80% of queries are equality queries on the order number. The SQL statement is similar to select * from table where order_id = 'id1';.
- You can use day as the partition key since the number of orders per day is stable.
- Historical partition updates are not frequent, and main data is updated in new partitions.
Optimization suggestions
- Use Spark SQL to create the table with a bucket index. The index key is the order ID, and the partition key is the date.
- Compact logs periodically to improve query performance.
The following are example SQL statements:
set hoodie.compact.inline=true;
set hoodie.schedule.compact.only.inline=true;
set hoodie.run.compact.only.inline=false;
create table hudi_mor (order_id int, comb int, col1 string, col2 string, dt int)
using hudi
partitioned by (dt)
options (
  type = 'mor',
  primaryKey = 'order_id',
  preCombineField = 'comb',
  hoodie.index.type = 'BUCKET',
  hoodie.bucket.index.num.buckets = 100,
  hoodie.bucket.index.hash.field = 'order_id'
);
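After the table is created, an equality query on the order ID within a day partition can benefit from the bucket index. The following is a minimal usage sketch; the partition value and order ID are hypothetical:
-- Hypothetical partition value and order ID, for illustration only
select * from hudi_mor where dt = 20230101 and order_id = 1000;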