Impala Application Development Suggestions

Deploy Coordinators and Executors Separately, with Two to Five Coordinators for Each Cluster Depending on the Cluster Scale

The Coordinator caches metadata, parses SQL statements and generates execution plans, and responds to client requests. It mainly uses JVM memory. The Executor reads and writes data and evaluates operators. It mainly uses off-heap memory. Deploying the two roles separately lets each node's memory be used more efficiently. In addition, all SQL execution statistics are recorded on the Coordinators, so after the split you only need to access a few Coordinators to obtain the SQL execution status of the entire cluster, which reduces O&M pressure.

Configure Exclusive Queues for Core Services and Set MEM_LIMIT and EXEC_TIME_LIMIT_S to Limit Large Queries

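Resource queues help prevent one service from taking away resources needed by another service. Within a queue, the MEM_LIMIT query option caps the memory a query can use on each node, and EXEC_TIME_LIMIT_S cancels a query that runs longer than the specified number of seconds. For details, see Enabling and Configuring a Dynamic Resource Pool for Impala.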
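The following is a minimal sketch of applying these limits to a session. The pool name core_service and the limit values are assumptions for illustration only. Choose values that match your own resource pool configuration and workload.

```sql
-- Route queries from this session to a dedicated resource pool.
-- "core_service" is a hypothetical pool name.
SET REQUEST_POOL=core_service;

-- Cap the memory each query may use on a node (example value).
SET MEM_LIMIT=2g;

-- Cancel any query that executes for more than 10 minutes (example value).
SET EXEC_TIME_LIMIT_S=600;
```

Options set this way apply only to the current session. Defaults for a whole pool are normally configured in the resource pool settings.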

Enable OBS Local Cache

Impala can cache data read from OBS on local disks, which improves read speed for data that is accessed repeatedly. For example, you can configure a 100 GB local cache on a single disk with data_cache=/srv/BigData/data1/impala:100GB.

Enable HDFS Short-Circuit Read

With HDFS short-circuit read enabled, Impala reads local data blocks directly from disk instead of going through the DataNode, which improves read speed. For details, see https://impala.apache.org/docs/build/html/topics/impala_config_performance.html.

Run INVALIDATE METADATA <table> After the Table Structure Is Changed, and Run REFRESH on Changed Tables/Partitions After Data Is Imported to the Database or Lake, to Update the Impala Metadata

If a table is created or modified by a non-Impala engine (such as Hive or Spark), you need to run the INVALIDATE METADATA <table> command on Impala to synchronize the table schema information. This command only marks the cached metadata as stale. The full metadata is reloaded the next time the table is queried. If only partitions are added or data is inserted, you can run the REFRESH command instead to update the metadata incrementally.
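As a sketch of the two cases described above, the statements below use a hypothetical table sales_db.orders partitioned by a hypothetical column dt. Substitute your own table and partition names.

```sql
-- The table was created or its schema was changed in Hive or Spark:
-- discard the cached metadata so that it is reloaded on the next query.
INVALIDATE METADATA sales_db.orders;

-- New data files were added to the existing table:
-- incrementally pick up the new files.
REFRESH sales_db.orders;

-- Only one partition changed: refresh just that partition.
REFRESH sales_db.orders PARTITION (dt='2024-01-01');
```

Always specify the table name. Running INVALIDATE METADATA without a table name invalidates the metadata of all tables, which is expensive on a large cluster.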

Run COMPUTE INCREMENTAL STATS <table_name> Periodically to Update Statistics of Frequently Used Tables for Faster Queries

Impala estimates the resources consumed by queries based on table statistics. Accurate statistics help Impala generate better execution plans and allocate resources appropriately.
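A minimal sketch of a periodic statistics update, again assuming a hypothetical partitioned table sales_db.orders with partition column dt:

```sql
-- Compute statistics only for partitions that do not have them yet.
COMPUTE INCREMENTAL STATS sales_db.orders;

-- Recompute statistics for a single partition after its data changed.
COMPUTE INCREMENTAL STATS sales_db.orders PARTITION (dt='2024-01-01');

-- Check the result: the #Rows column shows -1 where statistics are missing.
SHOW TABLE STATS sales_db.orders;
```

Incremental statistics are intended for partitioned tables. For a table without partitions, COMPUTE STATS <table_name> is the usual choice.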

Merge Small Files Periodically to Reduce the Number of Files in a Single Table and Improve the Metadata Loading Speed

The amount of Impala metadata increases as the number of partitions and files grows. Too many partitions and small files consume excessive memory, slow down metadata updates, and reduce query performance because more files must be scanned.
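One common way to merge small files is to rewrite a partition in place so that its data lands in fewer, larger files. The sketch below assumes a hypothetical table sales_db.orders with columns order_id and amount and partition column dt. It is one possible approach rather than a prescribed procedure, so verify it on a copy of the table before using it on production data.

```sql
-- Rewrite one partition from its own data.
-- The many small files are replaced by the files this statement writes.
INSERT OVERWRITE TABLE sales_db.orders PARTITION (dt='2024-01-01')
SELECT order_id, amount
FROM sales_db.orders
WHERE dt='2024-01-01';

-- Statistics for the rewritten partition are no longer valid, so recompute them.
COMPUTE INCREMENTAL STATS sales_db.orders PARTITION (dt='2024-01-01');
```

The number of output files depends on how many nodes write the result, so a small partition may still produce several files.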