Updated on 2025-04-15 GMT+08:00

Impala Application Development Rules

Specify Only One Catalog and One StoreStore When Creating a Cluster

If two Catalogs and two StateStores have been created, specify --catalog_service_host and --state_store_host for the Impalad role and --state_store_host for the Catalog role.

Impala Coordinator's JVM Memory Match or Exceed Catalog's JVM Memory

Impala stores its metadata in memory. To stay up-to-date, Impalad must periodically retrieve the entire metadata set from the Catalog. Set the JVM memory for Impala to be larger than that of Catalog to accommodate metadata storage requirements.

Create No More than 100,000 Partitions When Creating a Table to Avoid Slow Metadata Loading and Query Blocking

The amount of Impala metadata increases as the number of partitions and files grows. Too many partitions can consume excessive memory, leading to slower metadata updates and reduced query performance due to increased file scans.

Do Not Add Zeros at the Beginning of Integer Partition Keys (Partition 'hour=01') When Creating a Table

Adding prefix zeros to integer partition keys causes Impala partition parsing incorrect, affecting metadata update.

Use English Column Names and Aliases Unless Otherwise Specified

Special characters can cause issues with parsing. Impala may encounter unrecognized symbols, leading to parsing failure or an infinite loop.

Do Not Nest Views or Subqueries Containing Case When for More Than Three Levels, Avoiding Impala Memory Overflow

When a case when clause has multiple branches, complexity grows rapidly, especially with nested queries or views. Our tests show that no more than three nesting levels are allowed; otherwise, memory overflows occur. You can use temporary tables instead of views or subqueries to split a multi-nested query into multiple queries for execution.

The Partition Key Must Be Specified in the Partition Table Select * Statement

If you do not specify the partition key in select *, Impala will scan the entire table, consuming significant compute resources. Always query data by partition unless scanning the full table is unavoidable.