
Impala

Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS, HBase, or the Object Storage Service (OBS). In addition to using the same unified storage platform, Impala uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Impala query UI in Hue) as Apache Hive, providing a familiar and unified platform for real-time or batch-oriented queries. Impala complements the tools available for querying big data rather than replacing them: batch processing frameworks built on MapReduce, such as Hive, remain best suited for long-running batch jobs.

Impala provides the following features:

  • Most common SQL-92 features of Hive Query Language (HQL) including SELECT, JOIN, and aggregate functions
  • HDFS, HBase, and OBS storage, including:
    • HDFS file formats: delimited text files, Parquet, Avro, SequenceFile, and RCFile
    • Compression codecs: Snappy, GZIP, Deflate, BZIP
  • Common data access interfaces including:
    • JDBC driver (see the connection sketch after this list)
    • ODBC driver
    • Hue Beeswax and the Impala query UI
  • impala-shell command-line interface
  • Kerberos authentication
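
As a concrete illustration of the JDBC access path listed above, the following minimal sketch connects over the HiveServer2 protocol and runs a SELECT with a JOIN and an aggregate function. It assumes an unsecured cluster: the hostname impala-host and the tables orders and customers are placeholders, port 21050 is Impala's usual HiveServer2 port, and a Kerberos-enabled deployment would need different connection properties (or the dedicated Impala JDBC driver).

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    // Minimal JDBC sketch (assumptions: unsecured cluster, placeholder
    // host "impala-host" and placeholder tables "orders"/"customers").
    public class ImpalaJdbcExample {
        public static void main(String[] args) throws Exception {
            // Impala speaks the HiveServer2 protocol, so the Hive JDBC
            // driver can connect; 21050 is Impala's usual HS2 port.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(
                     "SELECT c.region, COUNT(*) AS order_count "
                   + "FROM orders o JOIN customers c ON o.customer_id = c.id "
                   + "GROUP BY c.region")) {
                // Print one line per region with its order count.
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }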

Impala applies to scenarios such as real-time data queries, offline analysis (such as log and cluster status analysis), and large-scale data mining (such as user behavior analysis, interest region analysis, and region display).

For details about Impala, visit https://impala.apache.org/impala-docs.html.

Impala consists of three roles: Impala Daemon (Impalad), Impala StateStore, and Impala Catalog Service.

Impala Daemon

The core Impala component is the Impala daemon, physically represented by the impalad process.

A few of the key functions that an Impala daemon performs are:

  • Runs on all data nodes.
  • Reads and writes to data files.
  • Accepts queries transmitted from the impala-shell command, Hue, JDBC, or ODBC.
  • Parallelizes the queries, distributes work across the cluster, and transmits intermediate query results back to the central coordinator.
  • Acts as the coordinator node for queries it receives and returns the final query results to the client.

The Impala daemons are in constant communication with the StateStore to confirm which daemons are healthy and can accept new work.

Impala StateStore

The Impala component known as the StateStore checks on the health of all Impala daemons in a cluster, and continuously relays its findings to each of those daemons. It is physically represented by a daemon process named statestored. You only need such a process on one host in a cluster. If an Impala daemon goes offline due to hardware failure, network error, software issue, or other reason, the StateStore informs all the other Impala daemons so that future queries can avoid making requests to the unreachable Impala daemon.

Impala Catalog Service

The Impala component known as the Catalog Service relays the metadata changes from Impala SQL statements to all the Impala daemons in a cluster. It is physically represented by a daemon process named catalogd. When you create a table, load data, and so on through Hive, you do need to issue REFRESH or INVALIDATE METADATA on an Impala daemon before executing a query there. The catalog service avoids the need to issue REFRESH and INVALIDATE METADATA statements when the metadata changes are performed by statements issued through Impala.
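
To make this concrete, here is a hedged sketch of issuing those statements from Java over JDBC; the connection details and the table names sales.orders and sales.new_orders are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Hedged sketch: after creating a table or loading data through Hive,
    // an Impala daemon must be told about the external metadata changes.
    // Host, port, and table names are placeholders.
    public class RefreshAfterHiveChanges {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement()) {
                // Hive appended data files to an existing table:
                stmt.execute("REFRESH sales.orders");
                // Hive created a brand-new table; load its metadata:
                stmt.execute("INVALIDATE METADATA sales.new_orders");
                // DDL/DML issued through Impala itself needs neither
                // statement, because catalogd broadcasts those changes.
            }
        }
    }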

Relationships with Other Components

  • Hadoop Distributed File System (HDFS)

    Impala uses HDFS to store files, parsing and processing structured data on top of HDFS's highly reliable underlying storage. Because Impala queries data where it resides rather than moving it, it provides fast access to data in HDFS.

  • Hive

    Impala uses Hive metadata, the Open Database Connectivity (ODBC) driver, and the same SQL syntax as Hive. Unlike Hive, which runs queries as MapReduce jobs, Impala uses a distributed architecture of daemon processes that execute all phases of a query on the nodes where the data resides. By avoiding MapReduce job startup overhead, Impala achieves lower latency than Hive.

  • Kudu

    Kudu is tightly integrated with Impala, offering an alternative to the combination of Impala with HDFS and Parquet. You can insert, query, update, and delete data in Kudu tablets using Impala's SQL syntax, or connect through JDBC or ODBC; in either case, Impala acts as a broker that connects to Kudu for data operations (see the Kudu sketch after this list).

  • HBase

    By default, Impala tables use data files stored in HDFS, which suits batch loading and queries that perform full-table scans. HBase-backed tables, by contrast, provide convenient and efficient queries for OLTP-style workloads on organized data.
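
The Kudu sketch referenced in the list above: a hedged example of row-level operations on a Kudu-backed table through Impala SQL over JDBC. The connection details and the events table are hypothetical, and depending on the cluster configuration the CREATE TABLE statement may also need TBLPROPERTIES such as the Kudu master addresses.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    // Hedged sketch: managing a Kudu-backed table through Impala SQL.
    // Host, port, and the "events" table are placeholders.
    public class KuduViaImpala {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement()) {
                // Kudu tables need a primary key and a partitioning scheme.
                stmt.execute("CREATE TABLE IF NOT EXISTS events ("
                    + "id BIGINT, payload STRING, PRIMARY KEY (id)) "
                    + "PARTITION BY HASH (id) PARTITIONS 4 STORED AS KUDU");
                // Row-level operations that plain HDFS tables do not allow:
                stmt.execute("INSERT INTO events VALUES (1, 'created')");
                stmt.execute("UPDATE events SET payload = 'x' WHERE id = 1");
                stmt.execute("UPSERT INTO events VALUES (2, 'upserted')");
                stmt.execute("DELETE FROM events WHERE id = 1");
            }
        }
    }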