Updated on 2022-12-16 GMT+08:00

SQL on Hadoop

In DWS, you can directly read the structured data stored in HDFS, use the SQL query interfaces it provides, and perform complex analysis and query about the Hadoop native data using the vectorized executor.

Technical Architecture

DWS maps structured HDFS data to its foreign table. You can use the well-developed SQL compiler, vectorized executor, and SQL interfaces provided by the database to analyze and query native Hadoop data in HDFS.

DNs and HDFS data nodes can be deployed in the same physical cluster so that the CN can directly read data from the local HDFS data node, reducing the network overhead caused by remote HDFS read. DNs are also allowed to remotely access other HDFS data nodes.

The SQL compiler of DWS introduces a new component, Scheduler. If access to HDFS is required during SQL compiling, the Scheduler accesses the metadata of the HDFS NameNode to obtain the storage path of external files in HDFS. When the SQL engine of the CN is compiling query statements, it will invoke the Scheduler to obtain and specify the directory that contains the HDFS file to be accessed by each DN.

The allocation principles of the HDFS external data files (between DNs) are as follows:

  • A DN gives priority to data stored on its own physical server node when reading data.
  • Then, workload balancing between DNs is considered.
Figure 1 Logical architecture

Technical Highlights

  • Standard SQL query syntax

    The SQL query syntax inherited from DWS supports structured data analysis and query in HDFS, supports system functions such as connection, aggregation, and character date, supports subqueries and joint access to HDFS structured data and DWSS local data, and supports window functions.

  • HDFS data cost estimation model

    In addition to the cost-based SQL optimizer of DWS, the product uses the cost evaluation model for structured data access in the HDFS to formulate the best execution plan of HDFS data.

  • Intelligent scanning

    DWS can directly push predicates to the native Hadoop data stored on HDFS, filter predicates in the compressed data, and perform late materialization for the Hadoop Optimized Row Columnar (ORC) storage, reducing the amount of data to be read in the HDFS.

  • Low Level Virtual Machine (LLVM) optimization

    Predicates in native Hadoop data are scanned and optimized by LLVM. Then, intermediate representations (IRs) for their conditions are generated and further converted into machine codes. In this way, the performance of filtering and querying predicates is improved.

  • Informational constraints

    If the columns in a table are unique, you can specify information constraints for certain columns when you create this table in the database, facilitating queries during execution.

  • Vectorized executor

    DWS interconnects with its mature vectorized executor and improves analysis and query performance for structured ORC column data in the HDFS.

  • Partitioned tables

    The product is adapted to the partitioned-table data defined by the Hive syntax in the HDFS. It automatically prunes partitioned tables using the DWS SQL optimizer, improving analysis and query performance.

  • Highly efficient distributed reads of HDFS data

    The DWS SQL compiler uses the Scheduler to balance loads among DWS DNs that access HDFS data and uses the HDFS feature of short-circuit local read to improve data read performance.

    For more information, see CREATE FOREIGN TABLE (SQL on Hadoop or OBS).