SQL Execution Plan
A SQL execution plan is a node tree that displays the detailed steps performed when DataArts Fabric SQL executes a SQL statement.
You can run EXPLAIN to view the specific execution plan generated by the optimizer for each query. EXPLAIN outputs a line for each execution node, displaying the basic node type and the estimated cost value calculated by the optimizer for executing this node.
Execution Plan Display Information
In addition to supporting different display formats for the execution plan, different EXPLAIN variants show execution plan information at different levels of detail. Common ones include the following; for more usage, see EXPLAIN.
- EXPLAIN statement: only generates the execution plan without actual execution. Here, statement indicates the SQL statement.
- EXPLAIN ANALYZE statement: generates the execution plan, performs execution, and displays summary information of the execution. Actual runtime statistics are included in the display, such as the total time spent inside each planning node (in milliseconds) and the actual number of rows it returns.
- EXPLAIN PERFORMANCE statement: generates the execution plan, performs execution, and displays all information during the execution.
To measure the runtime cost of each node in the execution plan, EXPLAIN ANALYZE or EXPLAIN PERFORMANCE adds the overhead of performance analysis to the current query execution. Running EXPLAIN ANALYZE or EXPLAIN PERFORMANCE on a query may take significantly more time than a regular query. The extent of the excess depends on the nature of the query and the platform used.
Therefore, when troubleshooting slow SQL execution issues, if the SQL runs for a long time without completion, you are advised to use the EXPLAIN command to view the execution plan for preliminary diagnosis. If the SQL can be executed successfully, you are advised to run EXPLAIN ANALYZE or EXPLAIN PERFORMANCE to examine the execution plan and its actual runtime information for more precise identification of the issue's cause.
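As a hedged illustration of the three variants, assuming a table t1 with columns id and val exists (the table and column names here are hypothetical, and the exact output layout depends on the product version):

```sql
-- Hypothetical illustrative table.
CREATE TABLE t1(id int, val int) store as orc;

-- Generate the plan only; safe even for statements that run for a long time.
EXPLAIN SELECT * FROM t1 WHERE val > 10;

-- Execute the statement and additionally report actual row counts and times.
EXPLAIN ANALYZE SELECT * FROM t1 WHERE val > 10;

-- Execute and report the full set of runtime details, including per-DN information.
EXPLAIN PERFORMANCE SELECT * FROM t1 WHERE val > 10;
```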
Description of common keywords in execution plans:
- Table access methods
- Table join methods
- Nested Loop
Nested loops are suitable for queries that join small data subsets. In a nested loop, the outer table drives the inner table: each row returned by the outer table is looked up in the inner table to find its matching rows. Hence, the result set returned by the entire query should not be too large (no more than about 10,000 rows), the table returning the smaller subset should serve as the outer table, and indexes should preferably exist on the join columns of the inner table.
- (Sonic) Hash Join
Hash joins are appropriate for joining large datasets. The optimizer uses the smaller of the two tables to build an in-memory hash table on the join keys, then scans the larger table and probes the hash table to find matching rows.
- Operators
- sort
- filter
The EXPLAIN output shows the WHERE clause attached as a Filter condition to the sequential scan node. This means the node checks the condition for each scanned row and outputs only the rows that satisfy it. Because of the WHERE clause the expected number of output rows decreases, but the scan must still visit all rows (10,000 in this example), so the cost is not reduced. Instead, it increases slightly (by 10,000 x cpu_operator_cost) to account for the extra CPU time spent checking the WHERE condition.
- LIMIT
Limits the number of output records in the execution results. Adding LIMIT ensures not all rows are retrieved.
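For instance, a hedged sketch of how LIMIT appears in a plan (the table name t1 is hypothetical):

```sql
-- With LIMIT, the executor can stop producing rows early;
-- a Limit node appears above the scan in the plan.
EXPLAIN SELECT * FROM t1 WHERE val > 10 LIMIT 100;
```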
Execution Plan Display Format
Stream Plan
In DataArts Fabric SQL, the Stream plan is used to execute queries.
The Coordinator Node (CN) generates a plan from the original statement and delivers it to DataNodes (DNs) for execution. During execution on each DN, the Stream operator is used for data exchange.
The existing tables tt01 and tt02 are defined as follows:
```sql
CREATE TABLE tt01(c1 int, c2 int) store as orc;
CREATE TABLE tt02(c1 int, c2 int) store as orc;
```
When the two tables are joined and the join condition contains non-distribution columns, data must be exchanged between DNs. In this case, the base table tt02 is scanned on each DN. After the scan, the Redistribute Stream operator hashes rows on tt02.c1 from the join condition and sends them to the corresponding DNs. The join is then performed on each DN, and finally the results are gathered on the CN.
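Using the tables above, this data exchange can be observed by explaining the join. The exact operator names and plan layout depend on the product version, so the comment below is only a sketch of the expected shape:

```sql
-- Join on c1; if c1 is not the distribution column of tt02,
-- expect a Streaming (type: REDISTRIBUTE) operator below the join
-- and a Streaming (type: GATHER) at the top collecting rows to the CN.
EXPLAIN SELECT * FROM tt01, tt02 WHERE tt01.c1 = tt02.c1;
```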
EXPLAIN PERFORMANCE Description
During SQL optimization, you need to run EXPLAIN ANALYZE or EXPLAIN PERFORMANCE to view the actual execution information of SQL statements and compare the actual execution information with the optimizer's estimation to provide a basis for optimization. Compared with EXPLAIN ANALYZE, EXPLAIN PERFORMANCE adds execution information on each DN.
The SQL query statement in the preceding part is used as an example.
```sql
SELECT * FROM tt01, tt02 WHERE tt01.c1 = tt02.c1;
```
The output of EXPLAIN PERFORMANCE consists of the following parts:
- Execution plan
The plan is displayed in a table, including 11 columns: id, operation, A-time, A-rows, E-rows, E-distinct, Peak Memory, E-memory, A-width, E-width, and E-costs. Table 1 describes the columns.
Table 1 Execution plan column descriptions
- id: ID of an execution operator node.
- operation: Name of an execution operator. An operator with the Vector prefix is a vectorized execution engine operator, generally used in queries that involve a column-store table. Streaming is a special operator that implements the core data shuffle function of the distributed architecture. Streaming has three types, corresponding to different data shuffle functions:
  - Streaming (type: GATHER): the CN collects data from DNs.
  - Streaming (type: REDISTRIBUTE): redistributes data to all DNs based on the selected columns.
  - Streaming (type: BROADCAST): broadcasts the data of the current DN to all other DNs.
- A-time: Execution time of an operator on each DN. Generally, A-time is shown as two values enclosed in square brackets ([]), indicating the shortest and longest time to complete the operator across all DNs, including the execution time of lower-layer operators. Note: in the entire plan, the execution time of a leaf node is the execution time of the operator itself, while the execution time of any other operator includes the execution time of its child nodes.
- A-rows: Total number of rows output globally by the operator.
- E-rows: Number of output rows estimated for each operator.
- E-distinct: Estimated number of distinct values for the hash join operator.
- Peak Memory: Peak memory used by the operator on each DN. The left value in [] is the minimum and the right value is the maximum.
- E-memory: Estimated memory usage of each operator on a DN; displayed only for operators executed on DNs. In certain scenarios, the upper limit to which memory can automatically expand when memory resources are sufficient is shown in parentheses after the estimate.
- A-width: Actual width of each tuple output by the current operator. It is displayed only for memory-intensive operators, including (Vec)HashJoin, (Vec)HashAgg, (Vec)HashSetOp, (Vec)Sort, and (Vec)Materialize. The width calculated by (Vec)HashJoin is the width of the right-subtree operator and is displayed on the right subtree.
- E-width: Estimated width of the output tuple of each operator.
- E-costs: Estimated execution cost of each operator.
  - E-costs is measured in units defined by the optimizer's cost parameters. Conventionally, fetching one disk page counts as one unit, and the other cost parameters are set relative to it.
  - The cost (E-costs value) of each node includes the cost of all its child nodes.
  - The cost reflects only what the optimizer cares about and does not include the time to transfer result rows to the client. Although that time may be a significant part of the actual total time, the optimizer ignores it because no plan change can reduce it.
- Predicate Information (identified by plan id)
Predicate filtering displays the filter criteria of the corresponding execution operator node, that is, information that does not change during plan execution, mainly join conditions and filter conditions. For a partitioned table, partition pruning information is also displayed.
- Memory Information (identified by plan id)
This part prints the memory usage of the operators across the entire plan, including the peak memory of each operator, the optimizer's estimated memory, and, for the Hash and Sort operators, control memory, estimated memory usage (operator memory), actual width (width), number of automatic memory expansions (auto spread num), whether early spill occurred, and spill details: spill time (s), number of inner/outer partition spills (inner/outer partition spill num), number of spill files (temp file num), and the minimum and maximum amounts of data spilled to disk (written disk I/O [min, max]). The Sort operator does not display the number of spill files; it only shows Disk as the sorting method.
- Targetlist Information (identified by plan id)
This part displays the output target column information of each operator.
- DataNode Information (identified by plan id)
This part prints the execution time of each operator (including the time spent on filtering and projection, if any), CPU usage, and buffer usage.
- Operator execution information
The execution information of each operator consists of three parts:
- executor0_es_group/executor1_es_group indicates the nodes on which the operator is executed. The information in parentheses is the actual execution information.
- actual time indicates the actual execution time. The first number is the time from entering the current operator to outputting the first row, and the second is the total execution time over all rows.
- rows indicates the number of rows output by the current operator.
- loops indicates the number of times the current operator was executed. Note that for a partitioned table, scanning each partition counts as one full scan, and switching to the next partition starts a new scan.
- CPU information
Each operator's execution process includes CPU information: cyc indicates the number of CPU cycles; ex cyc indicates the cycles of the current operator excluding its child nodes; inc cyc indicates the cycles including child nodes; ex row indicates the number of rows output by the current operator; and ex c/r indicates the average number of cycles per row, calculated as ex cyc divided by ex row.
- Buffer information
buffers displays buffer information, including reads and writes on shared blocks and temporary blocks.
Shared blocks include tables and indexes, while temporary blocks are disk blocks used in sorting and materialization. The block data displayed at the higher-level node includes the number of blocks used by all its child nodes.
For operators that have been spilled, the buffer information shows the amount of data spilled. temp read indicates the number of times temporary data was read, write indicates the number of times temporary data was written to disk, and written_size indicates the amount of data spilled.
When the GUC parameter enable_spill_to_remote_storage for the spill-to-OBS feature is turned off, data is spilled to the DN instance directory.
When the GUC parameter enable_spill_to_remote_storage is turned on, additional spill statistics are displayed: written_disk_size indicates the amount of data spilled to the disk cache; written_obs_size indicates the amount of data spilled directly to OBS when disk cache space is insufficient; spill_obs_size indicates the amount of data written back from the disk cache to OBS when disk cache space is insufficient.
- Disk cache information
Disk Cache: indicates the hit information and data read information of the disk cache.
miss indicates the number of disk cache misses, and hit the number of disk cache hits. disk_cache_error_code and error indicate the error code and the number of errors. scanBytes indicates the amount of data requested by the scan, and remoteReadBytes the amount of data read from OBS. loadTime indicates the time for loading data from the disk cache, openTime the time for opening the disk cache file, and preadTime the time for reading data from disk. To improve OBS access efficiency, adjacent request blocks are combined, and the minimum granularity for writing data to the disk cache is one block (1 MB by default); as a result, scanBytes may be less than remoteReadBytes.
ReadAhead: indicates the information about data prefetch.
parseMetaTime indicates the time spent parsing file metadata during prefetch. submitBytes indicates the amount of data prefetched, and submitTime the time spent submitting prefetch requests. waitTime indicates the time spent waiting for prefetch requests to complete during data reading, and waitCount the number of prefetch requests that reads had to wait for. cancelCount indicates the number of prefetch requests that were hit during reading but had not yet been executed, and hitCount the number of prefetch requests hit during reading. fabricCacheHitCount is reported at two levels: L1Cache indicates the number of hits in the L1 data cache, and L2Cache the number of hits in the L2 data cache.
OBS I/O: details about an OBS I/O request.
count indicates the total number of OBS I/O requests. averageRTT indicates the average round trip time (RTT) of OBS I/O requests. The unit is μs. averageLatency indicates the average latency of OBS I/O requests. The unit is μs. latencyGt1s indicates the number of OBS I/O requests whose latency exceeds 1s. latencyGt10s indicates the number of OBS I/O requests whose latency exceeds 10s. retryCount indicates the total number of OBS I/O request retries. rateLimitCount indicates the total number of OBS I/O requests under flow control.
- Query Summary
This part displays statement-level execution information, including the execution time in each phase, maximum and minimum execution time in the initialization and end phases on each DN, initialization, execution, and end time on the CN, available system memory, estimated statement memory, and DOP during statement execution.
- DN execution information
- DataNode executor start time: DN executor start time. The format is [min_node_name, max_node_name]: [min_time, max_time].
- DataNode executor run time: DN executor running duration. The format is [min_node_name, max_node_name]: [min_time, max_time].
- DataNode executor end time: DN executor end time. The format is [min_node_name, max_node_name]: [min_time, max_time].
- Remote query poll time: poll waiting time for receiving results.
- Memory estimation information
- System available mem: available system memory.
- Query Max mem: maximum query memory.
- Query estimated mem: estimated statement memory.
- SMP adaptive concurrency information (displayed only when SMP adaptation is enabled)
- Initial DOP: DOP of the initial plan generated.
- Avail max core: number of vCPUs available for statement execution.
- Final Max DOP: maximum DOP of operators in the plan.
- CN execution time
- Coordinator executor start time: CN executor start time.
- Coordinator executor run time: CN executor running duration.
- Coordinator executor end time: CN executor end time.
- Parser runtime: parser running duration.
- Planner runtime: optimizer execution duration.
- Query Id: query ID.
- Total runtime: total execution duration.

- The difference between A-rows and E-rows reflects the deviation between the optimizer's estimation and actual execution. Generally, the larger the deviation, the more unreliable the plan generated by the optimizer is, and the more necessary manual intervention becomes.
- The larger the discrepancy in the A-time values, the greater the computational skew (difference in execution times across different DNs) for this operator, increasing the necessity for manual tuning. Typically, for two adjacent operators, the execution time of the higher-level operator includes that of the lower-level one. However, if the higher-level operator is a stream operator, its execution time might be shorter than that of the lower-level operator due to the absence of thread dependencies, meaning there is no inclusive relationship.
- Max Query Peak Memory is frequently used to estimate the memory consumption of SQL statements and serves as a crucial reference for setting runtime memory parameters during SQL tuning. Usually, the output from EXPLAIN ANALYZE or EXPLAIN PERFORMANCE is used as input for further optimization.