SQL Execution Plan
A SQL execution plan is a node tree that displays the detailed steps performed when DataArts Fabric SQL executes a SQL statement.
You can run EXPLAIN to view the specific execution plan generated by the optimizer for each query. EXPLAIN outputs a line for each execution node, displaying the basic node type and the estimated cost value calculated by the optimizer for executing this node.
Execution Plan Display Information
In addition to supporting different display formats for the execution plan, different EXPLAIN variants show execution plan information at different levels of detail. Common ones include the following; for more usage, see EXPLAIN.
- EXPLAIN statement: only generates the execution plan without actual execution. Here, statement indicates the SQL statement.
- EXPLAIN ANALYZE statement: generates the execution plan, performs execution, and displays summary information of the execution. Actual runtime statistics are included in the display, such as the total time spent inside each planning node (in milliseconds) and the actual number of rows it returns.
- EXPLAIN PERFORMANCE statement: generates the execution plan, performs execution, and displays all information during the execution.
To measure the runtime cost of each node in the execution plan, EXPLAIN ANALYZE or EXPLAIN PERFORMANCE adds the overhead of performance analysis to the current query execution. Running EXPLAIN ANALYZE or EXPLAIN PERFORMANCE on a query may take significantly more time than a regular query. The extent of the excess depends on the nature of the query and the platform used.
Therefore, when troubleshooting slow SQL execution issues, if the SQL runs for a long time without completion, you are advised to use the EXPLAIN command to view the execution plan for preliminary diagnosis. If the SQL can be executed successfully, you are advised to run EXPLAIN ANALYZE or EXPLAIN PERFORMANCE to examine the execution plan and its actual runtime information for more precise identification of the issue's cause.
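As a hedged illustration of the three variants, assuming a table t1 with columns id and val exists (the table and column names here are hypothetical, and the exact output layout depends on the product version):

```sql
-- Hypothetical illustrative table.
CREATE TABLE t1(id int, val int) store as orc;

-- Generate the plan only; safe even for statements that run for a long time.
EXPLAIN SELECT * FROM t1 WHERE val > 10;

-- Execute the statement and additionally report actual row counts and times.
EXPLAIN ANALYZE SELECT * FROM t1 WHERE val > 10;

-- Execute and report the full set of runtime details, including per-DN information.
EXPLAIN PERFORMANCE SELECT * FROM t1 WHERE val > 10;
```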
Description of common keywords in execution plans:
- Table access methods
- Table join methods
- Nested Loop
Nested loops are suitable for queries that join small data subsets. In a nested loop, the outer table drives the inner table: each row returned by the outer table is looked up in the inner table to find its matching rows. Hence, the result set returned by the entire query should not be too large (no more than about 10,000 rows), the table returning the smaller subset should serve as the outer table, and indexes should preferably exist on the join columns of the inner table.
- (Sonic) Hash Join
Hash joins are appropriate for joining large datasets. The optimizer uses the smaller of the two tables to build an in-memory hash table on the join keys, then scans the larger table and probes the hash table to find matching rows.
- Operators
- sort
- filter
The EXPLAIN output shows the WHERE clause attached as a Filter condition to the sequential scan node. This means the node checks the condition for each scanned row and outputs only the rows that satisfy it. Because of the WHERE clause the expected number of output rows decreases, but the scan must still visit all rows (10,000 in this example), so the cost is not reduced. Instead, it increases slightly (by 10,000 x cpu_operator_cost) to account for the extra CPU time spent checking the WHERE condition.
- LIMIT
Limits the number of output records in the execution results. Adding LIMIT ensures not all rows are retrieved.
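For instance, a hedged sketch of how LIMIT appears in a plan (the table name t1 is hypothetical):

```sql
-- With LIMIT, the executor can stop producing rows early;
-- a Limit node appears above the scan in the plan.
EXPLAIN SELECT * FROM t1 WHERE val > 10 LIMIT 100;
```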
Execution Plan Display Format
Stream Plan
In DataArts Fabric SQL, the Stream plan is used to execute queries.
The Coordinator Node (CN) generates a plan from the original statement and delivers it to DataNodes (DNs) for execution. During execution on each DN, the Stream operator is used for data exchange.
The existing tables tt01 and tt02 are defined as follows:
```sql
CREATE TABLE tt01(c1 int, c2 int) store as orc;
CREATE TABLE tt02(c1 int, c2 int) store as orc;
```
When the two tables are joined and the join condition contains non-distribution columns, data must be exchanged between DNs. In this case, the base table tt02 is scanned on each DN. After the scan, the Redistribute Stream operator hashes rows on tt02.c1 from the join condition and sends them to the corresponding DNs. The join is then performed on each DN, and finally the results are gathered on the CN.
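Using the tables above, this data exchange can be observed by explaining the join. The exact operator names and plan layout depend on the product version, so the comment below is only a sketch of the expected shape:

```sql
-- Join on c1; if c1 is not the distribution column of tt02,
-- expect a Streaming (type: REDISTRIBUTE) operator below the join
-- and a Streaming (type: GATHER) at the top collecting rows to the CN.
EXPLAIN SELECT * FROM tt01, tt02 WHERE tt01.c1 = tt02.c1;
```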
EXPLAIN PERFORMANCE Description
During SQL optimization, you need to run EXPLAIN ANALYZE or EXPLAIN PERFORMANCE to view the actual execution information of SQL statements and compare the actual execution information with the optimizer's estimation to provide a basis for optimization. Compared with EXPLAIN ANALYZE, EXPLAIN PERFORMANCE adds execution information on each DN.
The SQL query statement in the preceding part is used as an example.
```sql
SELECT * FROM tt01, tt02 WHERE tt01.c1 = tt02.c1;
```
The output of EXPLAIN PERFORMANCE consists of the following parts:
- Execution plan
The plan is displayed in a table, including 11 columns: id, operation, A-time, A-rows, E-rows, E-distinct, Peak Memory, E-memory, A-width, E-width, and E-costs. Table 1 describes the columns.
Table 1 Execution plan column descriptions
- id: ID of an execution operator node.
- operation: Name of an execution operator. An operator with the Vector prefix is a vectorized execution engine operator, generally used in queries that involve a column-store table. Streaming is a special operator that implements the core data shuffle function of the distributed architecture. Streaming has three types, corresponding to different data shuffle functions:
  - Streaming (type: GATHER): the CN collects data from DNs.
  - Streaming (type: REDISTRIBUTE): redistributes data to all DNs based on the selected columns.
  - Streaming (type: BROADCAST): broadcasts the data of the current DN to all other DNs.
- A-time: Execution time of an operator on each DN. Generally, A-time is shown as two values enclosed in square brackets ([]), indicating the shortest and longest time to complete the operator across all DNs, including the execution time of lower-layer operators. Note: in the entire plan, the execution time of a leaf node is the execution time of the operator itself, while the execution time of any other operator includes the execution time of its child nodes.
- A-rows: Total number of rows output globally by the operator.
- E-rows: Number of output rows estimated for each operator.
- E-distinct: Estimated number of distinct values for the hash join operator.
- Peak Memory: Peak memory used by the operator on each DN. The left value in [] is the minimum and the right value is the maximum.
- E-memory: Estimated memory usage of each operator on a DN; displayed only for operators executed on DNs. In certain scenarios, the upper limit to which memory can automatically expand when memory resources are sufficient is shown in parentheses after the estimate.
- A-width: Actual width of each tuple output by the current operator. It is displayed only for memory-intensive operators, including (Vec)HashJoin, (Vec)HashAgg, (Vec)HashSetOp, (Vec)Sort, and (Vec)Materialize. The width calculated by (Vec)HashJoin is the width of the right-subtree operator and is displayed on the right subtree.
- E-width: Estimated width of the output tuple of each operator.
- E-costs: Estimated execution cost of each operator.
  - E-costs is measured in units defined by the optimizer's cost parameters. Conventionally, fetching one disk page counts as one unit, and the other cost parameters are set relative to it.
  - The cost (E-costs value) of each node includes the cost of all its child nodes.
  - The cost reflects only what the optimizer cares about and does not include the time to transfer result rows to the client. Although that time may be a significant part of the actual total time, the optimizer ignores it because no plan change can reduce it.
- Predicate Information (identified by plan id)
Predicate filtering displays the filter criteria of the corresponding execution operator node, that is, information that does not change during plan execution, mainly join conditions and filter conditions. For a partitioned table, partition pruning information is also displayed.
- Memory Information (identified by plan id)
This part prints the memory usage of the operators across the entire plan, including the peak memory of each operator, the optimizer's estimated memory, and, for the Hash and Sort operators, control memory, estimated memory usage (operator memory), actual width (width), number of automatic memory expansions (auto spread num), whether early spill occurred, and spill details: spill time (s), number of inner/outer partition spills (inner/outer partition spill num), number of spill files (temp file num), and the minimum and maximum amounts of data spilled to disk (written disk I/O [min, max]). The Sort operator does not display the number of spill files; it only shows Disk as the sorting method.
- Targetlist Information (identified by plan id)
This part displays the output target column information of each operator.
- DataNode Information (identified by plan id)
This part prints the execution time of each operator (including the time spent on filtering and projection, if any), CPU usage, and buffer usage.
- Operator execution information
The execution information of each operator consists of three parts:
- executor0_es_group/executor1_es_group indicates the nodes on which the operator is executed. The information in parentheses is the actual execution information.
- actual time indicates the actual execution time. The first number is the time from entering the current operator to outputting the first row, and the second is the total execution time over all rows.
- rows indicates the number of rows output by the current operator.
- loops indicates the number of times the current operator was executed. Note that for a partitioned table, scanning each partition counts as one full scan, and switching to the next partition starts a new scan.
- CPU information
Each operator's execution process includes CPU information: cyc indicates the number of CPU cycles; ex cyc indicates the cycles of the current operator excluding its child nodes; inc cyc indicates the cycles including child nodes; ex row indicates the number of rows output by the current operator; and ex c/r indicates the average number of cycles per row, calculated as ex cyc divided by ex row.
- Buffer information
buffers displays buffer information, including reads and writes on shared blocks and temporary blocks.
Shared blocks include tables and indexes, while temporary blocks are disk blocks used in sorting and materialization. The block data displayed at the higher-level node includes the number of blocks used by all its child nodes.
For operators that have been spilled, the buffer information shows the amount of data spilled. temp read indicates the number of times temporary data was read, write indicates the number of times temporary data was written to disk, and written_size indicates the amount of data spilled.
When the GUC parameter enable_spill_to_remote_storage for the spill-to-OBS feature is turned off, data is spilled to the DN instance directory.
When the GUC parameter enable_spill_to_remote_storage is turned on, additional spill statistics are displayed: written_disk_size indicates the amount of data spilled to the disk cache; written_obs_size indicates the amount of data spilled directly to OBS when disk cache space is insufficient; spill_obs_size indicates the amount of data written back from the disk cache to OBS when disk cache space is insufficient.
- Disk cache information
Disk Cache: indicates the hit information and data read information of the disk cache.
miss indicates the number of disk cache misses, and hit the number of disk cache hits. disk_cache_error_code and error indicate the error code and the number of errors. scanBytes indicates the amount of data requested by the scan, and remoteReadBytes the amount of data read from OBS. loadTime indicates the time for loading data from the disk cache, openTime the time for opening the disk cache file, and preadTime the time for reading data from disk. To improve OBS access efficiency, adjacent request blocks are combined, and the minimum granularity for writing data to the disk cache is one block (1 MB by default); as a result, scanBytes may be less than remoteReadBytes.
ReadAhead: indicates the information about data prefetch.
parseMetaTime indicates the time spent parsing file metadata during prefetch. submitBytes indicates the amount of data prefetched, and submitTime the time spent submitting prefetch requests. waitTime indicates the time spent waiting for prefetch requests to complete during data reading, and waitCount the number of prefetch requests that reads had to wait for. cancelCount indicates the number of prefetch requests that were hit during reading but had not yet been executed, and hitCount the number of prefetch requests hit during reading. fabricCacheHitCount is reported at two levels: L1Cache indicates the number of hits in the L1 data cache, and L2Cache the number of hits in the L2 data cache.
OBS I/O: details about an OBS I/O request.
count indicates the total number of OBS I/O requests. averageRTT indicates the average round trip time (RTT) of OBS I/O requests. The unit is μs. averageLatency indicates the average latency of OBS I/O requests. The unit is μs. latencyGt1s indicates the number of OBS I/O requests whose latency exceeds 1s. latencyGt10s indicates the number of OBS I/O requests whose latency exceeds 10s. retryCount indicates the total number of OBS I/O request retries. rateLimitCount indicates the total number of OBS I/O requests under flow control.
- Query Summary
This part displays statement-level execution information, including the execution time in each phase, maximum and minimum execution time in the initialization and end phases on each DN, initialization, execution, and end time on the CN, available system memory, estimated statement memory, and DOP during statement execution.
- DN execution information
- DataNode executor start time: DN executor start time. The format is [min_node_name, max_node_name]: [min_time, max_time].
- DataNode executor run time: DN executor running duration. The format is [min_node_name, max_node_name]: [min_time, max_time].
- DataNode executor end time: DN executor end time. The format is [min_node_name, max_node_name]: [min_time, max_time].
- Remote query poll time: poll waiting time for receiving results.
- Memory estimation information
- System available mem: available system memory.
- Query Max mem: maximum query memory.
- Query estimated mem: estimated statement memory.
- SMP adaptive concurrency information (displayed only when SMP adaptation is enabled)
- Initial DOP: DOP of the initial plan generated.
- Avail max core: number of vCPUs available for statement execution.
- Final Max DOP: maximum DOP of operators in the plan.
- CN execution time
- Coordinator executor start time: CN executor start time.
- Coordinator executor run time: CN executor running duration.
- Coordinator executor end time: CN executor end time.
- Parser runtime: parser running duration.
- Planner runtime: optimizer execution duration.
- Query Id: query ID.
- Total runtime: total execution duration.

- The difference between A-rows and E-rows reflects the deviation between the optimizer's estimation and actual execution. Generally, the larger the deviation, the more unreliable the plan generated by the optimizer is, and the more necessary manual intervention becomes.
- The larger the discrepancy in the A-time values, the greater the computational skew (difference in execution times across different DNs) for this operator, increasing the necessity for manual tuning. Typically, for two adjacent operators, the execution time of the higher-level operator includes that of the lower-level one. However, if the higher-level operator is a stream operator, its execution time might be shorter than that of the lower-level operator due to the absence of thread dependencies, meaning there is no inclusive relationship.
- Max Query Peak Memory is frequently used to estimate the memory consumption of SQL statements and serves as a crucial reference for setting runtime memory parameters during SQL tuning. Usually, the output from EXPLAIN ANALYZE or EXPLAIN PERFORMANCE is used as input for further optimization.