Updated on 2026-02-06 GMT+08:00

Doris Data Query Rules

This topic describes the rules and suggestions for querying Doris data.

Data Query Rules

  • [Mandatory] Modify the query to a subquery if the in conditions exceed 2,000.
  • [Mandatory] Refrain from using the REST API for executing numerous SQL queries. It is intended solely for cluster maintenance.
  • [Optional] When INSERT INTO SELECT statements exceed 100 million, divide them into smaller batches for execution. To expedite data import during idle cluster resources, adjust concurrency settings.

    For example, setting parallel_fragment_exec_instance_num to 8, which is half the CPU cores on a single Backend (BE), is recommended.

  • [Mandatory] For result sets larger than 50,000, utilize JDBC Catalog or the outfile method for export to prevent overloading Frontend (FE) resources and ensure cluster stability.
    • Employ pagination (offset limit) with an ORDER BY clause for interactive queries.
    • Consider outfile or export methods when exporting data to third parties.
  • [Mandatory] Utilize Colocation Join for joining more than two tables with over 300 million records.
  • [Mandatory] Avoid using SELECT * for querying large tables with hundreds of millions of records and specify the required fields instead.
    • Use the SQL Block method to forbid broad queries.
    • For high-concurrency point queries, consider enabling row storage (version 2.x).
    • Use PreparedStatement for queries.
  • [Mandatory] Specify bucket conditions when querying tables with over 100 million records.
  • [Optional] void using OR as a JOIN condition.
  • [Optional] To return some data after sorting a large amount of data (more than 500 million records), reduce the data range for sorting. Sorting a large amount of data affects query performance.

    For example, instead of FROM table ORDER BY datetime DESC LIMIT 10, use FROM table WHERE datetime='2023-10-20' ORDER BY datetime DESC LIMIT 10.

  • [Mandatory] Do not perform full-partition scan on partitioned tables.
  • [Optional] Do not frequently delete and modify data. Instead, delete data in batches occasionally with conditions to improve system stability and deletion efficiency.
  • [Optional] Pay attention to the following points when using parallel_fragment_exec_instance_num to optimize query task performance:

    This parameter determines the maximum number of fragments that can run simultaneously at the session level. Too many concurrent fragments will use up a significant number of CPU resources. You can leave this parameter blank. If you need to set this parameter to accelerate query speed, comply with the following rules:

    • Do not set this parameter to take effect globally, that is, do not use the set global command to set this parameter.
    • Set this parameter to an even number (2 or 4). The maximum value cannot exceed half of the number of CPU cores on a single node.
    • Check the CPU usage before you set the parameter. You can set this parameter only when the CPU usage is less than 50%.
    • If you use insert into select to insert a large amount of data, do not set this parameter.