Updated on 2024-05-11 GMT+08:00

Data Query

This section describes the rules and suggestions for querying Doris data.

Doris data query rules

  • In the data query service code, you are advised to retry the query when the query fails and issue the query again.
  • If the enumerated value of the constant in exceeds 1000, the constant must be changed to a subquery.
  • Do not use REST API (Statement Execution Action) to execute a large number of SQL queries. This interface is used only for cluster maintenance.
  • If the number of query results exceeds 50,000, use JDBC Catalog or OUTFILE to export the query data. Otherwise, a large amount of data on the FE will occupy FE resources, affecting cluster stability.
    • For interactive query, you are advised to export data in pagination mode (offset limit). The pagination command is Order by.
    • If data is exported for a third party, the outfile or export mode is recommended.
  • Colocation Join is used for JOIN of more than two tables with more than 300 million records.
  • Do not use select * to query data in hundreds of millions of large tables. Specify the fields to be queried during query.
    • Use the SQL Block mode to forbid the select * operation.
    • For high-concurrency point queries, you are advised to enable row-based storage (supported by Doris 2.x) and use PreparedStatement for queries.
  • Bucket conditions must be set for querying hundreds of millions of tables.
  • Do not perform full-partition data scanning on a partitioned table.

Doris Data Query Suggestions

  • If the number of insert into select statements exceeds 100 million, you are advised to split the statements into multiple insert into select statements and execute them in multiple batches.
  • Do not use OR as a JOIN condition.
  • You are not advised to frequently delete and modify data. You can save the data to be deleted in batches and delete the data in batches occasionally. In addition, you need to specify conditions to improve system stability and deletion efficiency.
  • Some data is returned after a large amount of data (more than 500 million) is sorted. You are advised to reduce the data range before performing the sorting. Otherwise, the performance will be affected if a large amount of data is sorted. The following is an example.

    Instead of from table order by datatime desc limit 10, use from table where datatime='2023-10-20' order by datatime desc limit 10.

  • Pay attention to the following points when using parallel_fragment_exec_instance_num to query task performance optimization parameters:

    This parameter is set at the session level and indicates the number of fragments that can be concurrently executed. This parameter consumes a large number of CPU resources. Therefore, you do not need to set this parameter. If you need to set this parameter to accelerate query performance, comply with the following rules:

    • Do not set this parameter to take effect globally. Do not use the set global command to set this parameter.
    • You are advised to set this parameter to an even number 2 or 4. The maximum value cannot exceed half of the number of CPU cores on a single node.
    • When setting this parameter, you need to observe the CPU usage. You can set this parameter only when the CPU usage is less than 50%.
    • If the query SQL statement is insert into select with a large amount of data, you are advised not to set this parameter.