Updated on 2024-08-30 GMT+08:00

Data Query

This topic describes the rules and suggestions for querying Doris data.

Doris Data Query Rules

  • When you are using the data query code, retry the query and issue the query again if the query fails.
  • If the enumerated value of the constant in exceeds 1000, you must use a subquery.
  • Do not use the Statement Execution Action REST APIs to execute a large number of SQL queries. These interfaces are used only for cluster maintenance.
  • When dealing with query results exceeding 50,000 records, consider using either JDBC Catalog or the OUTFILE method to export the data. Otherwise, allowing a large amount of data to accumulate on the front end (FE) can impact cluster stability.
    • When performing interactive queries, export data using pagination with an offset limit. You can achieve this by using the ORDER BY command.
    • If data is exported for a third party, use outfile or export.
  • Utilize Colocation Join for joining more than two tables with over 300 million records.
  • Avoid using select * for querying large tables with hundreds of millions of records and specify the required fields instead.
    • Do not use select * when you use SQL Block.
    • For high-concurrency point queries, enable row-based storage (supported by Doris 2.x) and use PreparedStatement.
  • Bucketing conditions must be set for queries of tables of hundreds of millions records.
  • Do not perform full-partition scan on partitioned tables.

Doris Data Query Suggestions

  • When an INSERT INTO SELECT statement inserts over 100 million data records, divide them into smaller batches for execution.
  • Do not use OR as a JOIN condition.
  • Do not frequently delete and modify data. Instead, delete data in batches occasionally with conditions to improve system stability and deletion efficiency.
  • To return som data after sorting a large amount of data (more than 500 million records), reduce the data range for sorting. Sorting a large amount of data affects query performance. The following is an example:

    Instead of using from table order by datatime desc limit 10, use from table where datatime='2023-10-20' order by datatime desc limit 10.

  • Pay attention to the following points when using parallel_fragment_exec_instance_num to optimize query task performance:

    This parameter determines the maximum number of fragments that can run simultaneously at the session level. Too many concurrent fragments will use up a significant amount of CPU resources. You can leave this parameter blank. If you need to set this parameter to accelerate query speed, comply with the following rules:

    • Do not set this parameter to take effect globally, that is, do not use the set global command to set this parameter.
    • Set this parameter to an even number (2 or 4). The maximum value cannot exceed half of the number of CPU cores on a single node.
    • Check the CPU usage before you set the parameter. You can set this parameter only when the CPU usage is less than 50%.
    • If you use insert into select to insert a large amount of data, do not set this parameter.