Doris Data Query Rules
This topic describes the rules and suggestions for querying Doris data.
Data Query Rules
- [Mandatory] Modify the query to a subquery if the in conditions exceed 2,000.
- [Mandatory] Refrain from using the REST API for executing numerous SQL queries. It is intended solely for cluster maintenance.
- [Optional] When INSERT INTO SELECT statements exceed 100 million, divide them into smaller batches for execution. To expedite data import during idle cluster resources, adjust concurrency settings.
For example, setting parallel_fragment_exec_instance_num to 8, which is half the CPU cores on a single Backend (BE), is recommended.
- [Mandatory] For result sets larger than 50,000, utilize JDBC Catalog or the outfile method for export to prevent overloading Frontend (FE) resources and ensure cluster stability.
- Employ pagination (offset limit) with an ORDER BY clause for interactive queries.
- Consider outfile or export methods when exporting data to third parties.
- [Mandatory] Utilize Colocation Join for joining more than two tables with over 300 million records.
- [Mandatory] Avoid using SELECT * for querying large tables with hundreds of millions of records and specify the required fields instead.
- Use the SQL Block method to forbid broad queries.
- For high-concurrency point queries, consider enabling row storage (version 2.x).
- Use PreparedStatement for queries.
- [Mandatory] Specify bucket conditions when querying tables with over 100 million records.
- [Optional] void using OR as a JOIN condition.
- [Optional] Narrow down the data range before sorting large datasets (whose records exceeds 500 million) to enhance performance.
For example, instead of FROM table ORDER BY datetime DESC LIMIT 10, use FROM table WHERE datetime='2023-10-20' ORDER BY datetime DESC LIMIT 10.
- [Mandatory] Do not perform full-partition scan on partitioned tables.
- [Optional] Do not frequently delete and modify data. Instead, delete data in batches occasionally with conditions to improve system stability and deletion efficiency.
- [Optional] Pay attention to the following points when using parallel_fragment_exec_instance_num to optimize query task performance:
This parameter determines the maximum number of fragments that can run simultaneously at the session level. Too many concurrent fragments will use up a significant number of CPU resources. You can leave this parameter blank. If you need to set this parameter to accelerate query speed, comply with the following rules:
- Do not set this parameter to take effect globally, that is, do not use the set global command to set this parameter.
- Set this parameter to an even number (2 or 4). The maximum value cannot exceed half of the number of CPU cores on a single node.
- Check the CPU usage before you set the parameter. You can set this parameter only when the CPU usage is less than 50%.
- If you use insert into select to insert a large amount of data, do not set this parameter.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot