
ClickHouse Data Query Rules

This topic describes the rules and suggestions for querying ClickHouse data.

Data Query Rules

  • [Rule] Avoid using SELECT *; query only the required fields to reduce machine load and improve query performance.

    In OLAP analysis scenarios, a wide table can contain hundreds or even thousands of columns, while a typical query needs only a few of them for dimension and metric calculations. Because ClickHouse stores data by column, SELECT * forces every column to be read and puts unnecessary pressure on the system.
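
    For example, a minimal sketch (the table and column names are hypothetical): list only the dimensions and metrics the analysis actually needs instead of using SELECT *.

      -- Avoid: forces every column of the wide table to be read
      SELECT * FROM user_behavior_wide WHERE event_date = '2025-08-08';

      -- Prefer: reads only the columns needed for the analysis
      SELECT user_id, event_type, sum(cost) AS total_cost
      FROM user_behavior_wide
      WHERE event_date = '2025-08-08'
      GROUP BY user_id, event_type;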

  • [Rule] Limit the amount of data returned by queries to save computing resources and reduce network overhead.

    Returning large volumes of data can cause client-side problems such as out-of-memory errors. If ClickHouse serves front-end queries and large result sets are unavoidable, paginate the queries to reduce network bandwidth and computing resource usage.
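
    A minimal pagination sketch (table and column names are hypothetical); each request fetches one bounded page instead of the whole result set:

      -- Page 3 with 1,000 rows per page; a deterministic ORDER BY keeps paging stable
      SELECT order_id, user_id, amount
      FROM orders
      ORDER BY order_id
      LIMIT 1000 OFFSET 2000;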

  • [Rule] Join large tables with small tables for associated queries.

    ClickHouse performs best when multi-table join models are pre-processed into wide table models. If the fact tables and dimension tables change frequently, a join model can still be used to query the latest data in real time. In that case, always join the large table with the small table on an explicit join condition, with the large table on the left; keep the small table within millions to tens of millions of rows and filter it with the query conditions before joining.
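
    A sketch of the recommended pattern (table names are hypothetical): the large fact table stays on the left, and the small dimension table is filtered down before the join.

      SELECT f.order_id, f.amount, d.region_name
      FROM orders_fact AS f
      INNER JOIN
      (
          -- Filter the small dimension table before joining
          SELECT region_id, region_name
          FROM dim_region
          WHERE is_active = 1
      ) AS d
      ON f.region_id = d.region_id;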

  • [Suggestion] Use GLOBAL JOIN/IN instead of an ordinary JOIN/IN.

    When querying a distributed table, ClickHouse rewrites the query into local-table operations on every shard and then aggregates the results. With an ordinary JOIN or IN whose right-hand side is also a distributed table, each shard re-sends that subquery to all other shards, multiplying the work; GLOBAL JOIN/IN executes the right-hand subquery once on the initiator node and broadcasts the result to the shards. Therefore, use GLOBAL JOIN/IN when querying distributed tables.
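
    A sketch using GLOBAL IN on distributed tables (table names are hypothetical); the inner subquery runs once and its result is broadcast to all shards rather than being re-executed on each shard:

      SELECT count()
      FROM orders_distributed
      WHERE user_id GLOBAL IN
      (
          SELECT user_id
          FROM users_distributed
          WHERE vip_level >= 3
      );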

  • [Rule] Properly use partition fields and index fields in data tables.
    • The MergeTree engine organizes data in partition directories, so queries that filter on the partition key can skip irrelevant partitions and read far fewer data files.
    • Within each data part, MergeTree sorts rows by the index (sorting) field and builds a sparse index with one entry every index_granularity rows, enabling quick data filtering and improving query performance. See the table definition sketch below.
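
    A minimal table definition sketch (table and column names are hypothetical) showing a partition key, a sorting key, and the index_granularity setting:

      CREATE TABLE user_events
      (
          event_date  Date,
          user_id     UInt64,
          event_type  LowCardinality(String),
          cost        Float64
      )
      ENGINE = MergeTree()
      PARTITION BY toYYYYMM(event_date)    -- queries filtering on event_date can prune whole partitions
      ORDER BY (user_id, event_date)       -- sorting key drives the sparse primary index
      SETTINGS index_granularity = 8192;   -- one index mark per 8192 rows (the default)
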
  • [Suggestion] Specify the data query scope.

    Use filter criteria and a bounded query period to narrow the query scope. For example, when the target data sits in a specific partition, filter on the partition field so that only that partition's files are scanned. For a large table with thousands of columns and 700 partitions, querying the roughly 70 million records of a single partition takes hundreds of milliseconds, while scanning all partitions takes 1-2 seconds; the former is about 20 times faster.
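
    A sketch reusing the hypothetical user_events table above: filtering on the partition field lets ClickHouse scan only the matching partition instead of the whole table.

      SELECT event_type, count() AS cnt
      FROM user_events
      WHERE event_date >= '2025-08-01'
        AND event_date <  '2025-09-01'    -- restricts the scan to one monthly partition
      GROUP BY event_type;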

  • [Suggestion] Exercise caution when using FINAL queries.

    The FINAL modifier is specified after the table name in the FROM clause. For the ReplacingMergeTree engine, whose background merges do not guarantee complete deduplication, developers sometimes add FINAL to merge data at read time and ensure that no duplicates are returned; this merge-on-read is expensive. Consider using the argMax function or other deduplication methods to avoid the overhead.
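
    A deduplication sketch for a hypothetical ReplacingMergeTree table orders_rmt keyed by order_id with a version column update_time; argMax returns the value belonging to the latest update_time for each key without forcing a merge at read time.

      -- Heavier: FINAL triggers merge-on-read across the scanned parts
      SELECT * FROM orders_rmt FINAL WHERE order_id = 1001;

      -- Lighter alternative: keep the latest version of each row with argMax
      SELECT
          order_id,
          argMax(status, update_time) AS status,
          argMax(amount, update_time) AS amount
      FROM orders_rmt
      GROUP BY order_id;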

  • [Suggestion] Use materialized views to accelerate queries.

    In scenarios with fixed query patterns, materialized views can pre-aggregate data, delivering significantly better performance than querying the detail tables directly.
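
    A minimal materialized view sketch (names are hypothetical) that pre-aggregates the detail table into daily totals; fixed-pattern queries can then read the much smaller aggregated data.

      CREATE MATERIALIZED VIEW user_events_daily_mv
      ENGINE = SummingMergeTree()
      ORDER BY (event_date, event_type)
      AS SELECT
          event_date,
          event_type,
          count() AS cnt,
          sum(cost) AS total_cost
      FROM user_events
      GROUP BY event_date, event_type;

      -- Query the pre-aggregated data; sum() guards against parts that are not yet merged
      SELECT event_date, event_type, sum(cnt), sum(total_cost)
      FROM user_events_daily_mv
      GROUP BY event_date, event_type;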

  • [Suggestion] Fully verify materialized views before bringing them online.

    Syntax verification is not performed when a materialized view is created; errors surface only when data is inserted into or queried from the view, so validate it thoroughly before it goes live.