ClickHouse Application Development Suggestions

Properly configure the maximum number of concurrent operations

ClickHouse has a high processing speed because it uses the parallel processing mechanism. Even if a query is performed, half of the CPU of the server is used by default. Therefore, the ClickHouse does not support high-concurrency query scenarios. The default maximum number of concurrent requests is 100. You can adjust this number as needed, but it should be no more than 200.

Deploy the load balancing component. The query is performed based on the load balancing component to prevent the performance from being affected due to heavy single-point query pressure

ClickHouse can connect to any node in the cluster for query. If the query is performed on one node, the node may be overloaded and the reliability is low. You are advised to use ClickHouseBalancer or other load balancing services to balance the query load and improve reliability.

Properly set the partition key, ensure that the number of partitions is less than 1000, and use the integer type for the partition field

You are advised to use toYYYYMMDD (table field pt_d) as the partition key. The table field pt_d is of the date type.
If hourly partitioning is required in the service scenario, use toYYYYMMDD (table field pt_d) and toYYYYMMDD (table field pt_h) as the joint partitioning key. toYYYYMMDD (table field pt_h) is an integer number of hours.
If data needs to be stored for many years, you are advised to create partitions by month, for example, toYYYYMM (table field pt_d).
Properly control the number of parts based on factors such as the data partition granularity, volume of data submitted in each batch, and data storage period.

During query, the most frequently used and most filtered fields are used as the primary keys. The fields are sorted in descending order of access frequency and dimension cardinality

Data is sorted and stored based on primary keys. When querying data, you can quickly filter data based on primary keys. Setting primary keys properly during table creation can greatly reduce the amount of data to be read and improve query performance. For example, if the service ID needs to be specified for all analysis, the service ID field can be used as the first field of the primary key.

Properly set the sparse index granularity based on service scenarios

The primary key index of ClickHouse is stored by using a sparse index. The default sampling granularity of the sparse index is 8192 rows, that is, one record is selected from every 8192 rows in the index file.

Suggestions:

The smaller the index granularity is, the more effective the query in a small range is. This avoids the waste of query resources.
The larger the index granularity is, the smaller the index file is, and the faster the index file is processed.
If the table index granularity exceeds 1 billion, set this parameter to 16384. Otherwise, set this parameter to 8192 or a smaller value.

Local Table Creation Reference

Reference:

CREATE TABLE mybase_local.mytable
(
    `did` Int32,
    `app_id` Int32,
    `region` Int32,
    `pt_d` Date
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/mybase_local/mytable', '{replica}')
PARTITION BY toYYYYMMDD(pt_d)
ORDER BY (app_id, region)
SETTINGS index_granularity = 8192, use_minimalistic_part_header_in_zookeeper = 1;

Instructions:

Select a table engine:
ReplicatedMergeTree: MergeTree engine that supports the replica feature. It is the most commonly used engine.
Table information registration path on ZooKeeper, which is used to distinguish different configurations in the cluster:
/clickhouse/tables/{shard}/{databaseName}/{tableName}: {shard} indicates the shard name, {databaseName} indicates the database name, and {tableName} indicates the replicated table name.

order by primary key field:
The most frequently used and most filterable field is used as the primary key. The dimensions are sorted in ascending order of access frequency and dimension cardinality. It is recommended that the number of sorting fields be less than or equal to 4. Otherwise, the merge pressure is high. The sorting field cannot be null. If the sorting field is null, data conversion is required.

partition by field
The partition key cannot be null. If the field contains a null value, data conversion is required.

Table-level parameter configuration:
index_granularity: sparse index granularity. The default value is 8192.

use_minimalistic_part_header_in_zookeeper: whether to enable the optimized storage mode of the new version for data storage in the ZooKeeper.

For details about how to create a table, visit https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/mergetree/.

Distributed Table Creation Reference

Reference:

CREATE TABLE mybase.mytable AS mybase_local.mytable
ENGINE = Distributed(cluster_3shards_2replicas, mybase_local, mytable, rand());

Instructions:

Name of the distributed table: mybase.mytable.
Name of the local table: mybase_local.mytable.
Use AS to associate the distributed table with the local table to ensure that the field definitions of the distributed table are the same as those of the local table.
Parameter description of the distributed table engine:
cluster_3shards_2replicas: name of a logical cluster.

mybase_local: name of the database where the local table is located.

mytable: local table name.

rand(): (optional) sharding key, which can be the raw data (such as did) of a column in the table or the result of a function call, such as rand(). Note that data must be evenly distributed in this key. Another common operation is to use the hash value of a column with a large difference, for example, intHash64(user_id).

Select the minimum type that meets the requirements based on the fields in the service scenario table

Numeral type, such as UInt8/UInt16/UInt32/UInt64, Int8/Int16/Int32/Int64, Float32/Float64. The performance varies according to the length.

Perform data analysis based on large and wide tables. Do not join large tables. Convert distributed join queries into join queries of local tables to improve performance

The performance of ClickHouse distributed join is poor. You are advised to aggregate data into a wide table on the model side and then import the table to ClickHouse. Queries in distributed join mode are converted to join queries on local tables. This eliminates the transmission of a large volume of data between nodes and reduces the volume of data involved in the calculation of local tables. The service layer summarizes data based on the local join results of all shards. The performance is improved remarkably.

Properly set the part size

The min_bytes_to_rebalance_partition_over_jbod parameter indicates the minimum size of the part involved in automatic balancing and distribution among disks in a JBOD array. The value must be appropriately set.

If the value is smaller than max_bytes_to_merge_at_max_space_in_pool/1024, the ClickHouse server process fails to be started and unnecessary parts move between disks.

If the value of min_bytes_to_rebalance_partition_over_jbod is greater than that of max_data_part_size_bytes (maximum size of parts that can be stored on disks in one array), no part can meet the condition for automatic balancing.

Parent topic: ClickHouse

Previous topic: ClickHouse Application Development Rules

Next topic: Doris