Updated on 2025-08-08 GMT+08:00

ClickHouse Data Writing Rules

This topic describes the rules and suggestions for writing ClickHouse data.

Data Write Rules

  • [Rule] Use external modules to ensure idempotence of data import.

    ClickHouse does not provide transactional guarantees for data writes. Use an external import module to ensure data idempotence. For example, if a batch fails to import, drop the corresponding partition (or otherwise clear the partially imported data) and then re-import the partition or batch.
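
    For illustration, a minimal sketch assuming a hypothetical day-partitioned table demo.ods_events (all names here are examples, not part of the product): if the batch for one day fails halfway, the import module drops the partition and replays the whole batch.

      -- Hypothetical day-partitioned table used in the sketches in this topic.
      CREATE TABLE demo.ods_events
      (
          event_time DateTime,
          user_id    UInt64,
          action     String
      )
      ENGINE = MergeTree()
      PARTITION BY toYYYYMMDD(event_time)
      ORDER BY (user_id, event_time);

      -- A batch for 2025-08-01 failed halfway: drop the partition so the
      -- import module can replay the whole batch idempotently.
      ALTER TABLE demo.ods_events DROP PARTITION 20250801;

      -- Replay the full batch for that partition.
      INSERT INTO demo.ods_events (event_time, user_id, action) VALUES
          ('2025-08-01 10:00:00', 1001, 'click'),
          ('2025-08-01 10:00:05', 1002, 'view');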

  • [Rule] Write data in large batches at a low frequency.

    Each time data is inserted into ClickHouse, one or more part files are generated. If too many data parts accumulate, the merge pressure increases, which can trigger exceptions (for example, "Too many parts" errors) and affect data insertion. It is recommended that 5,000 to 100,000 rows be written per batch; adjust the row count based on the number of fields to reduce memory and CPU pressure on the node receiving the writes. Keep the write frequency to no more than one insert per second.
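
    As an illustration, reusing the hypothetical demo.ods_events table from the sketch above, a writer should accumulate a batch and issue one INSERT for all of it rather than one INSERT per row:

      -- One INSERT per batch creates one data part (per partition touched),
      -- instead of one part for every row.
      INSERT INTO demo.ods_events (event_time, user_id, action) VALUES
          ('2025-08-01 10:00:00', 1001, 'click'),
          ('2025-08-01 10:00:01', 1002, 'view'),
          -- ... in practice 5,000 to 100,000 rows accumulated before flushing ...
          ('2025-08-01 10:00:02', 1003, 'view');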

  • [Suggestion] Insert data into only one partition at a time.

    If the rows in one insert belong to different partitions, the insert generates a separate part file for every partition it touches, increasing the total number of parts. It is recommended that all data inserted in a batch belong to the same partition.
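
    A minimal sketch with the same hypothetical table, which is partitioned by toYYYYMMDD(event_time):

      -- Good: all rows fall into partition 20250801, so the insert creates one part.
      INSERT INTO demo.ods_events (event_time, user_id, action) VALUES
          ('2025-08-01 10:00:00', 1001, 'click'),
          ('2025-08-01 11:30:00', 1002, 'view');

      -- Avoid: the rows span two days, so a single insert creates a part in each
      -- of the two partitions.
      INSERT INTO demo.ods_events (event_time, user_id, action) VALUES
          ('2025-08-01 10:00:00', 1001, 'click'),
          ('2025-08-02 10:00:00', 1002, 'view');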

  • [Suggestion] Exercise caution when inserting data into distributed tables in batches (see the sketch after this list).
    • When data is written to a distributed table, it is spread across all the local tables in the cluster, and each local table receives roughly 1/N of the inserted rows (N being the number of shards). Small batches therefore produce many small data parts, increasing merge pressure and potentially affecting data insertion.
    • Data consistency: data is first written to the node hosting the distributed table and then forwarded asynchronously to the nodes hosting the local tables. There is no consistency check, so if the node hosting the distributed table fails, data may be lost.
    • Writing to a distributed table is slower than writing to the local tables directly, because the disk and network I/O of the node handling the distributed table can become a performance bottleneck.
    • The client inserting the data cannot tell whether the distributed table has successfully forwarded the data to every shard. If forwarding fails, the node hosting the distributed table keeps retrying in the background, consuming CPU resources.
    • Inserting through a distributed table is suitable mainly for deduplication scenarios, where the sharding key routes rows with the same key to the same shard so that they can be deduplicated in subsequent queries.
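
    A minimal sketch of the two write paths, assuming a hypothetical cluster named default_cluster and hypothetical table names:

      -- Local table created on every node of the cluster.
      CREATE TABLE demo.ods_events_local ON CLUSTER default_cluster
      (
          event_time DateTime,
          user_id    UInt64,
          action     String
      )
      ENGINE = MergeTree()
      PARTITION BY toYYYYMMDD(event_time)
      ORDER BY (user_id, event_time);

      -- Distributed table that routes rows by a sharding key (here user_id,
      -- so rows with the same user end up on the same shard).
      CREATE TABLE demo.ods_events_dist ON CLUSTER default_cluster
      AS demo.ods_events_local
      ENGINE = Distributed(default_cluster, demo, ods_events_local, user_id);

      -- Preferred for bulk loads: write each batch directly to a local table.
      INSERT INTO demo.ods_events_local (event_time, user_id, action) VALUES
          ('2025-08-01 10:00:00', 1001, 'click');

      -- Write through the distributed table only when the sharding key is needed,
      -- for example to co-locate duplicate keys for later deduplication.
      INSERT INTO demo.ods_events_dist (event_time, user_id, action) VALUES
          ('2025-08-01 10:00:01', 1001, 'view');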
  • [Suggestion] Exercise caution when performing delete and update operations.

    Standard SQL update and delete operations are synchronous: the client waits for the server to return the execution result. In ClickHouse, however, update and delete are asynchronous mutations. When an update statement is executed, the server immediately returns the request status while the change itself is only queued in the background. As a result, later operations may overwrite earlier ones, and atomicity cannot be guaranteed. For service scenarios that require update and delete operations, use the ReplacingMergeTree, CollapsingMergeTree, or VersionedCollapsingMergeTree engine instead. For details, see Table Engines.
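
    A minimal sketch of the ReplacingMergeTree approach, using a hypothetical demo.user_profile table: an "update" becomes an insert with a higher version, and the engine keeps the latest row per sorting key when parts are merged.

      CREATE TABLE demo.user_profile
      (
          user_id UInt64,
          name    String,
          version UInt64
      )
      ENGINE = ReplacingMergeTree(version)
      ORDER BY user_id;

      -- "Update" user 1001 by inserting a row with a larger version.
      INSERT INTO demo.user_profile VALUES (1001, 'old name', 1);
      INSERT INTO demo.user_profile VALUES (1001, 'new name', 2);

      -- FINAL forces deduplication at query time (merges happen asynchronously).
      SELECT * FROM demo.user_profile FINAL WHERE user_id = 1001;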

  • [Suggestion] Exercise caution when performing the optimize operation.

    The optimize operation forces data parts to be merged and, in effect, rewrites the table. Perform it during off-peak hours, because it consumes a large amount of I/O, memory, and CPU and can slow down or disrupt service queries.
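
    For example (table and partition are hypothetical), scoping the statement to a single partition limits the amount of data rewritten:

      -- Merge all parts of one partition into a single part during off-peak hours.
      -- FINAL makes the merge unconditional and is expensive on large tables.
      OPTIMIZE TABLE demo.ods_events PARTITION 20250801 FINAL;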

  • [Suggestion] Run the ALTER TABLE statement for each column individually.
    • Try to avoid adding or deleting data columns.

      Plan the columns that will be needed in the future and reserve enough of them in advance, so that a large number of ALTER TABLE MODIFY operations are not needed while the service is running in production. Otherwise, unpredictable performance problems and data inconsistencies may occur.
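
      If a column change is unavoidable, a minimal sketch of the per-column pattern recommended above (table and column names are hypothetical):

        -- Preferred: one ALTER TABLE statement per column.
        ALTER TABLE demo.ods_events ADD COLUMN device String DEFAULT '';
        ALTER TABLE demo.ods_events ADD COLUMN channel String DEFAULT '';

        -- Avoid bundling several column changes into a single statement:
        -- ALTER TABLE demo.ods_events
        --     ADD COLUMN device String DEFAULT '',
        --     ADD COLUMN channel String DEFAULT '';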