ClickHouse Basic Principles

ClickHouse Overview

ClickHouse offers easy-to-use, flexible, and stable hosting services in the cloud. You can create a data warehouse in minutes to query and analyze massive amounts of data in real time, improving the overall efficiency of data value mining. By leveraging the massively parallel processing (MPP) architecture, ClickHouse can query data several times faster than traditional data warehouses.

ClickHouse Principles

ClickHouse is a columnar, distributed database management system (DBMS) designed mainly for Online Analytical Processing (OLAP) workloads. It can generate analytical reports in real time using SQL queries, and lets you create tables and databases, load data, and run queries at runtime. ClickHouse is simple, reliable, and fault-tolerant. It achieves high-performance data querying through a combination of features such as the MPP architecture, distributed computing, and column-oriented storage.
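
The way an MPP system splits one query across nodes and then combines the partial results can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `shards` data, the column names `status` and `latency_ms`, and the functions `partial_aggregate` and `distributed_avg` are hypothetical, and Python threads stand in for cluster nodes. It is not how ClickHouse is implemented internally.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each holds part of a table as column arrays
shards = [
    {"status": ["ok", "error", "ok"], "latency_ms": [12, 250, 18]},
    {"status": ["ok", "ok"], "latency_ms": [9, 30]},
    {"status": ["error", "ok", "ok"], "latency_ms": [400, 22, 15]},
]

def partial_aggregate(shard):
    """Per-shard step: scan only the two columns the query needs."""
    total = 0
    count = 0
    for status, latency in zip(shard["status"], shard["latency_ms"]):
        if status == "ok":
            total += latency
            count += 1
    return total, count

def distributed_avg(shards):
    """Coordinator step: run the shards in parallel, then merge partial states."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_aggregate, shards))
    total = sum(t for t, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None

print(distributed_avg(shards))
```

The sketch corresponds roughly to averaging `latency_ms` over rows whose `status` is `ok`: each shard returns a small (sum, count) pair rather than raw rows, so only compact intermediate states travel to the coordinator.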

Figure 1 ClickHouse architecture

**Table 1** Modules in the ClickHouse architecture
Item	Description
MergeTree	The MergeTree engine is the most commonly used storage engine in ClickHouse and applies to scenarios requiring efficient data insertion and querying. Data is initially written as multiple small parts, which are then merged in the background into larger ones to optimize query performance. Partitioning, sorting keys, primary keys, and secondary indexes are supported.
Query engine	ClickHouse leverages the MPP architecture and a vectorized execution engine to significantly improve the performance of complex analytical queries. After a task is delivered to a cluster, execution is parallelized at three levels: (1) parallel queries across cluster nodes; (2) collaborative computing between multiple replicas; (3) multi-core CPU parallel computing within a node, together with parallel computing between independent operators.
Storage engine	ClickHouse employs column-oriented storage, which stores data of the same type together, column by column. This structure suits analytical scenarios: ClickHouse reads only the columns a query requires, reducing I/O overhead and improving query performance.
Materialized view	A materialized view stores the results of a query as a physical table to optimize the query performance. By storing precomputed and complex query results, a materialized view eliminates the need for repeated computation. Unlike regular views, materialized views store query results persistently and can be directly accessed to improve query efficiency.
RBO	A rule-based optimizer (RBO) uses predefined logical rules and physical rules to convert and optimize query plans. These rules restructure the execution logic while ensuring the correctness of query results, significantly improving query performance.
CBO	A cost-based optimizer (CBO) is a core component of the query optimizer. It dynamically selects the optimal execution plan based on collected statistics. Compared with an RBO, a CBO can optimize complex queries more accurately, especially in key scenarios such as join strategy selection, aggregation calculation optimization, and predicate pushdown.
Atomicity	Atomicity ensures that a transaction is treated as an indivisible work unit, meaning that either all of its operations are successfully committed, or none of them are applied, and the system is rolled back to its original state. There is no partial execution state.
DDL	Data Definition Language (DDL) is a standardized language used to create, modify, and manage database structures. It defines the logical framework of data entities (such as tables, views, and indexes) and data schema.
Index	ClickHouse optimizes query performance through an indexing mechanism. Based on the vertical pruning of column-oriented storage, ClickHouse supports partition keys, primary keys, and multiple secondary indexes. These index structures efficiently locate data blocks horizontally, quickly filter out irrelevant data blocks, and significantly reduce the disk scanning scope, thereby improving query efficiency.
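
The MergeTree and Index rows above can be made more concrete with a small sketch: sorted parts are merged into one larger part, and a sparse index that keeps only the first key of each block (granule) is used to skip blocks that cannot match a range filter. Everything here is a simplified illustration under assumptions: the functions `merge_parts`, `build_sparse_index`, and `granules_for_range` are hypothetical names, and the granule size of 4 is chosen for readability (the ClickHouse default index granularity is 8192 rows).

```python
import bisect
import heapq

GRANULE_SIZE = 4  # illustrative only; ClickHouse defaults to 8192 rows

def merge_parts(parts):
    """Merge several sorted parts into one larger sorted part."""
    return list(heapq.merge(*parts))

def build_sparse_index(part, granule_size=GRANULE_SIZE):
    """Keep only the first key of each granule (block of rows)."""
    return [part[i] for i in range(0, len(part), granule_size)]

def granules_for_range(index, lo, hi):
    """Return the granules that may contain keys in [lo, hi].

    The choice is conservative: a granule is kept whenever it cannot
    be ruled out from the index marks alone.
    """
    first = max(bisect.bisect_left(index, lo) - 1, 0)
    last = bisect.bisect_right(index, hi) - 1
    return list(range(first, last + 1))

# Three small sorted parts, as if written by three separate inserts
parts = [[1, 4, 9, 12], [2, 3, 10, 15], [5, 6, 7, 20]]
merged = merge_parts(parts)
index = build_sparse_index(merged)

print(merged)                            # one sorted part
print(index)                             # first key of each granule
print(granules_for_range(index, 6, 9))   # granules to scan for 6 <= key <= 9
```

In this example the merged part has three granules, and a filter on keys 6 through 9 needs to read only one of them. The index stays small because it stores one entry per granule rather than one per row.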

Advantages

High compression ratio: ClickHouse employs column-oriented storage, so data of the same type is stored together in the same column, which yields a higher data compression ratio. Generally, the compression ratio can reach 10:1, significantly reducing storage costs and read overhead, and improving query performance.
Replication mechanism: ClickHouse supports data replication using ZooKeeper and the Replicated series of table engines, such as ReplicatedMergeTree. When creating a table, you specify the storage engine and thereby determine whether the table is replicated.
Ease of use: You can create a ClickHouse analysis cluster in minutes on the console. No underlying infrastructure management is needed, so you can focus on analyzing data value with full SQL statements.
Superior performance: Queries are processed as quickly as possible by using the distributed MPP architecture and all available hardware. Queries run several times faster than on traditional data warehouses, and a single query can process up to terabytes of data per second.
Security and reliability: Your clusters are independently deployed in isolated VPCs for more secure data access.
Lower costs: The hosted ClickHouse cluster is built on cost-effective cloud resources.
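
Why column-oriented storage tends to compress better can be illustrated with a toy run-length encoding. This is an analogy only: ClickHouse itself relies on general-purpose codecs such as LZ4 and ZSTD, and the rows, column names, and `run_length_encode` function below are hypothetical. The point is that values from one column, stored contiguously and sorted, repeat far more often than values interleaved row by row.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

# Hypothetical rows (region, status), already sorted by region
rows = [
    ("eu-west", "ok"), ("eu-west", "error"), ("eu-west", "ok"),
    ("us-east", "ok"), ("us-east", "ok"), ("us-east", "error"),
]

# Row-oriented layout: values of different columns are interleaved
row_stream = [value for row in rows for value in row]

# Column-oriented layout: each column is stored contiguously
region_column = [region for region, _ in rows]
status_column = [status for _, status in rows]

row_runs = len(run_length_encode(row_stream))
column_runs = (len(run_length_encode(region_column))
               + len(run_length_encode(status_column)))

print(run_length_encode(region_column))
print(row_runs, column_runs)
```

Here the interleaved stream has no adjacent repeats, so it encodes as 12 runs, while the two columns together encode as 6 runs. Real ratios depend on the data and the codec, so a figure such as 10:1 should be read as typical rather than guaranteed.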
