ClickHouse Basic Principles
Introduction
ClickHouse is an open-source columnar database oriented to online analysis and processing. It is independent of the Hadoop big data system and features compression rate and fast query performance. In addition, ClickHouse supports SQL query and provides good query performance, especially the aggregation analysis and query performance based on large and wide tables. The query speed is one order of magnitude faster than that of other analytical databases.
The core functions of ClickHouse are as follows:
Comprehensive DBMS functions
- Data Definition Language (DDL): allows databases, tables, and views to be dynamically created, modified, or deleted without restarting services.
- Data Manipulation Language (DML): allows data to be queried, inserted, modified, or deleted dynamically.
- Permission control: supports user-based database or table operation permission settings to ensure data security.
- Data backup and restoration: supports data backup, export, import, and restoration to meet the requirements of the production environment.
- Distributed management: provides the cluster mode to automatically manage multiple database nodes.
Column-based storage and data compression
ClickHouse is a database that uses column-based storage. Data is organized by column. Data in the same column is stored together, and data in different columns is stored in different files.
During data query, columnar storage can reduce the data scanning range and data transmission size, thereby improving data query efficiency.
In a traditional row-based database system, data is stored in the sequence in Table 1:
row |
ID |
Flag |
Name |
Event |
Time |
---|---|---|---|---|---|
0 |
12345678901 |
0 |
name1 |
1 |
2020/1/11 15:19 |
1 |
32345678901 |
1 |
name2 |
1 |
2020/5/12 18:10 |
2 |
42345678901 |
1 |
name3 |
1 |
2020/6/13 17:38 |
N |
... |
... |
... |
... |
... |
In a row-based database, data in the same row is physically stored together. In a column-based database system, data is stored in the sequence in Table 2:
row: |
0 |
1 |
2 |
N |
---|---|---|---|---|
ID: |
12345678901 |
32345678901 |
42345678901 |
... |
Flag: |
0 |
1 |
1 |
... |
Name: |
name1 |
name2 |
name3 |
... |
Event: |
1 |
1 |
1 |
... |
Time: |
2020/1/11 15:19 |
2020/5/12 18:10 |
2020/6/13 17:38 |
... |
This example shows only the arrangement of data in a columnar database. Columnar databases store data in the same column together and data in different columns separately. Columnar databases are more suitable for online analytical processing (OLAP) scenarios.
Vectorized executor
ClickHouse uses CPU's Single Instruction Multiple Data (SIMD) to implement vectorized execution. SIMD is an implementation mode that uses a single instruction to operate multiple pieces of data and improves performance with data parallelism (other methods include instruction-level parallelism and thread-level parallelism). The principle of SIMD is to implement parallel data operations at the CPU register level.
Relational model and SQL query
ClickHouse uses SQL as the query language and provides standard SQL query APIs for existing third-party analysis visualization systems to easily integrate with ClickHouse.
In addition, ClickHouse uses a relational model. Therefore, the cost of migrating the system built on a traditional relational database or data warehouse to ClickHouse is lower.
Data sharding and distributed query
The ClickHouse cluster consists of one or more shards, and each shard corresponds to one ClickHouse service node. The maximum number of shards depends on the number of nodes (one shard corresponds to only one service node).
ClickHouse introduces the concepts of local table and distributed table. A local table is equivalent to a data shard. A distributed table itself does not store any data. It is an access proxy of the local table and functions as the sharding middleware. With the help of distributed tables, multiple data shards can be accessed by using the proxy, thereby implementing distributed query.
ClickHouse Applications
ClickHouse is short for Click Stream and Data Warehouse. It is initially applied to a web traffic analysis tool to analyze data warehouses based on page click event flows. Currently, ClickHouse is widely used in Internet advertising, app and web traffic analysis, telecommunications, finance, and Internet of Things (IoT) fields. It is applicable to business intelligence application scenarios and has a large number of applications and practices worldwide. For details, visit https://clickhouse.tech/docs/en/introduction/adopters/.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot