Help Center/ MapReduce Service/ Component Development Specifications/ ClickHouse/ ClickHouse Application Development Suggestions

ClickHouse Application Development Suggestions

Updated on 2024-08-30 GMT+08:00

Properly configure the maximum number of concurrent queries

ClickHouse achieves its high processing speed through parallel processing: even a single query uses half of the server's CPU cores by default. As a result, ClickHouse is not suited to high-concurrency query scenarios. The default maximum number of concurrent queries is 100. You can adjust this limit as needed, but keep it at or below 200.
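As a sketch, the concurrency cap corresponds to the `max_concurrent_queries` server setting. Assuming a standard config.xml layout (on MRS, server settings are normally changed through the cluster management console rather than by editing files directly):

```xml
<!-- config.xml fragment (illustrative): raise the cap from the default 100,
     staying at or below the recommended maximum of 200 -->
<max_concurrent_queries>150</max_concurrent_queries>
```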

Deploy a load balancing component and route queries through it to prevent performance degradation caused by single-point query pressure

ClickHouse allows queries against any node in the cluster. If all queries go to a single node, that node may be overloaded and reliability suffers. You are advised to use ClickHouseBalancer or another load balancing service to distribute the query load and improve reliability.

Properly set the partition key, ensure that the number of partitions is less than 1000, and use the integer type for the partition field

  1. You are advised to use toYYYYMMDD(pt_d) as the partition key, where the table field pt_d is of the Date type.
  2. If hourly partitioning is required in the service scenario, use toYYYYMMDD(pt_d) and pt_h together as a composite partition key, where the table field pt_h is an integer hour value.
  3. If data needs to be stored for many years, create partitions by month instead, for example, toYYYYMM(pt_d).
  4. Properly control the number of parts based on factors such as the data partition granularity, the volume of data submitted in each batch, and the data retention period.
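The three partitioning choices above can be sketched as the following clauses (the column names pt_d and pt_h follow the examples in this section):

```sql
-- Daily partitions (typical case)
PARTITION BY toYYYYMMDD(pt_d)

-- Hourly partitions: composite key of day plus an integer hour column pt_h
PARTITION BY (toYYYYMMDD(pt_d), pt_h)

-- Monthly partitions for long retention periods
PARTITION BY toYYYYMM(pt_d)
```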

Use the most frequently queried and most selective fields as the primary key, sorted in descending order of access frequency and ascending order of dimension cardinality

Data is sorted and stored by primary key, so queries can quickly filter on primary key fields. Setting the primary key properly at table creation greatly reduces the amount of data read and improves query performance. For example, if every analysis query specifies a service ID, use the service ID field as the first field of the primary key.
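For example, a table whose queries always filter on a service ID might place that field first in the sorting key (the table and column names below are illustrative, not from this document):

```sql
CREATE TABLE demo_local.service_events
(
    `service_id` Int32,    -- filtered in every query: leads the sorting key
    `event_time` DateTime,
    `metric`     Float64
)
ENGINE = MergeTree
ORDER BY (service_id, event_time);
```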

Properly set the sparse index granularity based on service scenarios

The primary key index of ClickHouse is a sparse index. The default index granularity is 8192 rows, that is, one index entry (mark) is created for every 8192 rows of data.

Suggestions:

  1. The smaller the index granularity, the more effective queries over small ranges are, avoiding wasted scan work.
  2. The larger the index granularity, the smaller the index file and the faster it can be processed.
  3. If the table contains more than 1 billion rows, set the index granularity to 16384. Otherwise, set it to 8192 or a smaller value.
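The trade-off can be illustrated with a minimal sketch of how a sparse index narrows a scan. This is a simplification of ClickHouse's mark-based lookup; the function and the small granularity used in the example are illustrative only:

```python
import bisect

def granule_range(sorted_keys, lo, hi, granularity=8192):
    """Return the (start, end) row range that must be scanned for a
    primary-key range [lo, hi].

    marks[i] holds the first key of granule i -- one index entry per
    `granularity` rows, mirroring a sparse primary index.
    """
    marks = sorted_keys[::granularity]
    first = max(bisect.bisect_right(marks, lo) - 1, 0)  # granule containing lo
    last = bisect.bisect_right(marks, hi) - 1           # granule containing hi
    if last < first:  # hi is below the first key: nothing to scan
        return (0, 0)
    return (first * granularity, min((last + 1) * granularity, len(sorted_keys)))
```

A smaller granularity yields a tighter row range for small lookups but a larger index (more marks); a larger granularity does the opposite, which is exactly the trade-off described in the suggestions above.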

Local Table Creation Reference

Reference:

CREATE TABLE mybase_local.mytable
(
    `did` Int32,
    `app_id` Int32,
    `region` Int32,
    `pt_d` Date
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/mybase_local/mytable', '{replica}')
PARTITION BY toYYYYMMDD(pt_d)
ORDER BY (app_id, region)
SETTINGS index_granularity = 8192, use_minimalistic_part_header_in_zookeeper = 1;

Instructions:

  1. Select a table engine:

    ReplicatedMergeTree: MergeTree engine that supports the replica feature. It is the most commonly used engine.

  2. Table information registration path on ZooKeeper, which is used to distinguish different configurations in the cluster:

    /clickhouse/tables/{shard}/{databaseName}/{tableName}: {shard} indicates the shard name, {databaseName} indicates the database name, and {tableName} indicates the replicated table name.

  3. ORDER BY primary key fields:

    The most frequently queried and most selective fields are used as the primary key. Fields are sorted in descending order of access frequency and ascending order of dimension cardinality. It is recommended that the number of sorting fields be less than or equal to 4; otherwise, the merge pressure is high. A sorting field cannot be null; if it contains null values, the data must be converted first.

  4. PARTITION BY field:

    The partition key cannot be null. If the field contains null values, the data must be converted first.

  5. Table-level parameter configuration:

    index_granularity: sparse index granularity. The default value is 8192.

    use_minimalistic_part_header_in_zookeeper: whether to enable the new, more compact format for storing part headers in ZooKeeper.

  6. For details about how to create a table, visit https://clickhouse.tech/docs/en/engines/table-engines/mergetree-family/mergetree/.

Distributed Table Creation Reference

Reference:
CREATE TABLE mybase.mytable AS mybase_local.mytable
ENGINE = Distributed(cluster_3shards_2replicas, mybase_local, mytable, rand());

Instructions:

  1. Name of the distributed table: mybase.mytable.
  2. Name of the local table: mybase_local.mytable.
  3. Use AS to associate the distributed table with the local table to ensure that the field definitions of the distributed table are the same as those of the local table.
  4. Parameter description of the distributed table engine:

    cluster_3shards_2replicas: name of a logical cluster.

    mybase_local: name of the database where the local table is located.

    mytable: local table name.

    rand(): (optional) sharding key, which can be a raw column of the table (such as did) or the result of a function call, such as rand(). Ensure that data is evenly distributed by this key. A common alternative is to hash a high-cardinality column, for example, intHash64(user_id).
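For example, the hash-based sharding key mentioned above would look like this in the distributed table definition (the table name and user_id column are illustrative; the cluster and database names follow the earlier example):

```sql
CREATE TABLE mybase.user_events AS mybase_local.user_events
ENGINE = Distributed(cluster_3shards_2replicas, mybase_local, user_events, intHash64(user_id));
```

Hashing a high-cardinality column keeps the distribution even while guaranteeing that all rows with the same user_id land on the same shard, which matters for local joins and aggregations by that key.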

Select the smallest data type that meets the requirements for each field in the service scenario table

Numeric types include UInt8/UInt16/UInt32/UInt64, Int8/Int16/Int32/Int64, and Float32/Float64. Performance varies with width: the narrower the type, the better the performance.
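As an illustration (the table and column semantics below are hypothetical), choose the narrowest type that can hold each field's value range:

```sql
CREATE TABLE demo_local.sessions
(
    `age`        UInt8,   -- 0-255 is sufficient for an age
    `status`     UInt8,   -- small enumerated code
    `duration_s` UInt32,  -- seconds, comfortably below 2^32
    `user_id`    UInt64   -- large identifier space
)
ENGINE = MergeTree
ORDER BY user_id;
```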

Perform data analysis on large wide tables; avoid joining large tables, and convert distributed join queries into local-table joins to improve performance

The performance of ClickHouse distributed joins is poor. You are advised to aggregate data into a wide table during data modeling and then import it into ClickHouse. Where joins are unavoidable, convert distributed joins into joins on local tables: this eliminates large data transfers between nodes and reduces the volume of data each local join must process. The service layer then summarizes the local join results from all shards. This improves performance remarkably.
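The rewrite can be sketched as follows (table names are hypothetical, and the local-join form assumes both tables are sharded by the join key did, so that matching rows are co-located on each shard):

```sql
-- Distributed join (avoid): the right-hand distributed table forces
-- large data transfers between nodes
SELECT o.did, count()
FROM mybase.orders AS o
INNER JOIN mybase.users AS u ON o.did = u.did
GROUP BY o.did;

-- Local join (preferred): the right-hand side is the local table, so each
-- shard joins only its own data and only partial aggregates are merged
SELECT o.did, count()
FROM mybase.orders AS o
INNER JOIN mybase_local.users AS u ON o.did = u.did
GROUP BY o.did;
```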

Properly set the part size

The min_bytes_to_rebalance_partition_over_jbod parameter indicates the minimum size of the part involved in automatic balancing and distribution among disks in a JBOD array. The value must be appropriately set.

If the value is smaller than max_bytes_to_merge_at_max_space_in_pool/1024, the ClickHouse server process fails to start; this guard prevents unnecessary part movement between disks.

If the value of min_bytes_to_rebalance_partition_over_jbod is greater than that of max_data_part_size_bytes (maximum size of parts that can be stored on disks in one array), no part can meet the condition for automatic balancing.
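The two constraints above can be expressed as a small sanity check. This is a sketch, not ClickHouse's actual validation code: the parameter names mirror the settings discussed here, and the 150 GiB default for max_bytes_to_merge_at_max_space_in_pool is an assumption based on the upstream default:

```python
GIB = 1024 ** 3

def rebalance_setting_ok(min_bytes_to_rebalance,
                         max_bytes_to_merge_at_max_space_in_pool=150 * GIB,
                         max_data_part_size_bytes=None):
    """Check whether min_bytes_to_rebalance_partition_over_jbod is usable."""
    lower_bound = max_bytes_to_merge_at_max_space_in_pool // 1024
    if min_bytes_to_rebalance < lower_bound:
        # Below this bound the server refuses to start, preventing
        # unnecessary part movement between disks.
        return False
    if (max_data_part_size_bytes is not None
            and min_bytes_to_rebalance > max_data_part_size_bytes):
        # No part could ever be both storable on a disk and large enough
        # to qualify for rebalancing.
        return False
    return True
```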
