Elasticsearch Cluster Planning Suggestions_Cloud Search Service

Planning Cluster AZs

By deploying a CSS cluster across multiple AZs, you can increase the cluster's availability, lower the likelihood of data loss, and minimize service downtime. You can select two or three different AZs in the same region to deploy a cluster.

If you select two or three AZs when creating a cluster, CSS automatically enables cross-AZ HA, ensuring cluster nodes distribute evenly across these AZs. For details about the distribution of nodes in different quantities across AZs, see Table 1.

When creating a multi-AZ cluster, ensure that the number of selected nodes of any type is no less than the number of AZs. Otherwise, multi-AZ cluster deployment will fail.
When a multi-AZ cluster is deployed, nodes of all types are evenly distributed across different AZs. The difference between node quantities in different AZs does not exceed 1.
If the number of data nodes plus cold data nodes in a cluster is not an integer multiple of the number of AZs, data in the cluster may be unevenly distributed, affecting data query or write performance.

**Table 1** Node quantities and AZ distribution
Node Quantity	Single AZ	Two AZs		Three AZs
Node Quantity	AZ1	AZ1	AZ2	AZ1	AZ2	AZ3
1	1	Not supported		Not supported
2	2	1	1	Not supported
3	3	2	1	1	1	1
4	4	2	2	2	1	1
...	...	...	...	...	...	...

In the case of a multi-AZ deployment, configure the number of replicas in a manner that can better capitalize on the high availability that comes with such as deployment.

In a two-AZ deployment, if one AZ becomes unavailable, the other AZ continues to provide services. In this case, at least one replica is required. Elasticsearch uses one replica by default. You can retain the default value if you do not require higher read performance.
In the case of a three-AZ deployment, if one AZ becomes unavailable, the other AZs can continue to provide services. In this case, at least one replica is required. Elasticsearch uses one replica by default. If you need more replicas to improve the cluster's ability to handle queries, modify the replica setting to set more replicas.
For example, you can run the following command to set the number of index replicas:

curl -XPUT http://ip:9200/{index_name}/_settings -d '{"number_of_replicas":2}'

Alternatively, run the following command to specify the number of replicas in the index template:

curl -XPUT http://ip:9200/ _template/templatename -d '{ "template": "*", "settings": {"number_of_replicas": 2}}'

where, ip indicates the private IP address of the cluster, index_name indicates the index name, and number_of_replicas indicates the number of index replicas to change to. In this example, the number of index replicas is changed to 2.

Table 2 describes the service outage patterns for different variations of a multi-AZ deployment facing the failure of a single AZ.

**Table 2** Possible service outage patterns in the face of the failure of a single AZ
Number of AZs	Number of Master Nodes	Service Outage Patterns and Handling Suggestions
2	0	If the number of nodes is an even number: If half of the data nodes are faulty, you need to replace one node in the faulty AZ before a master node can be selected. If the number of nodes is an odd number: If the faulty AZ contains one more node than the normal AZ, you need to replace one node in the faulty AZ before a master node can be selected. For how to replace nodes, contact technical support. If the faulty AZ contains one less node than the normal AZ, services will not be interrupted and a master node can be selected.
2	3	There is a 50% chance of service interruption. When two dedicated master nodes are allocated to one AZ and another master node is allocated to the other AZ: If service interruption happens in the AZ with one master node, a master node can be selected from the AZ that has two dedicated master nodes. If service interruption happens in the AZ with two dedicated master nodes, master nodes cannot be selected because the remaining AZ has only one dedicated master node. In this case, services will be interrupted and you need to contact technical support.
3	0	If you have three AZs and four nodes, one AZ will have two nodes, and the other two will each have one node. If the AZ with two nodes becomes faulty, services will be interrupted. Therefore, you are advised not to configure four nodes when selecting three AZs. There is a small chance of service interruption if you follow this advice.
3	3	Service interruption does not occur.

You can switch AZs for an existing cluster. For details, see Switching AZs for an Elasticsearch Cluster.

You can Add AZ or Migrate AZ.

Add AZ: Add one or two AZs to a single-AZ cluster, or add an AZ to a dual-AZ cluster to improve cluster availability.
Migrate AZ: Completely migrate data from the current AZ to another AZ that has sufficient resources.

Planning the Cluster Version

When selecting an Elasticsearch cluster version, consider factors such as service requirements, available features, performance, security updates, and long-term support, ensuring that the selected version can meet both current and future needs and provide a stable, secure environment for your data.

If you are deploying the Elasticsearch clusters of CSS for the first time, you are advised to use the latest version.
If you are migrating an in-house built or third-party Elasticsearch cluster to CSS without changing the cluster, keep the version of the source cluster.
If you are migrating an in-house built or third-party Elasticsearch cluster to CSS while recoding it, choose Elasticsearch 7.10.2 or 7.6.2.

**Table 3** Cluster version support
Feature	Elasticsearch 7.6.2	Elasticsearch 7.10.2	Details
Vector search	√	√	Configuring Vector Search for Elasticsearch Clusters
Storage-compute decoupling	√	√	Configuring Storage-Compute Decoupling for an Elasticsearch Cluster
Flow Control 2.0	√	√	Configuring Flow Control 2.0 for an Elasticsearch Cluster
Flow Control 1.0	√	√	Configuring Flow Control 1.0 for an Elasticsearch Cluster
Large query isolation	√	√	Configuring Large Query Isolation for an Elasticsearch Cluster
Enhanced aggregation	x	√	Configuring Enhanced Aggregation for an Elasticsearch Cluster
Read/write splitting	√	√	Configuring Read/Write Splitting Between Two Elasticsearch Clusters
Switchover between hot and cold storage	√	√	Switching Between Hot and Cold Storage for an Elasticsearch Cluster
Index recycle bin	x	√	Configuring an Index Recycle Bin for an Elasticsearch Cluster
Enhanced import performance	x	√	Enhancing the Data Import Performance of Elasticsearch Clusters
Enhanced cluster kernel monitoring	√	√	Configuring Kernel Monitoring for an Elasticsearch Cluster
Index monitoring	√	√	Configuring Index Monitoring for an Elasticsearch Cluster

Planning Node Types

For an Elasticsearch cluster, the proper planning of different types of nodes is critical to optimizing performance and resource utilization. Before creating a cluster, determine the types of nodes to use based on service requirements, query load, data growth patterns, and performance goals. Table 4 describes the characteristics of different node types and the purposes they are suited for.

If no master or client nodes were enabled when a cluster was created, you can add them if data nodes become overloaded later at some point. For details, see Adding Master or Client Nodes.
If no cold data nodes were enabled during cluster creation, they cannot be added later, so you have to determine whether to use cold data nodes while creating a cluster.

**Table 4** Characteristics and purposes of different types of nodes
Node Type	Node Description	Characteristics
Data node (ESS)	Data nodes are used to store data. In a cluster that has neither master nor client nodes, data nodes provide the functions of both types of nodes.	Data nodes are mandatory for any cluster. If Master node and Client node are both unselected, data nodes will be used for all of the following purposes: cluster management, data storage, cluster access, and data analysis. To ensure reliability, a cluster should have a least three nodes. If Master node is selected but Client node is not, data nodes will be used for data storage, cluster access, and data analysis. If Master node is unselected but Client node is selected, data nodes will be used for data storage and cluster management. If Master node and Client node are both selected, data nodes will be used for data storage only.
Master node (ess-master)	The master node is responsible for cluster management, such as metadata management, index creation and deletion, and shard allocation. It plays a critical role in metadata management, node management, stability guarantee, and cluster operation control for large-scale clusters.	Large-scale cluster: For a cluster that has more than 16 nodes, you are advised to add dedicated master nodes to effectively manage the cluster status and metadata. Large quantities of indexes and shards: If the number of indexes or shards exceeds 10,000, a master node will have better performance in handling complex cluster management tasks, avoiding impact on the performance of data nodes. Better management of cluster nodes: The master node maintains the cluster metadata, including index mapping, settings, and aliases. For a complex cluster structure, a dedicated master node offers better management, including node joining, exiting, and fault detection. The master node plays a critical role in cluster node management. Improved cluster stability and reliability: A dedicated master node improves cluster stability and reliability by taking over cluster management responsibilities from data storage and query nodes. Optimized performance for data nodes: By offloading cluster management tasks from data nodes to master nodes, you can allow data nodes to focus on data processing, which leads to improved performance.
Client node (ess-client)	Client nodes receive and coordinate external requests, such as search and write requests. They play an important role in handling high-load queries, complex aggregations, managing a large number of shards, and improving cluster scalability.	High QPS: In the face of a high queries per second (QPS), a dedicated client node can evenly distribute query requests, reducing the load of data nodes and improving the overall query performance. Complex aggregation queries: For complex, compute-intensive aggregation queries, a client node can dedicate to the handling of aggregation results, thus improving the efficiency and response speed of such queries. Large number of shards: In a cluster with a large number of shards, a client node can effectively coordinate and manage query requests to each shard, improving efficiency in request forwarding and processing. Reducing the load of data nodes: A client node parses search requests, determines the locations of index shards, and coordinates different nodes to execute searches. This reduces the load of data nodes by allowing them to focus on data storage and indexing. Improved cluster scalability: The use of client nodes allows for better cluster scalability and flexibility, enabling supporting for large datasets and more complex query requirements.
Cold data node (ess-cold)	Cold data nodes are used to store query latency-insensitive data in large quantities. They offer an effective way to manage large datasets and cut storage costs.	Storage of historical data in large quantities: Cold data nodes offer a more cost-effective solution for storing large quantities of historical data that are infrequently accessed but useful for analytical purposes. Optimizing hot data performance: By migrating cold data to cold data nodes, you reduce the storage load of hot data nodes, thereby optimizing their query and write performance. Insensitivity to query latency: Cold data nodes are a better option for storing data that is insensitive to a high query latency. Cost-effectiveness: Cold data nodes usually use large disks that offer inexpensive storage.

Planning Node Storage

Planning node models

CSS supports various ECS models suited for different application needs. Select the appropriate models based on service requirements and performance expectations to achieve a perfect balance between storage performance and costs.

**Table 5** Different node models and the intended application scenarios
Node Model	Disk Type	Specifications Description	Recommended Scenario
Computing-intensive	Cloud drive	vCPUs:Memory = 1:2	Small-volume searches (less than 100 GB on a single node).
General computing	Cloud drive	vCPUs:Memory = 1:4	Medium-scale e-commerce site search, social search, and log search, search and analysis where the data volume on a single node is in the range 100 GB to 1,000 GB.
Memory-optimized	Cloud drive	vCPUs:Memory = 1:8	Search and analysis where the data volume on a single node is in the range 100 GB to 2,000 GB. This type of node is a good option for vector search, as its large memory helps improve cluster performance and stability.
Disk-intensive	Local disk	Attached HDDs	Cold data storage, such as logs. Such data may need to be updated from time to time, and does not require a high query performance.
Ultra-high I/O (CPU architecture: Kunpeng)	Local disk	Attached SDDs	Large-scale log storage (hot data).
Ultra-High I/O (CPU architecture: x86)	Local disk	Attached SDDs	Large-scale search and analysis: High computing or disk I/O performance is required. Other use cases include public opinion analysis, patent search, and database acceleration.

Planning node specifications
Given the expected data handling capacities, it is always preferable to use a smaller number of nodes with larger specifications rather than a larger number of nodes with smaller specifications. For example, a cluster consisting of three nodes each with 32 CPU cores and 64 GB memory is usually better than a cluster consisting of 12 nodes each with 8 CPU cores and 16 GB memory in terms of stability and scalability.

The specific advantages are as follows:
- Cluster stability: High-specs nodes provide more powerful data processing capabilities and larger memory space, leading to higher overall cluster stability.
- Improved scalability: When a cluster consisting of high-specs nodes encounters a performance bottleneck, you simply add more of these high-specs nodes. This is easier than increasing the specifications of existing nodes.
- Easier maintenance: A smaller number of nodes means easier maintenance and less complex management.
In contrast, when a cluster consisting of low-specs nodes needs extra capacity, usually a vertical scale-up is performed, meaning to increase the specifications of existing nodes. This may entail not only more complex, challenging migration and upgrade processes, but also additional maintenance costs.

To sum up, when planning a cluster, you must fully consider performance, costs, maintenance, and scalability, and choose the node specifications that best suit your needs.
Planning storage capacity
When planning the storage capacity of a CSS cluster, consider the following factors: the original data size, number of data replicas, data bloat rate, and disk usage The following is a recommended formula for determining the needed cluster storage capacity.

Storage capacity = Original data size x (1 + Number of replicas) x (1 + Data bloat rate) x (1 + Ratio of reserved space)
- Original data size: Determine the size of the original data that needs to be stored.
- Number of replicas: The default value is 1.
- Data bloat rate: Extra data may be generated due to data indexing. Generally, you are advised to use a 25% data bloat rate.
- Disk usage: Considering the space occupied by the operating system and file system and the space reserved for optimized disk performance and redundancy, you are advised to keep the disk usage under 70%. That is, you need to reserve 30% of the total disk capacity.
A recommended formula is as follows: Cluster storage capacity = Original data size x 2 x 1.25 x 1.3

To put it simply, if the original data size is known, the total storage capacity of the cluster needs to be 3.25 times that. This formula is for quick reference only. You still need to adjust it based on the actual applications and projected data growth rate.

Planning the Node Quantity

Plan the node quantity based on performance requirements and predicted load. Table 6 provides a method for calculating the appropriate number of nodes. Following this method helps you ensure cluster performance and stability.

**Table 6** Calculating the number of cluster nodes
Node	Performance Baseline	Formula	Example
Write node	For a node that uses cloud disks, the write performance baseline of a single vCPU is 1 MB/s. For an ultra-high I/O node, the write performance baseline of a single vCPU is 1.5 MB/s.	Number of write nodes = Peak traffic/Number of vCPUs per node/Write throughput per vCPU x Number of replicas	If the peak inbound traffic is 100 MB/s and a node has 16 vCPUs and 64 GB memory, 12 nodes (100/16/1 x 2) are needed.
Query node	It is difficult to evaluate the performance baseline of a single node out of the context of specific application scenarios. The average query response time (in seconds) is used here to measure the query performance baseline.	Number of query nodes = QPS/(Number of vCPUs per node x 3/2/Average query response time in seconds) x Number of shards	If the query QPS is 1000, the average query response time is 100 ms (0.1s), three index shards are planned, and a node has 16 vCPUs and 64 GB memory, ~12 nodes (1000/(16 x 3/2/0.1) x 3) are needed.
Total number of nodes	N/A	Total number of nodes = Number of write nodes + Number of query nodes	Total number of nodes = Number of write nodes + Number of query nodes = 24
NOTE: Here, the total number of nodes refer to the number of data nodes plus that of cold data nodes.

In each cluster, the number of nodes supported by each node type varies, depending on the types of nodes used in that cluster. For details, see Table 7.

**Table 7** Number of nodes of different types allowed in a single cluster
Node Type	Node Quantity
ess	ess: 1-32
ess, ess-master	ess: 1-200 ess-master: an odd number ranging from 3 to 9
ess, ess-client	ess: 1-32 ess-client: 1-32
ess, ess-cold	ess: 1-32 ess-cold: 1-32
ess, ess-master, ess-client	ess: 1-200 ess-master: an odd number ranging from 3 to 9 ess-client: 1-32
ess, ess-master, ess-cold	ess: 1-200 ess-master: an odd number ranging from 3 to 9 ess-cold: 1-32
ess, ess-client, ess-cold	ess: 1-32 ess-client: 1-32 ess-cold: 1-32
ess, ess-master, ess-client, ess-cold	ess: 1-200 ess-master: an odd number ranging from 3 to 9 ess-client: 1-32 ess-cold: 1-32
NOTE: ess: data node, which is the default node type that is mandatory for cluster creation. The other three node types are optional. ess-master: master node ess-client: client node ess-cold: cold data node

Planning VPCs and Subnets

There are two types of VPCs: shared and non-shared.

Compared with a non-shared VPC, a shared VPC has the following advantages:

You can create resources in a VPC under one account and share the resources with other accounts. This way, the other accounts do not need to resources. Fewer resources and simplified network architecture improves management efficiency and reduces costs.
If there are VPCs under different accounts, VPC peering connections will be needed to connect these VPCs. With VPC sharing, different accounts can create resources within one VPC. This eliminates the need to create VPC peering connections and simplifies the network structure.
Resources can be centrally managed in one account, which helps enterprises configure security policies in a centralized manner and better monitor and audit resource usage for higher security.

If you choose to create a cluster using a shared VPC, make sure the VPC has already been created. For details, see Table 8. For details about constraints and procedures for using a shared VPC, see VPC Sharing.

**Table 8** Procedure for sharing a subnet
Method	Description	Operation Guide
Method A:	On the RAM console, the owner creates a resource share. Select a subnet to be shared. Select permissions to grant to principals on the shared subnet. To create a CSS cluster in a shared VPC, the default vpc subnet statement permission is required. Specify principals that will have access to the shared subnet. On the RAM console, principals accept or reject the resource share. If principals accept the resource share, they can use the shared subnet. If principals do not want to use the shared subnet, they can leave the resource share. If principals reject the resource share, they cannot use the subnet.	Creating a Resource Share Responding to a Resource Sharing Invitation Leaving a Resource Share
Method B:	On the RAM console, the owner creates a resource share. Select a subnet to be shared. Select permissions to grant to principals on the shared subnet. To create a CSS cluster in a shared VPC, the default vpc subnet statement permission is required. Specify principals that will have access to the shared subnet. On the VPC console, the owner creates a subnet and adds it to a resource share created earlier. On the RAM console, principals accept or reject the resource share. If principals accept the resource share, they can use the shared subnet. If principals do not want to use the shared subnet, they can leave the resource share. If principals reject the resource share, they cannot use the subnet.	Creating a Resource Share Sharing a Subnet with Other Accounts Responding to a Resource Sharing Invitation Leaving a Resource Share

Planning a Cluster's Security Mode

**Table 9** Cluster security modes
Cluster Type		Description	Characteristics
Non-security mode cluster	Cluster for which the security mode is disabled	With such a cluster, access to the cluster will require no user authentication, and data will be transmitted in plaintext using HTTP. Make sure the customer is in a secure environment, and do not expose the cluster access interface to the public network.	This type of cluster is mostly used for internal services and testing. Advantage: simple and easy to access. Disadvantage: poor security as anyone can access it.
Security-mode cluster	Cluster in security mode + HTTP	A security-mode cluster requires user authentication. It supports access control and data encryption, and it uses HTTP to transmit data in plaintext. Make sure the customer is in a secure environment, and do not expose the cluster access interface to the public network.	Access control by user permissions is supported. This type of cluster is suitable for workloads that are particularly performance-demanding. Advantage: User authentication improves cluster security. HTTP-based access ensures high performance of the cluster. Disadvantage: The cluster cannot be accessed from the public network.
Security-mode cluster	Cluster in security mode + HTTPS	A security-mode cluster requires user authentication. It supports access control and data encryption, and it uses HTTPS to encrypt communication and enhance data security.	This type of cluster is suitable where there is a high security standard and public network access is required. Advantage: User authentication improves cluster security, and HTTPS-based secure communication allows for secure public network access. Disadvantage: HTTPS encrypts nearly all information sent between server and client, causing a read performance loss of around 20%.

To access a security-mode cluster, you need to provide a username and password. CSS supports authentication for the following two types of users:

Administrator: The default administrator username is admin, and the password is the one specified during cluster creation.
Cluster user: created by the cluster administrator on Kibana. For details, see Creating Users for an Elasticsearch Cluster and Granting Cluster Access.

You can change the security mode of an existing cluster. For details, see Changing the Security Mode of an Elasticsearch Cluster.

You have many options when it comes to changing the security mode of a cluster: from non-security mode to security mode, from security mode to non-security mode, and switching between security modes using different web protocols (HTTP or HTTPS).

Planning the Number of Index Shards

Before importing data to a cluster, carefully consider your service needs and plan the cluster's data structure and distribution in advance. This includes properly designing indexes and deciding on the appropriate number of index shards. To ensure optimal performance and scalability for a cluster, consider following these best practices:

The size of a single shard: Keep the size of each shard between 10 GB and 50 GB. This helps strike a balance between storage efficiency and query performance.
Total number of shards in a cluster: To facilitate management and avoid an excessively large scale, make sure the total number of shards in a cluster is less than 30,000. This helps maintain the stability and responsiveness of the cluster.
Memory-to-shards ratio: Limit the number of shards per 1 GB of memory to 20 to 30. This ensures that each shard has sufficient memory resources to respond to indexing and query operations.
Number of shards per node: To prevent node overload, keep the number of shards on each node under 1000. This helps to improve node stability.
Relationship between the number of index shards and the number of nodes: For each index, make sure the number of shards is the same as or is an integral multiple of the number of nodes in the cluster. This helps improve load balancing and optimize query and indexing performance.

Following these suggestions, you can plan and manage index shards for a CSS cluster more effectively, improving the cluster's overall performance and maintainability.

Elasticsearch Cluster Planning Suggestions