Optimizing the Write Performance of Elasticsearch and OpenSearch Clusters
Before using an Elasticsearch or OpenSearch cluster in CSS, you are advised to optimize the cluster's write performance to improve efficiency.
Data Write Process
Figure 1 shows how a client writes data to an Elasticsearch or OpenSearch cluster. In the figure, P indicates a primary shard, and R indicates a replica shard. Primary and replica shards are distributed across the data nodes, but a primary shard and its replica are never placed on the same node.
- The client sends a data write request to Node1. Here Node1 is the coordinator node.
- Node1 routes the data to shard 2 based on the _id of the data. The primary of shard 2 is on Node3, so the request is forwarded to Node3, where the write operation is performed.
- After the data is written to the primary shard, Node3 forwards the request to the replica shard on Node2. After the data is written to the replica, Node3 reports the write success to the coordinator node, and the coordinator node reports it to the client.
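The routing step above can be illustrated with a minimal sketch. It assumes the simplified rule "hash of the routing value modulo the number of primary shards", where the routing value defaults to the document's _id. Elasticsearch uses a Murmur3 hash internally; `zlib.crc32` is only a stand-in here, and the function name `route_to_shard` is hypothetical, so the shard numbers it produces will not match a real cluster.

```python
import zlib


def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    """Pick a primary shard number for a document from its _id.

    Assumption: zlib.crc32 stands in for the Murmur3 hash that
    Elasticsearch applies to the routing value (the _id by default).
    """
    if num_primary_shards <= 0:
        raise ValueError("num_primary_shards must be positive")
    # The same _id always hashes to the same shard, which is how the
    # coordinator node knows where to forward the write request.
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


if __name__ == "__main__":
    for doc_id in ("order-1001", "order-1002", "order-1003"):
        print(doc_id, "-> shard", route_to_shard(doc_id, 3))
```

Because the shard is derived from the number of primary shards, that number cannot be changed freely after an index is created without re-routing existing documents.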
An index in Elasticsearch consists of one or more shards. Each shard contains multiple segments, and each segment is an inverted index.
As shown in Figure 3, when a document is inserted into Elasticsearch, it is first written to the in-memory buffer and the translog. The buffer is then periodically refreshed into a new segment, at which point the document becomes searchable. The refresh frequency is specified by the refresh_interval parameter. By default, data is refreshed every second. For more information about write performance, see Near Real-Time Search.
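To make the buffer, translog, and segment relationship concrete, here is a toy model of a shard's write path. It is not Elasticsearch code: the class `ToyShard` and its methods are hypothetical names, and the model ignores flushes, merges, and the inverted-index structure. It only shows why a document is not visible to search until a refresh has run.

```python
class ToyShard:
    """Toy model of a shard's write path; not Elasticsearch code."""

    def __init__(self):
        self.buffer = []    # in-memory indexing buffer (not searchable)
        self.translog = []  # log of operations kept for durability
        self.segments = []  # searchable segments, each a list of documents

    def index(self, doc):
        # A write goes to both the buffer and the translog.
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        # A refresh turns the buffered documents into one new segment.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def search(self, predicate):
        # Only documents that have reached a segment are visible.
        return [d for seg in self.segments for d in seg if predicate(d)]


if __name__ == "__main__":
    shard = ToyShard()
    shard.index({"user": "alice"})
    print("before refresh:", shard.search(lambda d: True))
    shard.refresh()
    print("after refresh:", shard.search(lambda d: True))
```

In this model, every refresh creates a segment, so refreshing less often during heavy writes means fewer, larger segments and less merge work later.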
Improving Write Performance
The following solutions can be used to improve the performance of the Elasticsearch data write process:
Solution |
Description |
---|---|
Use SSDs or improve cluster configurations. |
Using SSDs can greatly speed up data write and merge operations. For CSS, you are advised to select the ultra-high I/O storage or ultra-high I/O servers. |
Use Bulk APIs. |
The client writes data in batches. You are advised to write 1 MB to 10 MB of data in each batch. |
Randomly generate _id. |
If an _id is specified in a write request, a lookup for that _id is triggered before the data is written, which slows down writes. In scenarios where data does not need to be retrieved using _id, you are advised to use a randomly generated _id, that is, to omit _id from the request so that it is generated automatically. |
Set a proper number of shards. |
You are advised to set the number of shards to a multiple of the number of cluster data nodes. Ensure each shard is smaller than 50 GB. |
Disable replicas. |
If data writes and queries are performed at different times, for example when data is written during off-peak hours, disable replicas during writing and enable them again afterwards. The command for disabling replicas in Elasticsearch 7.x is as follows: PUT {index}/_settings
{
"number_of_replicas": 0
} |
Adjust the index refresh frequency. |
During batch data writing, you can set refresh_interval to a larger value or to -1 (which disables refresh) to improve write performance by reducing the number of refreshes. In Elasticsearch 7.x, run the following command to set the refresh interval to 15s: PUT {index}/_settings
{
"refresh_interval": "15s"
} |
Change the number of write threads and the size of the write queue. |
Increase the number of write threads and the size of the write queue so that the cluster can absorb unexpected traffic peaks. Otherwise, error code 429 may be returned during such peaks. In Elasticsearch 7.x, you can modify the following parameters to optimize write performance: thread_pool.write.size and thread_pool.write.queue_size |
Set a proper field type. |
Specify the type of each field in the index mapping. Otherwise, Elasticsearch dynamically maps string fields as both text and keyword, which unnecessarily increases the data volume. The keyword type is used for keyword (exact match) search, and the text type is used for full-text search. For fields that do not need to be indexed, you are advised to set index to false. In Elasticsearch 7.x, run the following command to set index to false for field1: PUT {index}
{
"mappings": {
"properties": {
"field1":{
"type": "text",
"index": false
}
}
}
} |
Optimize the shard balancing policy. |
By default, Elasticsearch uses a load balancing policy based on disk capacity. If there are multiple nodes, especially if some of them are newly added, the shards of an index may be unevenly allocated across the nodes. To avoid this, you can set the index-level parameter routing.allocation.total_shards_per_node to limit how many shards of an index can be allocated to each node. You can set this parameter in an index template, or modify the settings of an existing index. Run the following command to modify the settings of an existing index: PUT {index}/_settings
{
"index": {
"routing.allocation.total_shards_per_node": 2
}
} |
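Two of the recommendations in the table above, batch sizing for the Bulk API and choosing a shard count, lend themselves to small helper functions. The sketch below is illustrative only: `bulk_bodies` and `suggest_shard_count` are hypothetical names, the 5 MB default is simply a value inside the advised 1 MB to 10 MB range, and the index size passed to `suggest_shard_count` is an estimate you must supply yourself. The bodies are built without an _id in the action line, so the _id is generated automatically.

```python
import json
import math


def bulk_bodies(index, docs, max_bytes=5 * 1024 * 1024):
    """Yield newline-delimited JSON bodies for the _bulk API.

    Each body stays within max_bytes, except that a single document
    larger than max_bytes is still emitted on its own.
    """
    action = json.dumps({"index": {"_index": index}})  # no _id given
    lines, size = [], 0
    for doc in docs:
        entry = action + "\n" + json.dumps(doc) + "\n"
        entry_size = len(entry.encode("utf-8"))
        if lines and size + entry_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(entry)
        size += entry_size
    if lines:
        yield "".join(lines)


def suggest_shard_count(index_size_gb, data_nodes, max_shard_gb=50):
    """Smallest multiple of data_nodes that keeps every shard
    strictly smaller than max_shard_gb."""
    if data_nodes <= 0 or index_size_gb < 0 or max_shard_gb <= 0:
        raise ValueError("sizes must be non-negative and counts positive")
    min_shards = math.floor(index_size_gb / max_shard_gb) + 1
    return math.ceil(min_shards / data_nodes) * data_nodes


if __name__ == "__main__":
    docs = ({"seq": i, "msg": "x" * 100} for i in range(1000))
    for body in bulk_bodies("my-index", docs, max_bytes=50_000):
        print("batch of", len(body.encode("utf-8")), "bytes")
    print("shards:", suggest_shard_count(index_size_gb=500, data_nodes=4))
```

Each yielded string can be sent as the request body of `POST _bulk` with the `Content-Type: application/x-ndjson` header. Measuring the encoded byte size, rather than counting documents, keeps batches inside the advised range even when document sizes vary.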