Updated on 2023-03-29 GMT+08:00

Optimizing Write Performance

Before using a CSS cluster, you are advised to optimize the write performance of the cluster to improve efficiency.

Data Write Process

Figure 1 Data write process

The process of writing data from a client to Elasticsearch is as follows:

  1. The client sends a data write request to Node1. In this example, Node1 acts as the coordinator node.
  2. Node1 uses the _id of the data to route it to shard 2. The primary of this shard is on Node3, so Node1 forwards the request to Node3, which performs the write operation.
  3. After the data is written to the primary shard, Node3 forwards the request to the replica shard on Node2. After the data is written to the replica, Node3 reports the write success to the coordinator node, and the coordinator node reports it to the client.
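
The routing decision in step 2 can be sketched as follows. This is a minimal illustration, not the Elasticsearch implementation: Elasticsearch uses its own Murmur3-based hash, and CRC32 is used here only as a deterministic stand-in from the standard library. The function name route_to_shard is hypothetical.

```python
import zlib


def route_to_shard(doc_id: str, num_primary_shards: int) -> int:
    """Map a document _id to a primary shard number (simplified sketch)."""
    if num_primary_shards <= 0:
        raise ValueError("num_primary_shards must be positive")
    # Stand-in hash (assumption): Elasticsearch hashes the routing value
    # with Murmur3; CRC32 is used here only to keep the sketch runnable.
    hashed = zlib.crc32(doc_id.encode("utf-8"))
    # The same _id always maps to the same shard as long as the number
    # of primary shards does not change.
    return hashed % num_primary_shards


shard = route_to_shard("doc-1", 3)
print(f"doc-1 is routed to shard {shard}")
```

Because the shard number depends on the number of primary shards, that number cannot be changed freely after an index is created.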

An index in Elasticsearch consists of one or more shards. Each shard contains multiple segments, and each segment is an inverted index.

Figure 2 Elasticsearch index composition

When a document is inserted into Elasticsearch, it is first written to an in-memory buffer and then periodically refreshed from the buffer into a new segment, at which point it becomes searchable. The refresh frequency is specified by the refresh_interval parameter. By default, data is refreshed every second.
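
The buffer-then-refresh behavior can be pictured with a toy model. The class below is purely illustrative (ToyShard and its methods are hypothetical names, not Elasticsearch APIs) and ignores the translog, merges, and flushes.

```python
from typing import Dict, List


class ToyShard:
    """Toy model of one shard: a write buffer plus searchable segments."""

    def __init__(self) -> None:
        self.buffer: List[Dict] = []          # written but not yet searchable
        self.segments: List[List[Dict]] = []  # each refresh adds one segment

    def index(self, doc: Dict) -> None:
        # A write only appends to the buffer, which is cheap.
        self.buffer.append(doc)

    def refresh(self) -> None:
        # A refresh turns the buffered documents into a new segment.
        # An empty buffer produces no segment.
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = []

    def searchable_count(self) -> int:
        return sum(len(segment) for segment in self.segments)


shard = ToyShard()
shard.index({"msg": "a"})
shard.index({"msg": "b"})
print(shard.searchable_count())  # documents are still in the buffer
shard.refresh()
print(shard.searchable_count(), len(shard.segments))
```

In this model, refreshing less often yields fewer, larger segments, which is the effect a larger refresh_interval aims for.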

Figure 3 Process of inserting a document into Elasticsearch

Improving Write Performance

In the Elasticsearch data write process, the following solutions can be used to improve performance:

Table 1 Improving write performance

No.

Solution

Description

1

Use SSDs or improve cluster configurations.

Using SSDs can greatly speed up data write and merge operations. For CSS, you are advised to select the ultra-high I/O storage or ultra-high I/O servers.

2

Use Bulk APIs.

The client writes data in batches instead of one document per request. You are advised to write 1 MB to 10 MB of data in each batch.
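
A minimal sketch of client-side batching is shown below. It only builds the newline-delimited request bodies expected by the _bulk endpoint and does not send them. The function name build_bulk_batches and the 5 MB default are assumptions chosen to sit inside the recommended 1 MB to 10 MB range.

```python
import json
from typing import Dict, Iterable, Iterator


def build_bulk_batches(
    index: str,
    docs: Iterable[Dict],
    max_bytes: int = 5 * 1024 * 1024,
) -> Iterator[str]:
    """Yield _bulk request bodies, each at most max_bytes where possible."""
    entries = []
    size = 0
    for doc in docs:
        # No _id in the action line, so Elasticsearch generates one.
        action = json.dumps({"index": {"_index": index}})
        entry = action + "\n" + json.dumps(doc) + "\n"
        entry_size = len(entry.encode("utf-8"))
        # Start a new batch when adding this entry would exceed the limit.
        # A single oversized document still goes out in a batch of its own.
        if entries and size + entry_size > max_bytes:
            yield "".join(entries)
            entries, size = [], 0
        entries.append(entry)
        size += entry_size
    if entries:
        yield "".join(entries)


docs = [{"n": i} for i in range(10)]
for body in build_bulk_batches("my-index", docs, max_bytes=100):
    print(len(body.encode("utf-8")), "bytes")
```

Each yielded string can be sent as the body of a POST _bulk request with the Content-Type header set to application/x-ndjson.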

3

Randomly generate _id.

If _id is specified, Elasticsearch checks whether a document with that _id already exists before writing the data, which slows down writes. If you do not need to retrieve data by _id, you are advised to let Elasticsearch generate a random _id automatically.

4

Set a proper number of shards.

You are advised to set the number of shards to a multiple of the number of cluster data nodes. Ensure each shard is smaller than 50 GB.
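
The two rules above can be combined into a small calculation. This is a sketch under the assumption that data is spread evenly across shards; the function name suggest_shard_count is hypothetical.

```python
import math


def suggest_shard_count(
    index_size_gb: float,
    data_nodes: int,
    max_shard_gb: float = 50.0,
) -> int:
    """Smallest multiple of data_nodes that keeps each shard under the limit."""
    if index_size_gb <= 0 or data_nodes <= 0 or max_shard_gb <= 0:
        raise ValueError("all arguments must be positive")
    # floor(...) + 1 keeps every shard strictly smaller than max_shard_gb,
    # even when the index size is an exact multiple of the limit.
    min_shards = math.floor(index_size_gb / max_shard_gb) + 1
    # Round up to the next multiple of the number of data nodes.
    return math.ceil(min_shards / data_nodes) * data_nodes


print(suggest_shard_count(400, 3))
```

For example, a 400 GB index on 3 data nodes needs at least 9 shards to stay under 50 GB per shard, and 9 is already a multiple of 3.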

5

Disable replicas.

This solution applies when data is written and queried at different times, for example, when data is written in batches during off-peak hours. Disable replicas during writing and enable them again afterward.

The command for disabling replicas in Elasticsearch 7.x is as follows:

PUT {index}/_settings
{
  "number_of_replicas": 0
}
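
To make sure replicas are enabled again even if the load fails, the two settings changes can be wrapped around the write job. The sketch below assumes a caller-supplied send(method, path, body) function that issues the HTTP request; send and replicas_disabled are hypothetical names, not part of any client library.

```python
from contextlib import contextmanager
from typing import Callable, Dict, Iterator


@contextmanager
def replicas_disabled(
    send: Callable[[str, str, Dict], None],
    index: str,
    restore_to: int = 1,
) -> Iterator[None]:
    """Set number_of_replicas to 0 for the duration of a bulk load."""
    path = f"/{index}/_settings"
    send("PUT", path, {"number_of_replicas": 0})
    try:
        yield
    finally:
        # Runs even if the load raises, so the index is not left
        # without replicas by accident.
        send("PUT", path, {"number_of_replicas": restore_to})


def print_request(method: str, path: str, body: Dict) -> None:
    # Placeholder transport (assumption): replace with a real HTTP call.
    print(method, path, body)


with replicas_disabled(print_request, "my-index", restore_to=1):
    print("bulk load runs here")
```

While replicas are disabled, the index has no redundant copy, so keep the source data until the replicas have been restored.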

6

Adjust the index refresh frequency.

During batch data writing, you can set refresh_interval to a large value or to -1 (which disables refresh). This improves write performance by reducing the number of refreshes.

In Elasticsearch 7.x, run the following command to set the refresh interval to 15s:

PUT {index}/_settings
{
  "refresh_interval": "15s"
}

7

Change the number of write threads and the size of the write queue.

If the write queue fills up during an unexpected traffic peak, error code 429 may be returned. To reduce the likelihood of this, you can increase the number of write threads and the size of the write queue.

In Elasticsearch 7.x, you can modify the following parameters to optimize write performance: thread_pool.write.size and thread_pool.write.queue_size.
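
On the client side, a 429 response can also be handled by retrying with an increasing delay. The sketch below is an assumption about how a client might do this, not a CSS feature: send is a hypothetical callable that issues one write request and returns its HTTP status code.

```python
import time
from typing import Callable


def send_with_backoff(
    send: Callable[[], int],
    max_retries: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry a request while it returns 429, doubling the delay each time."""
    status = send()
    attempt = 0
    while status == 429 and attempt < max_retries:
        # Delays are 0.5s, 1s, 2s, ... with the default base_delay.
        sleep(base_delay * (2 ** attempt))
        attempt += 1
        status = send()
    # Either a non-429 status, or 429 after the retries are used up.
    return status


responses = iter([429, 429, 200])
status = send_with_backoff(lambda: next(responses), sleep=lambda s: None)
print(status)
```

For bulk requests, rejections can also be reported per item in the response body, so a real client may need to inspect the items and resend only the rejected ones.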

8

Set a proper field type.

Specify the type of each field in the index mapping. Otherwise, Elasticsearch maps string fields as both keyword and text, which unnecessarily increases the data volume. The keyword type is used for keyword search, and the text type is used for full-text search.

For the fields that do not require indexes, you are advised to set index to false.

In Elasticsearch 7.x, run the following command to set index to false for field1:

PUT {index}
{
  "mappings": {
    "properties": {
      "field1":{
        "type": "text",
        "index": false
      }
    }
  }
}

9

Optimize the shard balancing policy.

By default, Elasticsearch uses a load balancing policy based on disk capacity. If there are multiple nodes, especially if some of them are newly added, shards may be unevenly allocated across the nodes. To avoid this, you can set the index-level parameter routing.allocation.total_shards_per_node to limit the number of shards of an index that can be allocated to each node. You can set this parameter in the index template, or modify the setting of an existing index.

Run the following command to modify the setting of an existing index:

PUT {index}/_settings
{
  "index": {
    "routing.allocation.total_shards_per_node": 2
  }
}
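
When choosing the value, note that a limit lower than the number of shards divided by the number of data nodes leaves some shards unassigned. The lower bound can be estimated as follows. The function name min_total_shards_per_node is hypothetical, and the calculation assumes that all data nodes can hold shards of the index.

```python
import math


def min_total_shards_per_node(
    primary_shards: int,
    replicas: int,
    data_nodes: int,
) -> int:
    """Smallest total_shards_per_node that still lets every shard be assigned."""
    if primary_shards <= 0 or replicas < 0 or data_nodes <= 0:
        raise ValueError("invalid shard, replica, or node count")
    # Each primary shard has `replicas` copies in addition to itself.
    total_shards = primary_shards * (1 + replicas)
    return math.ceil(total_shards / data_nodes)


print(min_total_shards_per_node(primary_shards=6, replicas=1, data_nodes=4))
```

You may want to set a value slightly above this lower bound so that shards can still be allocated when a node is temporarily unavailable.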