A Cluster is Unavailable Due to Heavy Load
Symptom
The cluster status is Unavailable. Click the cluster name to go to the cluster basic information page, choose Logs, and click the Log Search tab. The error message "OutOfMemoryError" and alarm "[gc][xxxxx] overhead spent [x.xs] collecting in the last [x.xs]" are displayed.
Possible Causes
The cluster is overloaded due to a large number of queries or stacked tasks. When the heap memory is insufficient, memory cannot be allocated for tasks and garbage collection is triggered frequently. As a result, the Elasticsearch process exits abnormally.
Procedure
If a cluster is overloaded for a long time, data writes and queries may be slow. You are advised to upgrade the node specifications, add nodes, or expand the node storage capacity. For details, see Scaling Out a Cluster.
- Check whether tasks are stacked in the cluster.
- Method 1: On the Dev Tools page of Kibana, run the following commands to check whether tasks are being delayed:
GET /_cat/thread_pool/write?v
GET /_cat/thread_pool/search?v
If the value of queue is not 0, tasks are stacked.
node_name            name  active queue rejected
css-0323-ess-esn-2-1 write      2   200     7662
css-0323-ess-esn-1-1 write      2   188     7660
css-0323-ess-esn-5-1 write      2   200     7350
css-0323-ess-esn-3-1 write      2   196     8000
css-0323-ess-esn-4-1 write      2   189     7753
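To view the backlog of both thread pools in one response and list the busiest nodes first, you can also use the standard h and s parameters of the _cat APIs, for example:
GET /_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected&s=queue:desc
Nodes with the deepest queues are listed first, which helps you identify the nodes that are falling behind.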
- Method 2: In the cluster management list, click More > View Metric in the Operation column of the cluster. On the displayed page, check the total number of queued tasks in the search thread pool and write thread pool. If the number of queued tasks is not 0, tasks are being delayed.
Figure 2 Queued Tasks of Write Thread Pool
- If a large number of tasks are stacked in the cluster, perform the following steps to optimize the cluster:
- On the Logs page of the cluster, if a large number of slow query logs were generated before the node ran out of memory, optimize the query statements based on service requirements.
- On the Logs page of the cluster, if the error message "Inflight circuit break" or "segment can't keep up" is displayed, the circuit breaker may have been triggered by heavy write pressure. If the write rate has increased sharply recently, stagger the write peaks into an appropriate time window based on service requirements. The example below shows how to confirm whether circuit breakers have tripped.
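For example, you can run commands similar to the following on the Dev Tools page of Kibana. The first command shows, per node, how often each circuit breaker has tripped; the second lowers the search slow log warning threshold of an index so that expensive queries are recorded in the slow query log. Here, my_index and the 5s threshold are placeholders to be adapted to your service:
GET _nodes/stats/breaker
PUT my_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s"
}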
- If no task is stacked in the cluster or the cluster is still unavailable after optimization, go to the next step to check whether the cluster is overloaded.
- Check whether the cluster is overloaded.
In the cluster management list, click More > View Metric in the Operation column of the cluster. On the displayed page, view metrics related to the CPU and heap memory, such as the average CPU usage and average JVM heap usage. If the average CPU usage exceeds 80% or the average JVM heap usage exceeds 70%, the cluster is under heavy pressure. A quick command-line check is shown after this step.
Figure 3 Avg. CPU Usage
- If the cluster is overloaded, reduce the client request sending rate or expand the cluster capacity.
- If the cluster is not overloaded, or it is still unavailable after the request sending rate is reduced, go to the next step to check whether a large amount of cache exists in the cluster.
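As a rough alternative to the monitoring page, you can also check these metrics on the Dev Tools page of Kibana using the standard columns of the _cat/nodes API, for example:
GET /_cat/nodes?v&h=name,cpu,heap.percent,load_1m
cpu indicates the recent CPU usage of each node and heap.percent indicates its current JVM heap usage, so the same 80% and 70% thresholds apply.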
- On the Dev Tools page of Kibana, run the following command to check whether the cluster caches are occupying a large amount of memory:
GET /_cat/nodes?v&h=name,queryCacheMemory,fielddataMemory,requestCacheMemory
- If the value of queryCacheMemory, fielddataMemory, or requestCacheMemory in the output exceeds 20% of the node's maximum heap memory, run the POST _cache/clear command to clear the cache. The cached data is generated during queries to speed them up; after it is cleared, query latency may increase.
name                 queryCacheMemory fielddataMemory requestCacheMemory
css-0323-ess-esn-1-1            200mb           1.6gb              200mb
You can run the following command to query the maximum heap memory of each node:
GET _cat/nodes?v&h=name,ip,heapMax
name indicates the node name, ip indicates the IP address of the node, and heapMax indicates the maximum heap memory of the node.
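If only one cache is oversized (for example, fielddataMemory in the sample output above), you can clear just that cache instead of all caches. The clear cache API accepts the fielddata, query, and request parameters, for example:
POST /_cache/clear?fielddata=true
Clearing only the problematic cache limits the increase in query latency described above.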
- If the cluster is still overloaded after the optimization, contact technical support.