A Cluster is Unavailable Due to Heavy Load
Symptom
The cluster status is Unavailable. Click the cluster name to go to the cluster's basic information page, choose Logs, and click the Log Search tab. The logs contain the error message "OutOfMemoryError" and the alarm "[gc][xxxxx] overhead spent [x.xs] collecting in the last [x.xs]".

Possible Causes
A large number of queries or stacked tasks overload the cluster. When heap memory is insufficient, memory cannot be allocated for tasks and garbage collection is triggered frequently. As a result, the Elasticsearch process exits abnormally.
Procedure

If a cluster is overloaded for a long time, data writes and queries may be slow. You are advised to upgrade the node specifications, increase the number of nodes, or expand the storage capacity of nodes. For details, see Scaling Out a Cluster.
- Check whether tasks are stacked in the cluster.
- Method 1: On the Dev Tools page of Kibana, run the following commands to check whether tasks are being delayed:
GET /_cat/thread_pool/write?v
GET /_cat/thread_pool/search?v
If the value of queue is not 0, tasks are stacked.
node_name            name  active queue rejected
css-0323-ess-esn-2-1 write 2      200   7662
css-0323-ess-esn-1-1 write 2      188   7660
css-0323-ess-esn-5-1 write 2      200   7350
css-0323-ess-esn-3-1 write 2      196   8000
css-0323-ess-esn-4-1 write 2      189   7753
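If needed, you can also list the queue and rejection counts of all thread pools in one command. This uses the standard _cat/thread_pool API and is only an optional additional check:
GET /_cat/thread_pool?v&h=node_name,name,active,queue,rejected
A non-zero rejected value means the corresponding thread pool has already started rejecting requests because its queue was full.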
- Method 2: In the cluster list, find the target cluster, and click More > Monitoring Metrics in the Operation column. On the displayed page, check the total number of queued tasks in the search and write thread pools. If the number of queued tasks is not 0, tasks are stacked.
Figure 2 Queued Tasks of Write Thread Pool
- If a large number of tasks are stacked in the cluster, perform the following steps to optimize the cluster:
- Choose Logs > Log Search. If a large number of slow query logs were generated before the node ran out of memory, optimize the queries based on site requirements.
- Choose Logs > Log Search. If the error message "Inflight circuit break" or "segment can't keep up" is found, the circuit breaker may have been triggered by heavy write pressure. If the write rate has increased sharply recently, plan the write peak time window appropriately based on service requirements.
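If circuit breaker errors appear in the logs, you can also check how close each breaker is to its limit. The following command uses the standard node stats API and is shown here only as an optional check:
GET _nodes/stats/breaker
In the response, the breakers section lists each breaker (for example, parent, in_flight_requests, fielddata, and request) together with its limit size, estimated size, and the number of times it has tripped.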
- If no task is stacked in the cluster or the cluster is still unavailable after optimization, go to the next step to check whether the cluster is overloaded.
- Check whether the cluster is overloaded.
In the cluster list, find the target cluster, and click More > Monitoring Metrics in the Operation column. On the displayed page, check metrics related to CPU and heap memory usage, such as average CPU usage and average JVM heap usage. If the average CPU usage exceeds 80% or the average JVM heap usage exceeds 70%, the cluster is under heavy pressure.
Figure 3 Avg. CPU Usage
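If Kibana is still accessible, you can also view per-node CPU and heap usage on the Dev Tools page. The following command uses standard _cat/nodes columns and is provided only as an alternative way to observe the same metrics:
GET _cat/nodes?v&h=name,ip,cpu,heapPercent,load_1m
In the output, cpu is the recent CPU usage of each node as a percentage, and heapPercent is the current JVM heap usage as a percentage.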
- If the cluster is overloaded, reduce the client request sending rate or expand the cluster capacity.
- If the cluster is not overloaded, or it is still unavailable after the request sending rate is reduced, go to the next step to check whether a large amount of cached data exists in the cluster.
- On the Dev Tools page of Kibana, run the following command to check whether a large amount of data is cached in the cluster:
GET /_cat/nodes?v&h=name,queryCacheMemory,fielddataMemory,requestCacheMemory
- If the value of queryCacheMemory, fielddataMemory, or requestCacheMemory in the output exceeds 20% of a node's heap memory, run the POST _cache/clear command to clear the cache (example commands are shown below). Cached data is generated during queries to speed them up, so query latency may increase after the cache is cleared.
name                 queryCacheMemory fielddataMemory requestCacheMemory
css-0323-ess-esn-1-1 200mb            1.6gb           200mb
You can run the following command to query the maximum heap memory of each node:
GET _cat/nodes?v&h=name,ip,heapMax
In the output, name indicates the node name, ip indicates the node's IP address, and heapMax indicates the node's maximum heap memory.
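The cache can be cleared for all indexes at once or for a specific index only. The following commands use the standard clear cache API; my-index is a placeholder index name:
POST /_cache/clear
POST /my-index/_cache/clear?fielddata=true&query=true&request=true
The query parameters in the second command restrict the operation to the fielddata, query, and request caches of that index.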
- If the cluster is still overloaded after the optimization, contact technical support.