A Cluster is Unavailable Due to Heavy Load
Symptom
The cluster status is Unavailable. Click the cluster name to go to the cluster basic information page, choose Logs, and click the Log Search tab. The error message "OutOfMemoryError" and alarm "[gc][xxxxx] overhead spent [x.xs] collecting in the last [x.xs]" are displayed.
Possible Causes
The cluster is overloaded due to a large number of queries or stacked tasks. When the heap memory is insufficient, memory cannot be allocated for tasks and garbage collection is triggered frequently. As a result, the Elasticsearch process exits abnormally.
Procedure
If a cluster is overloaded for a long time, data writes and queries may be slow. You are advised to upgrade the node specifications, add nodes, or expand the node storage capacity. For details, see Scaling Out a Cluster.
- Check whether tasks are stacked in the cluster.
- Method 1: On the Dev Tools page of Kibana, run the following commands to check whether tasks are being delayed:
GET /_cat/thread_pool/write?v
GET /_cat/thread_pool/search?v
If the value of queue is not 0, tasks are stacked.
node_name            name  active queue rejected
css-0323-ess-esn-2-1 write      2   200     7662
css-0323-ess-esn-1-1 write      2   188     7660
css-0323-ess-esn-5-1 write      2   200     7350
css-0323-ess-esn-3-1 write      2   196     8000
css-0323-ess-esn-4-1 write      2   189     7753
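To view the backlog of both thread pools in one response and list the busiest nodes first, you can also use the standard h and s parameters of the _cat APIs, for example:
GET /_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected&s=queue:desc
Nodes with the deepest queues are listed first, which helps you identify the nodes that are falling behind.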
- Method 2: In the cluster management list, click More > View Metric in the Operation column of the cluster. On the displayed page, check the total number of queued tasks in the search thread pool and write thread pool. If the number of queued tasks is not 0, tasks are being delayed.
Figure 2 Queued Tasks of Write Thread Pool
- If a large number of tasks are stacked in the cluster, perform the following steps to optimize the cluster:
- On the Logs page of the cluster, if a large number of slow query logs were generated before the node ran out of memory, optimize the query statements based on service requirements.
- On the Logs page of the cluster, if the error message "Inflight circuit break" or "segment can't keep up" is displayed, the circuit breaker may have been triggered by heavy write pressure. If the write rate has increased sharply recently, stagger the write peaks into an appropriate time window based on service requirements. The example below shows how to confirm whether circuit breakers have tripped.
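For example, you can run commands similar to the following on the Dev Tools page of Kibana. The first command shows, per node, how often each circuit breaker has tripped; the second lowers the search slow log warning threshold of an index so that expensive queries are recorded in the slow query log. Here, my_index and the 5s threshold are placeholders to be adapted to your service:
GET _nodes/stats/breaker
PUT my_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s"
}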
- If no task is stacked in the cluster or the cluster is still unavailable after optimization, go to the next step to check whether the cluster is overloaded.
- Check whether the cluster is overloaded.
In the cluster management list, click More > View Metric in the Operation column of the cluster. On the displayed page, view metrics related to the CPU and heap memory, such as the average CPU usage and average JVM heap usage. If the average CPU usage exceeds 80% or the average JVM heap usage exceeds 70%, the cluster is under heavy pressure. A quick command-line check is shown after this step.
Figure 3 Avg. CPU Usage
- If the cluster is overloaded, reduce the client request sending rate or expand the cluster capacity.
- If the cluster is not overloaded, or it is still unavailable after the request sending rate is reduced, go to the next step to check whether a large amount of cache exists in the cluster.
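As a rough alternative to the monitoring page, you can also check these metrics on the Dev Tools page of Kibana using the standard columns of the _cat/nodes API, for example:
GET /_cat/nodes?v&h=name,cpu,heap.percent,load_1m
cpu indicates the recent CPU usage of each node and heap.percent indicates its current JVM heap usage, so the same 80% and 70% thresholds apply.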
- On the Dev Tools page of Kibana, run the following command to check whether the cluster caches are occupying a large amount of memory:
GET /_cat/nodes?v&h=name,queryCacheMemory,fielddataMemory,requestCacheMemory
- If the value of queryCacheMemory, fielddataMemory, or requestCacheMemory in the output exceeds 20% of the node's maximum heap memory, run the POST _cache/clear command to clear the cache. The cached data is generated during queries to speed them up; after it is cleared, query latency may increase.
name                 queryCacheMemory fielddataMemory requestCacheMemory
css-0323-ess-esn-1-1            200mb           1.6gb              200mb
You can run the following command to query the maximum heap memory of each node:
GET _cat/nodes?v&h=name,ip,heapMax
name indicates the node name, ip indicates the IP address of the node, and heapMax indicates the maximum heap memory of the node.
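If only one cache is oversized (for example, fielddataMemory in the sample output above), you can clear just that cache instead of all caches. The clear cache API accepts the fielddata, query, and request parameters, for example:
POST /_cache/clear?fielddata=true
Clearing only the problematic cache limits the increase in query latency described above.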
- If the cluster is still overloaded after the optimization, contact technical support.