
Client Node Overload

Symptom

Overloaded client nodes in an Elasticsearch/OpenSearch cluster can cause the following issues:

  • Kibana (or OpenSearch Dashboards) and Cerebro become inaccessible.
  • Write and query latency increases significantly.
  • Monitoring data shows high CPU, memory, or JVM usage on the client nodes, and some nodes may disconnect from the cluster.

Possible Causes

  • Unbalanced load: Requests are unevenly distributed to the client nodes.
  • Abnormal or malicious traffic: Sudden traffic spikes from malicious IP addresses or from abnormal services (such as crawlers and unthrottled write scripts) overwhelm the client nodes.
  • Heavy query or write load: Complex queries (unoptimized aggregations are a common example) require the client nodes to merge large result sets from many shards, which can cause CPU bottlenecks. Sustained high-concurrency writes can also overwhelm the client nodes. To confirm this cause, check the thread pool queues and rejections, as shown after this list.
  • Exhaustion of JVM resources: Continuously high JVM memory usage triggers frequent garbage collection (GC).
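
A quick way to confirm heavy query or write load is to check the search and write thread pools on each client node. The following command is for your reference (it uses the standard _cat API; the thread pool names and columns can be adjusted as needed). Persistently large queue values or growing rejected counts indicate that the nodes cannot keep up:

  GET _cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected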

Solutions

Scenario 1: Unbalanced load (some of the nodes are overloaded)

  • Solution 1: Modify the client connection settings to include the IP addresses of all client nodes so that requests are distributed evenly across them. (To confirm that the load is uneven, see the check after this list.)
  • Solution 2: Use ELB to access your CSS cluster and improve the cluster's availability and performance through load balancing.
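
To compare load across the client nodes, check CPU usage, load average, and heap usage per node. The following command is for your reference (it uses the standard _cat API; the columns and sort order can be adjusted as needed):

  GET _cat/nodes?v&h=name,node.role,cpu,load_1m,heap.percent&s=cpu:desc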

Scenario 2: Abnormal traffic or heavy load (CPU overload)

  • Solution 1: Block or throttle IP addresses with abnormal traffic.
    1. Identify IP addresses with abnormal traffic.

      On the cluster's Intelligent O&M page, select Intelligent Diagnostics, and check Client Connection Check (Source IP Address Analysis) to identify abnormal IP addresses.

    2. Block or throttle these IP addresses. (The restriction can be removed later; see the recovery commands after this list.)

      The following is a command for your reference (supported only in Elasticsearch 7.6.2, Elasticsearch 7.10.2, and OpenSearch 2.19.0):

      PUT /_cluster/settings
      {
        "persistent": {
          "flowcontrol.http.enabled": true,
          "flowcontrol.http.deny": "192.168.1.100,192.168.1.101" // Replace with the IP addresses generating abnormal traffic.
        }
      }
  • Solution 2: If it is not practical to block or throttle IP addresses with abnormal traffic, temporarily close problematic indexes instead. You can reopen them once traffic returns to normal (see the recovery commands after this list).

    Closing an index will make it completely inaccessible for both read and write operations.

    1. Identify indexes responsible for abnormal resource usage.

      Run the following command. In the output, look for records whose action is indices:data/read/search and identify indexes with an abnormally long running_time:

      GET _cat/tasks?v&detailed=true

      In the example output below, index_20251210 shows abnormal resource usage: its search task has been running for 1.7 hours.

      action                         task_id                        parent_task_id                 type      start_time    timestamp running_time ip             node                 description
      cluster:monitor/tasks/lists    16oEhzTLSxOmpOAKcDWiRA:4405923 -                              transport 1765329559256 01:37:39  84micros    192.168.77.164  css-test-ess-esn-3-1 
      indices:data/read/search       GbveasILSxOutqupHepB4Q:1511246 -                              transport 1765329590043 01:19:50  1.7h        192.168.127.133 css-test-ess-client-esn-1-1 indices[index_20251210], types[], search_type[QUERY_THEN_FETCH], source[{"size":10000,"query":{"ids":{"values":["99998-786522498053054743"],"boost":1.0}}}]
    2. Temporarily close the abnormal index.
      POST /index_20251210/_close  
  • Solution 3: Add more client nodes or upgrade their specifications. If the ELB service is used, restart it for the node changes to take effect.
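
After the abnormal traffic subsides, you can roll back the temporary measures described above. The following commands are for your reference; they assume the flowcontrol settings and the example index name (index_20251210) used earlier, and that the flowcontrol deny list behaves like a standard dynamic cluster setting (setting it to null restores the default):

  PUT /_cluster/settings
  {
    "persistent": {
      "flowcontrol.http.deny": null // Remove the temporary IP deny list.
    }
  }

  // Reopen the index that was temporarily closed.
  POST /index_20251210/_open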

Scenario 3: Exhaustion of JVM resources (JVM overload)

  1. Restart client nodes that are unresponsive.
  2. Wait about 5 minutes, then check JVM usage again (see the example check after this list).
  3. If the load remains high, use the same methods as for CPU overload: block or throttle abnormal IP addresses, add capacity, and so on.
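
A reference command for checking JVM usage is shown below. It uses the standard node stats API; the filter_path parameter simply trims the response to each node's heap usage percentage and garbage collection counters. Heap usage that stays close to the limit, together with rapidly growing GC counts, indicates that JVM resources are still exhausted:

  GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors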