Protecting a CCE Cluster Against Overload

Cluster overload occurs when a Kubernetes cluster's compute, storage, or network demands exceed its processing capacity, exhausting key control plane components (such as etcd and kube-apiserver) or worker nodes. This can severely degrade cluster performance or even cause operational failures. To prevent this, proactive overload protection is essential. Policies like overload control and LIST request optimization help maintain the stability of core services, ensuring that sudden load spikes do not result in service disruptions. This section explores the causes and impacts of cluster overload, explains CCE cluster overload control, and provides best practices for maintaining stability.

Causes of Cluster Overload

Cluster overload affects stability and service continuity within a Kubernetes environment. Common causes include:
  • Resource requests exceeding cluster capacity: Core components on the cluster control plane include etcd and kube-apiserver. etcd is the backend database that stores all cluster data. kube-apiserver serves as the entry point to the control plane and processes requests, caching cluster data to lessen the burden on etcd. Other core components in the cluster also cache various resources and monitor changes to these resources. When the demand for compute, network, and storage resources surpasses the cluster capacity, these components become heavily loaded. If this load exceeds a certain threshold, the cluster becomes overloaded.
  • Large data queries (such as multiple LIST requests or a single request that retrieves a large amount of data): If a client obtains pod data using field selectors but the kube-apiserver cache does not contain the requested information, the request must fall through to etcd for the full pod data, because etcd cannot filter data by field. kube-apiserver then deserializes the pod data into structured objects in memory, traverses each pod to match the requested fields, and serializes and returns the matching results. When many such queries run concurrently, the resource utilization of each component rises sharply, leading to issues such as etcd latency spikes, OOM errors in kube-apiserver, and control loop imbalances. Consequently, the entire cluster becomes overloaded. The two access patterns are contrasted in the sketch after this list.
    Figure 1 Example of a large amount of data obtained from a client
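As a rough illustration (not taken from CCE documentation), the following Go sketch contrasts the two LIST patterns described above using client-go. The kubeconfig path, namespace, and node name are placeholder assumptions; the point is only that a field-selector LIST without resourceVersion can fall through to etcd, whereas resourceVersion="0" lets kube-apiserver answer from its watch cache (possibly returning slightly stale data).

```go
// Sketch: two LIST patterns and their cost on the control plane.
// Assumes a kubeconfig at the default path; "worker-node-1" and "default" are placeholders.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Expensive pattern: a field-selector LIST without resourceVersion. If the
	// kube-apiserver cache cannot serve it, the full pod set may be read from etcd
	// and filtered in kube-apiserver memory.
	_, err = client.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=worker-node-1",
	})
	if err != nil {
		panic(err)
	}

	// Cheaper pattern: resourceVersion="0" lets kube-apiserver answer from its
	// watch cache instead of querying etcd; the data may be marginally stale.
	pods, err := client.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{
		ResourceVersion: "0",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("listed %d pods from the cache\n", len(pods.Items))
}
```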

Impacts of Cluster Overload

When a cluster experiences overload, Kubernetes API response delays increase and resource utilization on control plane nodes rises, affecting both the control plane and related services. The key areas affected are as follows:

  • Kubernetes resource management: Operations such as creating, deleting, updating, or obtaining Kubernetes resources may fail.
  • Kubernetes leader election: Overload can prevent a leader node from renewing its lease on time. If the lease expires, the leader node loses its leadership role, triggering re-election. This can cause temporary service interruptions, task migrations, scheduling delays, and fluctuations in cluster performance.
  • Cluster management failures: Severe overload may make a cluster unavailable, preventing management operations like node creation or deletion.

CCE Overload Control

  • Overload control: CCE clusters of v1.23 and later support overload control, which reduces the number of LIST requests from outside the system when the control plane is under high resource pressure. To use this function, enable overload control for your clusters. For details, see Enabling Overload Control for a Cluster.
  • Optimized processing of LIST requests: Starting from CCE cluster versions v1.23.8-r0 and v1.25.3-r0, the processing of LIST requests has been optimized. Even if a client does not specify the resourceVersion parameter, kube-apiserver responds from its cache to avoid additional etcd queries while ensuring that the response data is up to date. In addition, namespace indexes have been added to the kube-apiserver cache, so a client can request specific resources in a namespace without fetching the full data of that namespace. This effectively reduces response latency and control plane memory overhead. A client-side access pattern that complements these optimizations is sketched after this list.
  • Refined server-side traffic limiting: The API Priority and Fairness (APF) feature is used to implement fine-grained control over concurrent requests. For details, see API Priority and Fairness.
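The sketch below shows that complementary client-side pattern using generic client-go APIs (not a CCE-specific interface): a shared informer performs a single LIST followed by a WATCH, so subsequent reads are served from a local cache instead of repeated LIST requests against kube-apiserver. The kubeconfig path, namespace, and 30-minute resync period are placeholder assumptions.

```go
// Sketch: serving reads from a local informer cache instead of repeated LIST calls.
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// One LIST+WATCH keeps the local cache in sync; subsequent reads are in-memory.
	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
	podInformer := factory.Core().V1().Pods()
	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Printf("pod added: %s/%s\n", pod.Namespace, pod.Name)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	// Reads hit the local cache, not kube-apiserver.
	pods, err := podInformer.Lister().Pods("default").List(labels.Everything())
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d pods in the default namespace (from local cache)\n", len(pods))
}
```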

Configuration Suggestions

When running services in Kubernetes clusters, cluster performance and availability are influenced by various factors, including the cluster scale, the number and size of resources, and how those resources are accessed. CCE has optimized cluster performance and availability based on cloud native practices and has developed the following suggestions for protecting against cluster overload. You can use these suggestions to ensure that your services run stably and reliably over the long term.

| Category | Suggestion | Billing |
| --- | --- | --- |
| Cluster | Keeping the Cluster Version Up to Date | N/A |
| Cluster | Enabling Overload Control | N/A |
| Cluster | Changing the Cluster Scale | The larger the cluster management scale, the higher the price. For details, see CCE Price Calculator. |
| Cluster | Controlling Data Volume of Resources | N/A |
| Cluster | Controlling the Frequency of Updating Resource Objects | N/A |
| Cluster | Using Multiple Clusters | The costs vary with the number of clusters and the cluster management scale. For details, see CCE Price Calculator. |
| O&M | Enabling Observability | If Prometheus is used to monitor metrics of components on the control plane nodes, the monitoring center reports related metrics to AOM. Metrics within the basic container metrics incur no extra costs; metrics beyond them are billed based on the number of reported metrics, the retention period, and the amount of data dumped. For details, see AOM Price Calculator. After log collection is enabled, you are billed for what you use. For details, see LTS Price Calculator. |
| O&M | Clearing Unused Resources | N/A |
| Application | Optimizing the Client Access Mode | N/A |
| Application | Using ConsistentListFromCache | N/A |
| Application | Strictly Controlling the Frequency and Scope of List Requests | N/A |

Controlling the Frequency of Updating Resource Objects

In a Kubernetes cluster, the control plane typically experiences low load during stable operation and can efficiently handle routine tasks. However, during large-scale change operations, such as frequent resource creation and deletion, and rapid node scaling, the control plane load increases sharply. This surge in load can lead to cluster response delays, timeouts, or even temporary unavailability. These operations often involve a high volume of API requests, status synchronization, and resource scheduling, significantly increasing the resource consumption of components like the API server, etcd, and controller manager.

For example, running 10,000 pods stably in a cluster with up to 2,000 nodes keeps the control plane load controllable. However, if 10,000 jobs are created within one minute in a cluster with up to 500 nodes, a request peak occurs, increasing API server latency or even interrupting services.

Test data supports this observation. When 800 Deployments (each containing nine pods) are created in batches in a v1.32 cluster and the QPS reaches 110, the memory usage of kube-apiserver increases by approximately 20 GiB in a short period. Even when the QPS is reduced to 50, the memory usage still increases by 12 GiB.

Therefore, when performing large-scale resource changes, it is essential to control the change rate based on the current load, resource usage, and historical performance metrics of the cluster. It is advised to use progressive operations and real-time monitoring to ensure the stability of the control plane and prevent performance fluctuations. Additionally, leveraging cloud native observability capabilities, such as using Prometheus to monitor the component metrics of the master nodes, can help maintain cluster health. For details, see Enabling Observability.
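As a hedged illustration of such progressive operations, the Go sketch below uses a token-bucket limiter from golang.org/x/time/rate to cap how fast a batch of Deployments is submitted. The 20 requests/s limit, the "batch-demo" namespace, the Deployment template, and the image are placeholder assumptions; an appropriate rate depends on the cluster scale and the observed control plane load.

```go
// Sketch: rate-limiting bulk Deployment creation so the control plane does not
// receive a request peak. All concrete values below are illustrative assumptions.
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Allow at most 20 create requests per second, with a small burst.
	limiter := rate.NewLimiter(rate.Limit(20), 5)
	replicas := int32(1)

	for i := 0; i < 800; i++ {
		// Block until the limiter grants a token, spreading requests over time.
		if err := limiter.Wait(context.TODO()); err != nil {
			panic(err)
		}
		name := fmt.Sprintf("demo-%d", i)
		deploy := &appsv1.Deployment{
			ObjectMeta: metav1.ObjectMeta{Name: name},
			Spec: appsv1.DeploymentSpec{
				Replicas: &replicas,
				Selector: &metav1.LabelSelector{MatchLabels: map[string]string{"app": name}},
				Template: corev1.PodTemplateSpec{
					ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": name}},
					Spec: corev1.PodSpec{
						Containers: []corev1.Container{{Name: "app", Image: "nginx:alpine"}},
					},
				},
			},
		}
		if _, err := client.AppsV1().Deployments("batch-demo").Create(context.TODO(), deploy, metav1.CreateOptions{}); err != nil {
			fmt.Printf("create %s failed: %v\n", name, err)
		}
	}
}
```

Pausing or lowering the rate when monitored metrics (for example, kube-apiserver latency or memory usage) begin to climb keeps the change progressive rather than bursty.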

Helpful Links