Protecting a CCE Cluster Against Overload
Cluster overload occurs when the demand placed on a Kubernetes cluster's compute, storage, or network resources exceeds its processing capacity, exhausting key control plane components (such as etcd and kube-apiserver) or worker nodes. This can severely degrade cluster performance or even cause operational failures. Proactive overload protection is therefore essential: policies such as overload control and LIST request optimization help keep core services stable so that sudden load spikes do not cause service disruptions. This section explores the causes and impacts of cluster overload, explains CCE cluster overload control, and provides best practices for maintaining stability.
Causes of Cluster Overload
- Resource request exceeding cluster capacity: Core components on the cluster control plane include etcd and kube-apiserver. etcd is a backend database that stores all cluster data. kube-apiserver acts as the entry to the control plane and processes requests. kube-apiserver caches cluster data to lessen the burden on etcd. Other core components in the cluster also cache various resources and monitor changes to these resources. When the demand for compute, network, and storage resources surpasses the cluster capacity, these components become heavily loaded. If this load exceeds a certain threshold, the cluster becomes overloaded.
- Large data queries (such as many concurrent LIST requests or a single request that retrieves a large amount of data): If a client obtains pod data using field selectors but the kube-apiserver cache does not contain the requested information, the client must fetch the full pod data from etcd, because etcd cannot filter data by field. After retrieval, kube-apiserver deserializes the pod data into structured objects in memory, traverses each pod to match the requested fields, and returns the serialized results. When many such queries run concurrently, the resource utilization of each component rises sharply, leading to issues such as etcd latency spikes, OOM errors in kube-apiserver, and control loop imbalances. As a result, the entire cluster becomes overloaded.
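The difference between a cache-served LIST and an etcd-backed LIST is visible in the request parameters a client sends. The following illustrative sketch (not CCE-specific; the endpoint path is just a standard pod LIST URL) builds the two kinds of requests: one with a `fieldSelector`, which without the optimizations described below can force a full read and in-memory filter, and one with `resourceVersion=0`, which asks kube-apiserver to answer from its own cache.

```python
from urllib.parse import urlencode

API = "/api/v1/namespaces/default/pods"  # standard pod LIST endpoint

def list_url(base, **params):
    """Build a kube-apiserver LIST URL with the given query parameters."""
    return f"{base}?{urlencode(params)}" if params else base

# A field-selector LIST: kube-apiserver may need the full pod data to
# filter by field, because etcd itself cannot filter by field.
expensive = list_url(API, fieldSelector="spec.nodeName=node-1")

# resourceVersion=0 means "any version is acceptable", so kube-apiserver
# can serve the response from its cache instead of querying etcd.
cheap = list_url(API, resourceVersion="0")

print(expensive)  # /api/v1/namespaces/default/pods?fieldSelector=spec.nodeName%3Dnode-1
print(cheap)      # /api/v1/namespaces/default/pods?resourceVersion=0
```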
Figure 1 Example of a large amount of data obtained from a client
Impacts of Cluster Overload
When a cluster is overloaded, Kubernetes API response delays increase and resource utilization on control plane nodes rises, affecting both the control plane and related services. The key areas impacted are as follows:
- Kubernetes resource management: Operations such as creating, deleting, updating, or obtaining Kubernetes resources may fail.
- Kubernetes leader election: Overload can prevent a leader node lease renewal from completing on time. If the lease expires, the leader node loses its leadership role, triggering re-election. This can cause temporary service interruptions, task migrations, scheduling delays, and fluctuations in cluster performance.
- Cluster management failures: Severe overload may make a cluster unavailable, preventing management operations like node creation or deletion.
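Leader election relies on a lease object that the current leader must renew before it expires; overload can delay the renewal request past the deadline. The timing logic can be sketched in a few lines (a self-contained illustration, not the actual client-go implementation; the field names are simplified):

```python
from dataclasses import dataclass

@dataclass
class Lease:
    holder: str
    renew_time: float  # time of last successful renewal (seconds)
    duration: float    # lease duration (seconds)

def needs_reelection(lease: Lease, now: float) -> bool:
    """The lease expires if the holder fails to renew it in time.

    Under overload, the renewal request may arrive after the lease
    duration has elapsed, so another candidate can take leadership.
    """
    return now > lease.renew_time + lease.duration

lease = Lease(holder="controller-a", renew_time=100.0, duration=15.0)

print(needs_reelection(lease, now=110.0))  # renewed recently -> False
print(needs_reelection(lease, now=120.0))  # renewal overdue  -> True
```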
CCE Overload Control
- Overload control: CCE clusters support overload control from v1.23. When the control plane is under high resource pressure, this function reduces the number of LIST requests from outside the system. To use this function, enable overload control for your clusters. For details, see Enabling Overload Control for a Cluster.
- Optimized processes on LIST requests: Starting from CCE clusters of v1.23.8-r0 and v1.25.3-r0, processes on LIST requests have been optimized. Even if a client does not specify the resourceVersion parameter, kube-apiserver responds to requests based on its cache to avoid additional etcd queries and ensure that the response data is up to date. Additionally, namespace indexes are now added to the kube-apiserver cache. This allows clients to request specified resources in a namespace without needing to fetch full data for that namespace. This effectively reduces the response delay and control plane memory overhead.
- Refined traffic limit policy on the server: The API Priority and Fairness (APF) feature is used to implement fine-grained control on concurrent requests. For details, see API Priority and Fairness.
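As a concrete example of what APF configuration looks like, a FlowSchema can route LIST-heavy clients into a low-priority level so they cannot starve critical traffic. The manifest below is an illustrative sketch only: the names `limit-heavy-listers` and `batch-lister` are hypothetical, `workload-low` is one of the built-in priority levels, and the flowcontrol API version depends on your cluster version (for example, `v1beta2` on v1.23).

```yaml
apiVersion: flowcontrol.apiserver.k8s.io/v1beta2
kind: FlowSchema
metadata:
  name: limit-heavy-listers        # hypothetical name
spec:
  priorityLevelConfiguration:
    name: workload-low             # built-in low-priority level
  matchingPrecedence: 1000
  distinguisherMethod:
    type: ByUser
  rules:
  - subjects:
    - kind: ServiceAccount
      serviceAccount:
        name: batch-lister         # hypothetical service account
        namespace: default
    resourceRules:
    - verbs: ["list"]
      apiGroups: [""]
      resources: ["pods"]
      clusterScope: true
      namespaces: ["*"]
```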
Configuration Suggestions
When running services in Kubernetes clusters, the cluster performance and availability are influenced by various factors, including the cluster scale, number and volume of resources, and resource access. CCE has optimized cluster performance and availability based on cloud native practices and has developed suggestions to protect against cluster overload. You can use these suggestions to ensure that your services run stably and reliably over the long term.
| Category | Suggestion | Billing |
| --- | --- | --- |
| Cluster | Keep the cluster version up to date. | N/A |
| Cluster | Enable overload control. | N/A |
| Cluster | Change the cluster scale. | The larger the cluster management scale, the higher the price. For details, see CCE Price Calculator. |
| Cluster | Control data volume of resources. | N/A |
| Cluster | Use multiple clusters. | The costs vary with the number of clusters and the cluster management scale. For details, see CCE Price Calculator. |
| O&M | Enable observability. | |
| O&M | Clear unused resources. | N/A |
| Application | Optimize the client access mode. | N/A |
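One common way to optimize the client access mode is to paginate LIST requests with the API's `limit` and `continue` parameters instead of fetching all resources in a single response. The sketch below simulates that protocol with a fake page fetcher (`fetch_page` stands in for real API calls; the names are illustrative):

```python
def fetch_page(items, limit, token):
    """Fake server: return one page plus a continue token, mimicking
    the Kubernetes LIST limit/continue semantics."""
    start = int(token or 0)
    page = items[start:start + limit]
    next_token = str(start + limit) if start + limit < len(items) else ""
    return page, next_token

def list_all(items, limit):
    """Client: retrieve items in small chunks so that no single
    response is large enough to strain kube-apiserver."""
    result, token = [], ""
    while True:
        page, token = fetch_page(items, limit, token)
        result.extend(page)
        if not token:
            return result

pods = [f"pod-{i}" for i in range(10)]
print(list_all(pods, limit=3))  # all 10 pods, fetched 3 at a time
```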
Helpful Links
- Which Resource Quotas Should I Pay Attention To When Using CCE?
- For details about how to properly plan the VPC network and container network in a cluster, see Planning CIDR Blocks for a Cluster.
- For details about how to ensure workload stability and reliability, see Recommended Configurations for Workloads.
- For details about how to prevent and avoid high-risk operations on a cluster, see High-Risk Operations.