
Auto Scaling

Pod Orchestration and Scheduling describes how to control the number of pods by using controllers such as Deployments. You can manually scale an application in or out by adjusting the number of pods, but manual scaling is slow and cumbersome, which becomes a problem when fast scaling is required to handle traffic surges.

To solve this, Kubernetes supports auto scaling for both pods and nodes. By defining auto scaling rules, Kubernetes can dynamically scale pods and nodes based on metrics like CPU usage.

Prometheus and Metrics Server

To enable auto scaling in Kubernetes, the system must first be able to monitor key performance metrics, such as the CPU and memory usage of nodes, pods, and containers. However, Kubernetes does not include built-in monitoring capabilities. Instead, it relies on external projects to extend its functionality.

  • Prometheus is an open-source monitoring and alerting framework that collects a wide range of metrics, making it the standard monitoring solution for Kubernetes.
  • Metrics Server is a cluster-wide aggregator of resource usage data in Kubernetes clusters. It pulls data from the Summary API exposed by kubelet and serves it to other components through standardized APIs, providing insight into core resources such as nodes, pods, and containers (see the example query after this list).
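
For example, once Metrics Server is deployed in a cluster, you can query the current CPU and memory usage it reports. Both of the following commands read from the Metrics API that Metrics Server exposes:

$ kubectl top nodes
$ kubectl top pods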

Horizontal Pod Autoscaler (HPA) integrates with Metrics Server to implement auto scaling based on CPU and memory usage. Additionally, HPA can work with Prometheus to enable auto scaling using custom monitoring metrics.
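
For example, with a metrics adapter such as prometheus-adapter installed, the metrics section of an HPA spec (shown in full later in this section) can target a custom per-pod metric. The following is only a sketch; the metric name http_requests_per_second is a hypothetical example that the adapter would have to expose:

metrics:
- type: Pods                             # Custom per-pod metric served through the custom metrics API
  pods:
    metric:
      name: http_requests_per_second     # Hypothetical metric exposed by the metrics adapter
    target:
      type: AverageValue
      averageValue: "10"                 # Scale so that each pod handles 10 requests/s on average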

How HPA Works

An HPA controls the horizontal scaling of pods. It periodically checks pod metrics, calculates the number of pods required to meet the configured target values, and then updates the replicas field of the target workload, such as a Deployment.

Figure 1 HPA working rules

You can configure one or more metrics for an HPA. When only one metric is used, the HPA totals the metric values from the current pods, divides that total by the expected value, and rounds up the result to determine the required number of pods. For example, if a Deployment has three pods with the CPU usage of each pod at 70%, 50%, and 90%, respectively, and the expected CPU usage configured for HPA is 50%, the expected number of pods is calculated as follows: (70 + 50 + 90)/50 = 4.2. The required number of pods is rounded up to 5.
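
This matches the general formula used by the HPA controller, as documented for Kubernetes:

desiredReplicas = ceil[currentReplicas × (currentMetricValue / desiredMetricValue)]

Because the current metric value is the per-pod average, multiplying it by the current replica count is the same as summing the per-pod values, so both formulations give 210/50 = 4.2 here, rounded up to 5.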

If multiple metrics are configured, the expected number of pods is calculated for each metric separately, and the largest result is used.
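
For example, the following metrics list (a fragment of an HPA spec in the autoscaling/v2 API; the 80% memory target is an arbitrary illustration) declares both CPU and memory targets:

metrics:
- type: Resource
  resource:
    name: cpu
    target:
      type: Utilization
      averageUtilization: 70
- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 80   # Each metric yields its own pod count; the larger result is applied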

Using an HPA

The following example demonstrates how to use an HPA. First, create a Deployment with four pods using an Nginx image.
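
A minimal manifest such as the following could be used (a sketch; note that the container declares a CPU request, because the HPA's Utilization target is calculated as a percentage of the requested resources):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 4
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        resources:
          requests:
            cpu: 100m        # Required for the HPA's CPU utilization calculation

$ kubectl create -f deployment.yaml
deployment.apps/nginx-deployment created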

$ kubectl get deploy
NAME               READY     UP-TO-DATE   AVAILABLE   AGE
nginx-deployment   4/4       4            4           77s

$ kubectl get pods
NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment-7cc6fd654c-5xzlt   1/1       Running   0          82s
nginx-deployment-7cc6fd654c-cwjzg   1/1       Running   0          82s
nginx-deployment-7cc6fd654c-dffkp   1/1       Running   0          82s
nginx-deployment-7cc6fd654c-j7mp8   1/1       Running   0          82s

Create an HPA. The expected CPU usage is 70%, and the number of pods ranges from 1 to 10.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: scale
  namespace: default
spec:
  scaleTargetRef:                    # Target resource
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 1                     # The minimum number of pods for the target resource
  maxReplicas: 10                    # The maximum number of pods for the target resource
  metrics:                           # Metric. The expected CPU usage is 70%.
  - type: Resource
    resource:
      name: cpu
      target: 
        type: Utilization
        averageUtilization: 70
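
The same autoscaler can also be created imperatively with kubectl autoscale, although the resulting HPA is then named after the Deployment rather than scale:

$ kubectl autoscale deployment nginx-deployment --cpu-percent=70 --min=1 --max=10

The rest of this example uses the YAML manifest above.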

Create the HPA and check its details.

$ kubectl create -f hpa.yaml
horizontalpodautoscaler.autoscaling/scale created

$ kubectl get hpa
NAME      REFERENCE                     TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
scale     Deployment/nginx-deployment   0%/70%    1         10        4          18s

In the command output, the target value of TARGETS is 70%, but the actual value is 0% because the Nginx pods are idle. This means that the HPA will scale in some pods. The expected number of pods can be calculated as follows: (0 + 0 + 0 + 0)/70 = 0. However, because the minimum number of pods was set to 1, the Deployment is scaled to one pod. After a while, you can see that only one pod remains.

$ kubectl get pods
NAME                                READY     STATUS    RESTARTS   AGE
nginx-deployment-7cc6fd654c-5xzlt   1/1       Running   0          7m41s

Check the HPA again. Under Events, there is a record similar to the following. It shows that 21 seconds earlier, the HPA scaled in the Deployment to one pod because all metrics were below their targets, so the calculated pod count was lower than the current one.

$ kubectl describe hpa scale
...
Events:
  Type    Reason             Age   From                       Message
  ----    ------             ----  ----                       -------
  Normal  SuccessfulRescale  21s   horizontal-pod-autoscaler  New size: 1; reason: All metrics below target

If you check the Deployment details again, you can see that there is a record similar to the following under Events. This record shows that the number of Deployment pods has been adjusted to 1, aligning with the HPA configuration.

$ kubectl describe deploy nginx-deployment
...
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  7m    deployment-controller  Scaled up replica set nginx-deployment-7cc6fd654c to 4
  Normal  ScalingReplicaSet  1m    deployment-controller  Scaled down replica set nginx-deployment-7cc6fd654c to 1
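
To watch the HPA scale the Deployment back out, you can drive CPU load against the pods. One rough approach is sketched below; it assumes you first expose the Deployment as a Service named nginx, and since serving static pages is cheap, a sustained request loop may be needed before CPU usage crosses 70%:

$ kubectl expose deployment nginx-deployment --name=nginx --port=80
$ kubectl run load-generator --rm -it --image=busybox --restart=Never -- /bin/sh -c "while true; do wget -q -O- http://nginx > /dev/null; done"

Once the average CPU usage of the pods exceeds 70%, kubectl get hpa shows the REPLICAS value increasing again. Stop the load generator with Ctrl+C to let the HPA scale the Deployment back in.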

Cluster Autoscaler

HPAs focus on scaling pods, but when cluster resources become insufficient, the only option is to add nodes. Scaling cluster nodes can be complex, but in cloud-based environments, nodes can be dynamically added or removed using APIs, making the process much more convenient.

Kubernetes offers Cluster Autoscaler, a component designed to automatically scale cluster nodes based on pod scheduling demands and resource usage. However, because this relies on cloud provider APIs, the implementation and usage vary across different environments.

For details about the implementation in CCE, see Creating a Node Scaling Policy.