Pod Compaction
During workload scale-in, pod compaction allows controllers to preferentially release the pods that create NPU fragments, based on the NPU fragmentation of the host nodes. This policy frees up larger, more contiguous NPU resources, improving overall NPU utilization in the cluster and minimizing resource waste. Specifically, Volcano Scheduler assigns scores to pods managed by Deployments (including ReplicaSets) and other specified workload types according to the pod compaction policy: pods on nodes with lower NPU utilization receive lower scores. During scale-in, the controller removes lower-scored pods first. When multiple pods share the same score, the controller selects pods for eviction at random. This approach improves the success rate of subsequent pod scheduling and the overall NPU utilization.
For example, suppose a Deployment has two pods on each of two nodes, and each pod uses one NPU. One node has no idle NPUs, while the other has four idle NPUs. When two pods of the Deployment need to be removed, the controller prioritizes removing the two pods with lower scores from the node with idle NPUs according to their NPU fragment scores. This releases more NPU resources, allowing for more efficient scheduling of other services.
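To see how the scores surface, note that Volcano records each pod's scale-in score as an annotation on the pod. A minimal sketch for reading the score of a single pod, assuming the controller.kubernetes.io/pod-deletion-cost annotation key (the actual key depends on the workload_balancer_score_annotation_key setting described below); <pod-name> is a placeholder:
# Read the scale-in score that Volcano wrote to the pod.
kubectl get pod <pod-name> -o jsonpath='{.metadata.annotations.controller\.kubernetes\.io/pod-deletion-cost}'
A lower (more negative) value means the pod is removed earlier during scale-in.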
Notes and Constraints
- The Volcano Scheduler add-on must be installed in the cluster, and the add-on version must be 1.18.1 or later.
- Pod compaction only supports scoring based on the NPU fragments on nodes.
- By default, pod compaction supports Deployments (including ReplicaSets). To enable support for third-party workloads, refer to Enabling Pod Compaction.
- If pod compaction is disabled, existing scale-in score annotations on pods are not cleared immediately, and pods with uncleared annotations are still processed according to the pod compaction policy. After a pod restarts, its score annotation is removed automatically and the pod compaction policy no longer applies to it. (A sketch for clearing the annotations manually follows this list.)
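If you disable pod compaction and do not want the remaining score annotations to influence later scale-ins, you can remove them manually instead of waiting for the pods to restart. A minimal sketch, assuming the controller.kubernetes.io/pod-deletion-cost annotation key and the app=deploy1 label used in the example later on this page:
# Remove the scale-in score annotation from all pods of the workload.
# The trailing "-" tells kubectl to delete the annotation.
kubectl annotate pods -l app=deploy1 controller.kubernetes.io/pod-deletion-cost-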
Enabling Pod Compaction
- Log in to the CCE console and click the cluster name to access the cluster console. The Overview page is displayed.
- In the navigation pane, choose Settings. Then, click the Scheduling tab. In the Default Cluster Scheduler area, select Volcano for Default Scheduler and click Try Now in Expert mode.
- In the window that slides out from the right, configure the following parameters to enable pod compaction:
...
enable_scale_in_score: true
scale_in_priority: binpack
workload_balancer_score_annotation_key: 'controller.kubernetes.io/pod-deletion-cost'
workload_balancer_scoring_qps: 100
workload_balancer_third_party_types: ''
...
Table 1 Parameters of pod compaction
enable_scale_in_score
Whether to enable pod compaction. Options:
- true: Pod compaction is enabled.
- false: Pod compaction is disabled. This is the default option.
scale_in_priority
The priority of the pod compaction policy when there are multiple scale-in policies. Options:
- binpack: Pods on the nodes with low NPU utilization are preferentially considered during scale-in. This is the default option.
- wl: Application scaling priorities are preferentially considered during scale-in. Before using this option, enable the application scaling priority policies (see Application Scaling Priority Policies).
workload_balancer_third_party_types
Specifies the third-party workload types that pod compaction applies to.
The value is a string in the format <group>/<version>/<resources>, where <resources> must be the plural form of the third-party workload resource. Separate multiple workload resource types with commas (,). For example, apps.kruise.io/v1alpha1/clonesets indicates OpenKruise CloneSets (see the configuration sketch after this procedure).
If the format is incorrect, Volcano exits abnormally. If the specified CRD is not present in the cluster, pod compaction cannot work properly.
To apply pod compaction to a third-party workload, the controller must implement the corresponding scale-in logic. Specifically, the controller should preferentially remove pods with lower scores in annotations during scale-in. For details about the code implementation, see the pod scale-in priorities in the Kubernetes source code.
workload_balancer_score_annotation_key
The annotation key used for recording pod scale-in scores, which are calculated and written by Volcano. These scores help determine the order in which pods are removed during workload scale-in. Supported values include:
- controller.kubernetes.io/pod-deletion-cost: the annotation key officially recognized by the Kubernetes community. For details, see ReplicaSet.
- openvessel.io/workload-balancer-score: the annotation key used by default when workload_balancer_score_annotation_key is left blank. Before using this annotation key, enable Application Scaling Priority Policies.
workload_balancer_scoring_qps
The QPS limit on pod scoring requests. The value must be a positive integer. The actual QPS is also constrained by the rate limit settings of the API server.
- Click Confirm Settings in the lower right corner. In the displayed dialog box, confirm the modification and click Save.
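As referenced in Table 1, the following is a minimal sketch of the expert-mode settings with third-party workload support enabled. It assumes that OpenKruise CloneSets (apps.kruise.io/v1alpha1/clonesets) are installed in the cluster and that their controller honors the score annotation during scale-in; adjust the values to your environment.
...
enable_scale_in_score: true
scale_in_priority: binpack
workload_balancer_score_annotation_key: 'controller.kubernetes.io/pod-deletion-cost'
workload_balancer_scoring_qps: 100
# Assumption: the apps.kruise.io/v1alpha1/clonesets CRD exists in the cluster.
# An incorrectly formatted value causes Volcano to exit abnormally.
workload_balancer_third_party_types: 'apps.kruise.io/v1alpha1/clonesets'
...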
Example of Using Pod Compaction
This example demonstrates how to use pod compaction after it is enabled. In this example, the related parameters are configured as follows:
...
enable_scale_in_score: true
scale_in_priority: binpack
workload_balancer_score_annotation_key: 'controller.kubernetes.io/pod-deletion-cost'
...
- Create a Deployment.
- Create a YAML file for the workload. You can enter a file name as needed.
vim deploy1.yaml
Example file content:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deploy1
spec:
  replicas: 2
  selector:
    matchLabels:
      app: deploy1
  template:
    metadata:
      labels:
        app: deploy1
    spec:
      containers:
      - name: container1
        image: busybox:latest    # Replace it with the image to be used.
        command: ["sh", "-c", "echo Hello, Kubernetes! && sleep 3600"]
        resources:
          requests:
            cpu: 100m
            memory: 200Mi
            huawei.com/ascend-310: "2"    # Must be consistent with limits.huawei.com/ascend-310.
          limits:
            cpu: 100m
            memory: 200Mi
            huawei.com/ascend-310: "2"    # The number of requested NPU resources
- Create the workload.
kubectl apply -f deploy1.yaml
Information similar to the following is displayed:
deployment.apps/deploy1 created
- Check the pod statuses.
kubectl get pod
In the command output, all pods are in the Running state, which indicates that the workload has been created. (To see which node each pod landed on and how many NPUs are in use on each node, see the sketch after the output below.)
NAME                      READY   STATUS    RESTARTS   AGE
deploy1-b8c6765cd-8tkdx   1/1     Running   0          2s
deploy1-b8c6765cd-cwzsk   1/1     Running   0          2s
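To relate the upcoming scale-in scores to NPU fragmentation, you may want to check where the pods landed and how many NPUs each node provides and has already allocated. A minimal sketch, assuming the huawei.com/ascend-310 resource name from the example Deployment; <node-name> is a placeholder:
# Show which node each pod of the Deployment is running on.
kubectl get pods -l app=deploy1 -o wide

# Check the NPU capacity, allocatable amount, and current NPU requests on a node
# (adjust the resource name to the NPU type used in your cluster).
kubectl describe node <node-name> | grep -i ascend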
- Obtain the scale-in priority score of each pod. (A more compact way to read the scores with a single command is sketched after these sub-steps.)
- Obtain the details of deploy1-b8c6765cd-8tkdx.
kubectl describe pod deploy1-b8c6765cd-8tkdx
Information similar to the following is displayed:
Name:         deploy1-b8c6765cd-8tkdx
Namespace:    default
Labels:       app=deploy1
              pod-template-hash=b8c6765cd
Annotations:  ......
              controller.kubernetes.io/pod-deletion-cost: -200    # Scale-in priority score of the pod
              ......
Status:       Running
......
- Obtain the details of deploy1-b8c6765cd-cwzsk.
kubectl describe pod deploy1-b8c6765cd-cwzsk
Information similar to the following is displayed:
Name:         deploy1-b8c6765cd-cwzsk
Namespace:    default
Labels:       app=deploy1
              pod-template-hash=b8c6765cd
Annotations:  ......
              controller.kubernetes.io/pod-deletion-cost: -100    # Scale-in priority score of the pod
              ......
Status:       Running
......
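As noted above, instead of running kubectl describe for each pod, a more compact way to read the scores is a jsonpath query over all pods of the Deployment, assuming the controller.kubernetes.io/pod-deletion-cost annotation key configured in this example:
# Print each pod of deploy1 together with its scale-in priority score.
kubectl get pods -l app=deploy1 -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.controller\.kubernetes\.io/pod-deletion-cost}{"\n"}{end}'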
- Check whether pod compaction works, that is, whether the pod with the lower scale-in priority score has been removed.
- Scale in the workload.
kubectl scale --replicas=1 deployment deploy1    # Replace deploy1 with the name of the workload to be scaled in.
Information similar to the following is displayed:
deployment.apps/deploy1 scaled
- Check whether deploy1-b8c6765cd-8tkdx, the pod with the lower score, has been removed.
kubectl get pod
Information similar to the following is displayed:
NAME                      READY   STATUS    RESTARTS   AGE
deploy1-b8c6765cd-cwzsk   1/1     Running   0          10m
The command output shows that the deploy1-b8c6765cd-8tkdx pod has been removed. This means that pod compaction has taken effect for the workload.