Hierarchical Queues
In multi-tenant scenarios, queues are a core mechanism for fair scheduling, resource isolation, and job priority control. In real-world applications, different queues usually belong to different departments, and there are hierarchical relationships between departments, which lead to more refined requirements for resource allocation and preemption. However, traditional peer queues cannot meet such requirements. To address this issue, Volcano Scheduler introduces hierarchical queues to implement resource allocation, sharing, and preemption between queues at different levels. With hierarchical queues, you can manage resource quotas at a finer granularity and build a more efficient unified scheduling platform.
Introduction to Hierarchical Queues
Hierarchical queues are used to implement hierarchical resource allocation and isolation in a multi-tenant cluster environment. Queues are organized in a tree structure, with the following functions:
- The queue hierarchy can be configured. The parent attribute is added to QueueSpec of Volcano Scheduler. When creating a queue, you can use the parent attribute to specify the parent queue that a queue belongs to.
type QueueSpec struct {
    ...
    // Specify the parent queue that a queue belongs to.
    Parent string `json:"parent,omitempty" protobuf:"bytes,8,opt,name=parent"`
    ...
}
After Volcano Scheduler is started, a root queue is created by default. You can create a hierarchical queue tree based on the root queue.
- You can set capability (the maximum resources for a queue), deserved (if the allocated resources of a queue exceed the value of deserved, the excess resources may be reclaimed), and guarantee (resources reserved for a queue, which cannot be shared with other queues) for resources in each dimension. A sample queue manifest combining these settings is shown after this list.
- Resources can be shared and reclaimed across hierarchical queues. If the cluster resources are insufficient for pod deployment, pod resources of queues at other levels can be reclaimed. The rules for reclaiming resources across hierarchical queues are as follows:
- If the allocated resources of a sibling queue exceed the deserved value, pod resources of the sibling queue are reclaimed first.
- If the resources in the sibling queue are insufficient to meet the requirements of the pod, the hierarchical structure of the queues (for example, ancestor queues) will be traversed upward to find sufficient resources.
In Figure 1, Job A and Job C are submitted first, and both the allocated resources of the queues exceed the deserved value. If the cluster resources are insufficient for Job B, the system preferentially reclaims resources from Job A. If the resources are still insufficient after resources from Job A are reclaimed, the system reclaims resources from Job C.
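The following is a minimal sketch of a queue manifest that combines the three resource dimensions. The queue name demo-queue and the resource values are illustrative only, and the guarantee field follows the open-source Volcano Queue API, where reserved resources are set under spec.guarantee.resource. You can also run kubectl get queue root -o yaml to confirm that the default root queue exists before attaching child queues to it.
# Illustrative example only: a queue that sets capability, deserved, and guarantee.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: demo-queue            # Hypothetical queue name used for illustration.
spec:
  parent: root                # Attach the queue directly under the root queue.
  capability:                 # Upper limit of resources the queue can use.
    cpu: 4
    memory: 8Gi
  deserved:                   # Resources the queue should obtain; allocations above this value can be reclaimed.
    cpu: 2
    memory: 4Gi
  guarantee:                  # Resources reserved for this queue and not shared with other queues.
    resource:
      cpu: 1
      memory: 2Gi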
Prerequisites
- A CCE standard or Turbo cluster of v1.27 or later is available. For details about how to create a cluster, see Buying a CCE Standard/Turbo Cluster.
- The Volcano Scheduler add-on of v1.17.1 or later has been installed. For details, see Volcano Scheduler.
Notes and Constraints
This feature is in the open beta testing (OBT) phase and is available for trial use. However, its stability has not been fully verified, and the CCE SLA does not apply.
Configuring a Hierarchical Queue Policy
After configuring a hierarchical queue policy, you can specify the hierarchical relationships between queues for sharing and reclaiming resources across queues and managing resource quotas at a finer granularity.
- Log in to the CCE console and click the cluster name to access the cluster console.
- In the navigation pane, choose Settings. Then click the Scheduling tab.
- In Volcano Scheduler configuration, hierarchical queues are disabled by default. You need to modify the parameters to enable this feature.
- In Default Cluster Scheduler > Expert mode, click Try Now.
Figure 2 Expert mode > Try Now
- Enable the capacity plugin and set enableHierarchy to true. The hierarchical queue capability relies on the capacity plugin. You also need to enable the reclaim action for resource reclamation between queues. When queue resources are insufficient, resource reclamation is triggered. The system preferentially reclaims resources that exceed the deserved value of the queue and selects an appropriate reclamation object based on the queue/job priority.
The capacity plugin and proportion plugin conflict with each other. Ensure that the proportion plugin configuration has been removed when using the capacity plugin.
Add the following parameters to the YAML file:
...
default_scheduler_conf:
  actions: allocate, backfill, preempt, reclaim    # Enable the reclaim action.
  metrics:
    interval: 30s
    type: ''
  tiers:
    - plugins:
        - name: priority
        - enableJobStarving: false
          enablePreemptable: false
          name: gang
        - name: conformance
    - plugins:
        - enablePreemptable: false
          name: drf
        - name: predicates
        - name: capacity            # Enable the capacity plugin.
          enableHierarchy: true     # Enable hierarchical queues.
        - name: nodeorder
        - arguments:
            binpack.cpu: 1
            binpack.memory: 1
            binpack.resources: nvidia.com/gpu
            binpack.resources.nvidia.com/gpu: 2
            binpack.weight: 10
          name: binpack
- Click Save in the lower right corner.
- Click Confirm Settings in the lower right corner. In the displayed dialog box, confirm the modification and click Save.
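To double-check that the capacity plugin and the reclaim action are active, you can inspect the scheduler configuration that the add-on renders in the cluster. This is only a sketch: in open-source Volcano the configuration is typically stored in a ConfigMap named volcano-scheduler-configmap, but the exact ConfigMap name and namespace used by the CCE add-on may differ.
# List Volcano-related ConfigMaps and inspect the rendered scheduler configuration.
kubectl get configmap -A | grep volcano
kubectl get configmap volcano-scheduler-configmap -n kube-system -o yaml   # Name and namespace are assumptions.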
Use Case
Assume that there are 8 CPU cores and 16-GiB memory available for a cluster. First, create a hierarchical queue tree. Second, create two Volcano jobs (job-a and job-c) to exhaust cluster resources. Finally, create a Volcano job (job-b) and check the resource reclamation in hierarchical queues. Figure 3 shows the overall structure of this example.
- Create a YAML file for the hierarchical queue tree.
vim hierarchical_queue.yaml
The file content is as follows:
# The parent queue of child-queue-a is the root queue.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: child-queue-a
spec:
  reclaimable: true
  parent: root
  capability:
    cpu: 5
    memory: 10Gi
  deserved:
    cpu: 4
    memory: 8Gi
---
# The parent queue of child-queue-b is the root queue.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: child-queue-b
spec:
  reclaimable: true
  parent: root
  deserved:
    cpu: 4
    memory: 8Gi
---
# The parent queue of subchild-queue-a1 is child-queue-a.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: subchild-queue-a1
spec:
  reclaimable: true
  parent: child-queue-a
  # Set deserved as required. If the allocated resources of a queue exceed the value of deserved, resources used by the queue may be reclaimed.
  deserved:
    cpu: 2
    memory: 4Gi
---
# The parent queue of subchild-queue-a2 is child-queue-a.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: subchild-queue-a2
spec:
  reclaimable: true
  parent: child-queue-a
  # Set deserved as required. If the allocated resources of a queue exceed the value of deserved, resources used by the queue may be reclaimed.
  deserved:
    cpu: 2
    memory: 4Gi
The following uses the YAML file of child-queue-a as an example to describe the parameters of hierarchical queues. For more parameter information, see Queue | Volcano.
Table 1 Hierarchical queue parameters

Parameter: reclaimable
Example Value: true
Description: (Optional) Specifies whether to enable the resource reclamation policy.
- true (default): If the resource usage of a queue exceeds the value of deserved, other queues can reclaim the resources that are overused by the queue.
- false: Other queues cannot reclaim the resources that are overused by the queue.

Parameter: parent
Example Value: root
Description: (Optional) Specifies the parent queue. The queues are hierarchical, and the total resources of a child queue are limited by the parent queue. If parent is not specified, the parent queue is the root queue by default.

Parameter: capability
Example Value: cpu: 5, memory: 10Gi
Description: (Optional) Specifies the upper limit of resources for the queue. The value cannot exceed the capability value of the parent queue.
If the capability value of a resource is not set for a queue, the value is inherited from its parent queue. If neither the parent queue nor any of its ancestor queues sets the value, the setting of the root queue is inherited. By default, the capability value of the root queue is the total available resources in the cluster.

Parameter: deserved
Example Value: cpu: 4, memory: 8Gi
Description: Specifies the resources that should be obtained by a queue. The total deserved value of the child queues cannot exceed the deserved value configured for the parent queue, and the deserved value of a queue must be less than or equal to its capability value. The default deserved value of the root queue is the same as its capability value.
If the resources allocated to the queue exceed the deserved value, the queue cannot reclaim resources from other queues.
- Create a hierarchical queue tree.
kubectl apply -f hierarchical_queue.yaml
Information similar to the following is displayed:
queue.scheduling.volcano.sh/child-queue-a created
queue.scheduling.volcano.sh/child-queue-b created
queue.scheduling.volcano.sh/subchild-queue-a1 created
queue.scheduling.volcano.sh/subchild-queue-a2 created
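Optionally, you can verify that the queues were created with the expected parent/child relationships. A quick check (the parent queue is recorded in the spec.parent field of each queue):
# Confirm that the queues exist and check the parent of a child queue.
kubectl get queues
kubectl get queue subchild-queue-a1 -o jsonpath='{.spec.parent}{"\n"}'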
- Create a YAML file for the Volcano jobs job-a and job-c. job-a is submitted to subchild-queue-a1, and job-c to child-queue-b.
vim vcjob.yaml
The file content is as follows:
# Submit job-a to the leaf queue subchild-queue-a1.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-a
spec:
  queue: subchild-queue-a1
  schedulerName: volcano
  minAvailable: 1
  tasks:
    - replicas: 3
      name: test
      template:
        spec:
          containers:
            - image: alpine
              command: ["/bin/sh", "-c", "sleep 1000"]
              imagePullPolicy: IfNotPresent
              name: alpine
              resources:
                requests:
                  cpu: "1"
                  memory: 2Gi
---
# Submit job-c to the leaf queue child-queue-b.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-c
spec:
  queue: child-queue-b
  schedulerName: volcano
  minAvailable: 1
  tasks:
    - replicas: 5
      name: test
      template:
        spec:
          containers:
            - image: alpine
              command: ["/bin/sh", "-c", "sleep 1000"]
              imagePullPolicy: IfNotPresent
              name: alpine
              resources:
                requests:
                  cpu: "1"
                  memory: 2Gi
- Create job-a and job-c.
kubectl apply -f vcjob.yaml
Information similar to the following is displayed:
job.batch.volcano.sh/job-a created
job.batch.volcano.sh/job-c created
- Check the pod statuses.
kubectl get pod
If the following information is displayed and the status of each pod is Running, the cluster CPU and memory are used up.
NAME           READY   STATUS    RESTARTS   AGE
job-a-test-0   1/1     Running   0          3h21m
job-a-test-1   1/1     Running   0          3h31m
job-a-test-2   1/1     Running   0          3h31m
job-c-test-0   1/1     Running   0          24m
job-c-test-1   1/1     Running   0          24m
job-c-test-2   1/1     Running   0          24m
job-c-test-3   1/1     Running   0          24m
job-c-test-4   1/1     Running   0          24m
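Optionally, you can also check how much of each queue's quota is currently in use. This is a sketch; depending on the Volcano version, the allocated resources are typically reported in the queue status:
# Check the resources currently allocated to the leaf queues (field availability may vary by Volcano version).
kubectl get queue subchild-queue-a1 -o jsonpath='{.status.allocated}{"\n"}'
kubectl get queue child-queue-b -o jsonpath='{.status.allocated}{"\n"}'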
- Create a YAML file for job-b.
vim vcjob1.yaml
The file content is as follows:
# Submit job-b to the leaf queue subchild-queue-a2.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: job-b
spec:
  queue: subchild-queue-a2
  schedulerName: volcano
  minAvailable: 1
  tasks:
    - replicas: 2
      name: test
      template:
        spec:
          containers:
            - image: alpine
              command: ["/bin/sh", "-c", "sleep 1000"]
              imagePullPolicy: IfNotPresent
              name: alpine
              resources:
                requests:
                  cpu: "1"
                  memory: 2Gi
- Create job-b.
kubectl apply -f vcjob1.yaml
Information similar to the following is displayed:
job.batch.volcano.sh/job-b created
Resource reclamation is triggered because the cluster CPU and memory are used up.
- job-b first checks job-a in the sibling queue. The resources (3 CPU cores and 6-GiB memory) occupied by job-a exceed the deserved resources (2 CPU cores and 4-GiB memory) of subchild-queue-a1. The over-occupied resources (1 CPU core and 2-GiB memory) can be preferentially reclaimed, but the reclaimed resources still cannot meet the requirements of job-b.
- job-b then searches for resources in the upper-level queue and finally finds job-c in child-queue-b for resource reclamation.
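If you want to see the reclamation from the cluster side, you can inspect the events of the pods that are evicted in the next step (for example, job-a-test-2 and job-c-test-4). A sketch; the exact event reasons depend on the Volcano version:
# Inspect why the evicted pods were terminated during reclamation.
kubectl describe pod job-a-test-2 | tail -n 20
kubectl get events --field-selector involvedObject.name=job-c-test-4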
- Check the pod statuses and verify that resources have been reclaimed.
kubectl get pod
If the following information is displayed, the system is reclaiming resources:
NAME           READY   STATUS        RESTARTS   AGE
job-a-test-0   1/1     Running       0          3h33m
job-a-test-1   1/1     Running       0          3h33m
job-a-test-2   1/1     Terminating   0          3h33m
job-b-test-0   0/1     Pending       0          1m
job-b-test-1   0/1     Pending       0          1m
job-c-test-0   1/1     Running       0          26m
job-c-test-1   1/1     Running       0          26m
job-c-test-2   1/1     Running       0          26m
job-c-test-3   1/1     Running       0          26m
job-c-test-4   1/1     Terminating   0          26m
Wait for several minutes and run the preceding command again to check the pod statuses. If the following information is displayed, the pods of job-b have been scheduled and are running. After job-b is complete and its resources are released, the pods whose resources were reclaimed will run again.
NAME           READY   STATUS    RESTARTS   AGE
job-a-test-0   1/1     Running   0          3h35m
job-a-test-1   1/1     Running   0          3h35m
job-a-test-2   0/1     Pending   0          3h35m
job-b-test-0   1/1     Running   0          2m
job-b-test-1   1/1     Running   0          2m
job-c-test-0   1/1     Running   0          28m
job-c-test-1   1/1     Running   0          28m
job-c-test-2   1/1     Running   0          28m
job-c-test-3   1/1     Running   0          28m
job-c-test-4   0/1     Pending   0          28m
- Check the pod statuses again and check whether job-a-test-2 and job-c-test-4 are re-executed.
kubectl get pod
If the following information is displayed, the pod whose resources have been reclaimed is running again.
NAME           READY   STATUS      RESTARTS   AGE
job-a-test-0   1/1     Running     0          3h48m
job-a-test-1   1/1     Running     0          3h48m
job-a-test-2   1/1     Running     1          3h48m
job-b-test-0   0/1     Completed   0          15m
job-b-test-1   0/1     Completed   0          15m
job-c-test-0   1/1     Running     0          40m
job-c-test-1   1/1     Running     0          30m
job-c-test-2   1/1     Running     0          40m
job-c-test-3   1/1     Running     0          40m
job-c-test-4   1/1     Running     1          40m
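After verifying the behavior, you can clean up the example resources. The commands below assume the file names used in this example; depending on the Volcano version, a queue may need to be closed before it can be deleted.
# Delete the example jobs and the hierarchical queue tree.
kubectl delete -f vcjob.yaml -f vcjob1.yaml
kubectl delete -f hierarchical_queue.yaml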
Reference
- For more information about queues, see Queue Resource Management (capacity Plugin).
- For more information about Volcano scheduling, see Volcano Scheduling Overview.