Help Center/ Cloud Container Engine/ User Guide/ Add-ons/ Scheduling and Elasticity Add-ons/ Volcano Scheduler

Updated on 2026-06-16 GMT+08:00

Volcano Scheduler

Introduction

Volcano is a batch processing platform based on Kubernetes. It provides a series of features required by machine learning, deep learning, bioinformatics, genomics, and other big data applications, as a powerful supplement to Kubernetes capabilities.

Volcano provides general computing capabilities such as a high-performance job scheduling engine, heterogeneous chip management, and job running management. It accesses the computing frameworks for various industries such as AI, big data, gene, and rendering and schedules up to 1000 pods per second for end users, greatly improving scheduling efficiency and resource utilization.

Volcano provides job scheduling, job management, and queue management for computing applications. Its main features are as follows:

Diverse computing frameworks, such as TensorFlow, MPI, and Spark, can run on Kubernetes in containers. Common APIs for batch computing jobs through CRDs, various plugins, and advanced job lifecycle management are provided.
Advanced scheduling capabilities are provided for batch computing and high-performance computing scenarios, including group scheduling, priority-based preemptive scheduling, bin packing, resource reservation, and task topology-aware scheduling.
Queues can be effectively managed for scheduling jobs. Complex job scheduling capabilities such as queue priority and multi-level queues are supported.

Volcano has been open-sourced in GitHub at https://github.com/volcano-sh/volcano.

This section describes how to install and configure the Volcano Scheduler add-on in CCE clusters. For details, see Volcano Scheduling.

When using Volcano as a scheduler, use it to schedule all workloads in the cluster. This prevents resource scheduling conflicts caused by simultaneous working of multiple schedulers.

Notes and Constraints

If the Volcano Scheduler add-on is upgraded from 1.4.7 or earlier to a version later than 1.4.7 and the add-on is deployed in the HA mode, the webhooks.admissionReviewVersions field information in the new version may be incompatible with that in the old version. As a result, VolcanoJob (vcjob) resources cannot be created. You can take the preventive measures by referring to Workaround.

Installing the Add-on

Log in to the CCE console and click the cluster name to access the cluster console.
In the navigation pane, choose Add-ons. Locate Volcano Scheduler on the right and click Install.

On the Install Add-on page, configure the specifications as needed.

If you selected Preset, the system will configure the number of pods and resource quotas for the add-on based on the preset specifications. You can see the configurations on the console.

If you selected Custom, you can adjust the number of pods and resource quotas as needed. High availability is not possible with a single pod. If an error occurs on the node where the add-on pod runs, the add-on will fail.

The resource quotas of the volcano-admission component are related to the cluster scale. For details, see Table 1. The resource quotas of volcano-controller and volcano-scheduler are related to the number of cluster nodes and pods. The recommended values are as follows:

If the number of nodes is less than 100, retain the default configuration. The requested vCPUs are 500m, and the limit is 2000m. The requested memory is 500 MiB, and the limit is 2000 MiB.
If the number of nodes is greater than 100, increase the requested vCPUs by 500m and the requested memory by 1000 MiB each time 100 nodes (10,000 pods) are added. Set the vCPU limit to be 1500m higher than the requested vCPUs, and the memory limit to be 1000 MiB higher than the requested memory.
Recommended formula for calculating the requested value:
- Requested vCPUs: Calculate the number of target nodes multiplied by the number of target pods, perform interpolation search based on the number of nodes in the cluster multiplied by the number of target pods in Table 2, and round up the request value and limit value that are closest to the specifications.
  For example, for 2000 nodes and 20,000 pods, the requested vCPUs are 40 million (Number of target nodes × Number of target pods = 40 million), which is close to the specification of 700/70000 (Number of cluster nodes × Number of pods = 49 million). According to the table below, set the requested vCPUs to 4000m and the limit value to 5500m.
- Requested memory: It is recommended that 2.4 GiB memory be allocated to every 1000 nodes and 1 GiB memory be allocated to every 10,000 pods. The requested memory is the sum of these two values. (The obtained value may be different from the recommended value in Table 2. You can use either of them.)
  Requested memory = Number of target nodes/1000 × 2.4 GiB + Number of target pods/10,000 × 1 GiB
  
  For example, for 2000 nodes and 20,000 pods, the requested memory is 6.8 GiB (2000/1000 × 2.4 GiB + 20,000/10,000 × 1 GiB).

**Table 1** Recommended requested resources and resource limits for volcano-admission
Cluster Scale	CPU Request (m)	CPU Limit (m)	Memory Request (MiB)	Memory Limit (MiB)
50 nodes	200	500	500	500
200 nodes	500	1000	1000	2000
1000 or more nodes	1500	2500	3000	4000

**Table 2** Recommended requested resources and resource limits for volcano-controller and volcano-scheduler
Nodes/Pods in a Cluster	CPU Request (m)	CPU Limit (m)	Memory Request (MiB)	Memory Limit (MiB)
50/5000	500	2000	500	2000
100/10000	1000	2500	1500	2500
200/20000	1500	3000	2500	3500
300/30000	2000	3500	3500	4500
400/40000	2500	4000	4500	5500
500/50000	3000	4500	5500	6500
600/60000	3500	5000	6500	7500
700/70000	4000	5500	7500	8500

Configure the extended functions supported by the add-on.
- Descheduling: After this function is enabled, the volcano-descheduler component is automatically deployed. The scheduler will evict and reschedule pods that do not meet your policy configuration requirements. This helps to balance cluster load and reduce resource fragmentation. For details, see Descheduling.
- Hybrid Service Deployment: After this function is enabled, the volcano-agent component will be automatically deployed in node pools that support hybrid deployment. This reduces resource usage costs and improves resource utilization through node QoS, CPU bursts, and dynamic resource oversubscription. For details, see Enabling Cloud Native Hybrid Deployment.
- NUMA Topology Scheduling: After this function is enabled, the resource-exporter component is deployed automatically. The scheduler then places pods with NUMA affinity, improving performance for high-performance training jobs. For details, see NUMA Affinity Scheduling.
Configure Agency Settings. This configuration requires hybrid deployment of online and offline services (enabled in extended functions) and add-on v1.21.4 or later.
Add-ons require runtime access to other cloud services. You need to configure an add-on agency to authorize these operations. For details, see Custom Agencies for Add-ons.
- Create automatically: CCE automatically creates a default agency and uses this agency for the add-on. If there is a default agency, CCE will not create one again.
- Use existing: Select one from the drop-down list. Ensure that the created agency has been assigned the required permissions. Otherwise, the add-on functions may be unavailable.

Configure deployment policies for the add-on pods.

Scheduling policies do not take effect on the DaemonSet pods of the add-on.
When configuring multi-AZ deployment or node affinity, ensure that there are nodes meeting the scheduling policy and that resources are sufficient in the cluster. Otherwise, the add-on pods cannot run.

**Table 3** Configurations for add-on scheduling
Parameter	Description
Multi-AZ Deployment	Preferred: Deployment pods of the add-on will be preferentially scheduled to nodes in different AZs. If all the nodes in the cluster are deployed in the same AZ, the pods will be scheduled to different nodes in that AZ. Forcible: Deployment pods of the add-on are forcibly scheduled to nodes in different AZs. There can be at most one pod in each AZ. If the nodes in a cluster are not in different AZs, some add-on pods cannot run properly. If a node is faulty, the add-on pods on it may fail to be migrated.
Node Affinity	Not configured: Node affinity is disabled for the add-on pods. Specify node: Specify the nodes where the add-on pods are deployed. If you do not specify the nodes, the add-on pods will be randomly scheduled based on the default cluster scheduling policy. Specify node pool: Specify the node pool where the add-on pods are deployed. If you do not specify the node pools, the add-on pods will be randomly scheduled based on the default cluster scheduling policy. Customize affinity: Enter the labels of the nodes where the add-on pods are to be deployed for more flexible scheduling policies. If you do not specify node labels, the add-on pods will be randomly scheduled based on the default cluster scheduling policy. If multiple custom affinity policies are configured, ensure that there are nodes that meet all the affinity policies in the cluster. Otherwise, the add-on pods cannot run.
Toleration	Using both taints and tolerations enables (but does not require) the add-on's Deployment pods to be scheduled on nodes with matching taints, and allows control over pod eviction policies when host nodes are tainted. The add-on applies default toleration policies for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints on pods. The time window is 60s. For details, see Configuring Tolerance Policies.

Click Install.

After the add-on is installed, you can choose Settings in the navigation pane, switch to the Scheduling tab, select Volcano, and find the expert mode. You can customize advanced scheduling policies based on actual service scenarios. The following is an example:

admission_kube_api_qps: 200
admissions: /jobs/mutate,/jobs/validate,/podgroups/mutate,/pods/validate,/pods/mutate,/queues/mutate,/queues/validate,/eas/pods/mutate,/eas/pods/validate,/npu/jobs/validate,/resource/validate,/resource/mutate,/workloadbalancer/balancer/validate,/workloadbalancer/balancerpolicytemplate/validate
annotations: {}
colocation_enable: 'false'
controller_kube_api_qps: 200
default_scheduler_conf:
  actions: allocate, backfill, preempt
  metrics:
    interval: 30s
    type: ''
  tiers:
    - plugins:
        - name: priority
        - enableJobStarving: false
          enablePreemptable: false
          name: gang
        - name: conformance
    - plugins:
        - enablePreemptable: false
          name: drf
        - name: predicates
        - name: nodeorder
    - plugins:
        - name: cce-gpu-topology-predicate
        - name: cce-gpu-topology-priority
        - name: xgpu
    - plugins:
        - name: nodelocalvolume
        - name: nodeemptydirvolume
        - name: nodeCSIscheduling
        - name: networkresource
deschedulerPolicy:
  profiles:
    - name: ProfileName
      pluginConfig:
        - args:
            nodeFit: true
          name: DefaultEvictor
        - args:
            evictableNamespaces:
              exclude:
                - kube-system
            thresholds:
              cpu: 20
              memory: 20
          name: HighNodeUtilization
        - args:
            evictableNamespaces:
              exclude:
                - kube-system
            metrics:
              type: prometheus_adaptor
            nodeFit: true
            targetThresholds:
              cpu: 80
              memory: 85
            thresholds:
              cpu: 30
              memory: 30
          name: LoadAware
      plugins:
        balance:
          enabled: null
descheduler_enable: 'false'
deschedulingInterval: 10m
enable_workload_balancer: false
oversubscription_method: nodeResource
oversubscription_profile_period: 300
oversubscription_ratio: 60
recommendation_enable: ''
scheduler_kube_api_qps: 200
update_pod_status_qps: 50
workload_balancer_score_annotation_key: ''
workload_balancer_third_party_types: ''

**Table 4** Advanced Volcano configuration parameters
Function	Parameter	Function	Description
Basic scheduling functions	admission_kube_api_qps	QPS of requests sent by volcano-admission to Kubernetes API server	Default value: 200; parameter type: float
	controller_kube_api_qps	QPS of requests sent by volcano-controller to Kubernetes API server	Default value: 200; parameter type: float
	scheduler_kube_api_qps	QPS of requests sent by volcano-scheduler to Kubernetes API server	Default value: 200; parameter type: float
	update_pod_status_qps	QPS of the requests sent by volcano-scheduler to update pod statuses	Default value: 50; parameter type: float
	default_scheduler_conf	Used to schedule pods. It consists of a series of actions and plugins and features high scalability. You can specify and use actions and plugins as needed.	It consists of: actions: defines the types and sequence of actions to be executed by the scheduler. tiers: configures the plugin list.
	default_scheduler_conf.actions	Actions to be executed in each scheduling phase. The configured action sequence is the scheduler execution sequence. For details, see Actions. The scheduler traverses all jobs to be scheduled and performs actions such as enqueue, allocate, preempt, reclaim, and backfill in the configured sequence to find the most appropriate node for each job.	The following options are supported: enqueue: uses a series of filtering algorithms to filter out tasks to be scheduled and sends them to the queue to wait for scheduling. After this action, the task status changes from pending to inqueue. allocate: selects the most suitable node based on a series of pre-selection and selection algorithms. preempt: performs preemption scheduling for tasks with higher priorities in the same queue based on priority rules. reclaim: retrieves the resources needed by a queue based on its weight when a new task enters the queue for scheduling, but the cluster resources are insufficient to meet the queue's requirements. backfill: schedules tasks in the pending state as much as possible to maximize node resource utilization. Example: actions: 'allocate, backfill, preempt, reclaim' NOTE: When configuring actions, use only one of preempt, reclaim, or enqueue.
	default_scheduler_conf.tier.plugin	Implementation details of algorithms in actions based on different scenarios. For details, see Plugins.	For details, see Table 5.
Descheduling	descheduler_enable	Used to enable descheduling.	This function is disabled by default. Options: true: The function is enabled. false or empty: The function is disabled.
	deschedulerPolicy	Descheduling policy	For details about the parameters, see Table 2.
	deschedulingInterval	Descheduling period	Value range: > 0s; parameter type: time
Cloud native hybrid deployment	colocation_enable	Used to enable cloud native hybrid deployment.	This function is disabled by default. Options: true: The function is enabled. false or empty: The function is disabled.
	oversubscription_method	Method for calculating the oversubscription	nodeResource and podProfile are supported. The default value is nodeResource. nodeResource: calculates the oversubscription based on the node resource usage. podProfile: calculates the oversubscription based on pod profiling. For details, see Resource Oversubscription Based on Pod Profiling.
	oversubscription_ratio	Percentage of idle resource oversubscription of a node	Value range: 1 to 100; parameter type: int For example, 60 indicates that the maximum oversubscription resources on a node are calculated based on 60% x Idle resources on the node.
	oversubscription_profile_period	Period of pod profiling	Value range: 60 to 2592000, in seconds, that is, from 1 minute to 1 month. For pods whose metric collection duration has not reached the specified cycle, the resource usage on the hosting node will be calculated based on the pods' resource requests. When the oversubscription algorithm based on pod profiling is enabled for the first time, the amount of collected data may not be sufficient to cover the entire period. In this case, the oversubscription on the node is temporarily 0 due to lack of initialization data. After the data of the first period is collected, the oversubscription is updated to the actual value.
Application scaling priority policies	enable_workload_balancer	Used to enable cloud native hybrid deployment.	This function is disabled by default. Options: true: The function is enabled. false or empty: The function is disabled.
	workload_balancer_score_annotation_key	Used to specify the score annotation key of a pod.	Only the following parameters are supported: openvessel.io/workload-balancer-score: indicates a pod's score, which is higher if the pod is on a high-priority node. controller.kubernetes.io/pod-deletion-cost: specifies the balancer object that controls the current pod. Pods with low scores are preferentially scaled in. Other values will cause Volcano to exit abnormally.
	workload_balancer_third_party_types	A character string consisting of the group, version, and kind of the third-party workload. CRDs are separated by commas (,).	Example: apps.kruise.io/v1alpha1/clonesets,apps.kruise.io/v1beta1/statefulsets. Note that kind must be in plural form. If the format is incorrect, Volcano will exit abnormally. If the specified CRD is not present in the cluster, the application scaling priority policy cannot work properly.

**Table 5** Supported plugins
Plugins	Function	Description	Demonstration
binpack	Schedule pods to nodes with high resource usage (not allocating pods to light-loaded nodes) to reduce resource fragments.	arguments: binpack.weight: weight of the binpack plugin. binpack.cpu: ratio of CPUs to all resources. The parameter value defaults to 1. binpack.memory: ratio of memory resources to all resources. The parameter value defaults to 1. binpack.resources: other custom resource types requested by the pod, for example, nvidia.com/gpu. Multiple types can be configured and be separated by commas (,). binpack.resources.<your_resource>: weight of your custom resource in all resources. Multiple types of resources can be added. <your_resource> indicates the resource type defined in binpack.resources, for example, binpack.resources.nvidia.com/gpu.	- plugins: - name: binpack arguments: binpack.weight: 10 binpack.cpu: 1 binpack.memory: 1 binpack.resources: nvidia.com/gpu, example.com/foo binpack.resources.nvidia.com/gpu: 2 binpack.resources.example.com/foo: 3
conformance	Prevent key pods, such as the pods in the kube-system namespace from being evicted.	None	- plugins: - name: 'priority' - name: 'gang' enablePreemptable: false - name: 'conformance'
lifecycle	By collecting statistics on service scaling rules, it preferentially schedules pods with similar lifecycles to the same node. Working with horizontal auto scaling of Autoscaler, it then scales in pods and releases resources to reduce costs and improve resource utilization. 1. Collects statistics on the lifecycle of service pods and schedules pods with similar lifecycles to the same node. 2. For a cluster configured with an automatic scaling policy, adjust the scale-in annotation of the nodes to preferentially scale in the ones with low usage.	arguments: lifecycle.WindowSize: The value is an integer greater than or equal to 1 and defaults to 10. It records how many times the number of pods has changed. If the load changes regularly and periodically, decrease the value. If the load changes irregularly and the number of replicas changes frequently, increase the value. If the value is too large, the learning period is prolonged and too many events are recorded. lifecycle.MaxGrade: The value is an integer greater than or equal to 3 and defaults to 3. It indicates levels of replicas. For example, if the value is set to 3, the replicas are classified into three levels. If the load changes regularly and periodically, decrease the value. If the load changes irregularly, increase the value. Setting an excessively small value may result in inaccurate lifecycle forecasts. lifecycle.MaxScore: float64 floating point number. The value must be greater than or equal to 50.0. The default value is 200.0. It indicates the maximum score (equivalent to the weight) of the lifecycle plugin. lifecycle.SaturatedTresh: float64 floating point number. If the value is less than 0.5, use 0.5. If the value is greater than 1, use 1. The default value is 0.8. It specifies the threshold for determining whether the node usage is too high. If the node usage exceeds the threshold, the scheduler preferentially schedules jobs to other nodes.	- plugins: - name: priority - name: gang enablePreemptable: false - name: conformance - name: lifecycle arguments: lifecycle.MaxGrade: 3 lifecycle.MaxScore: 200.0 lifecycle.SaturatedTresh: 0.8 lifecycle.WindowSize: 10 NOTE: For nodes that do not want to be scaled in, manually mark them as long-period nodes and add the annotation volcano.sh/long-lifecycle-node: true to them. For an unmarked node, the lifecycle plugin automatically marks the node based on the lifecycle of the load on the node. The default value of MaxScore is 200.0, which is twice the weight of other plugins. When the lifecycle plugin does not have obvious effect or conflicts with other plugins, disable other plugins or increase the value of MaxScore. After the scheduler is restarted, the lifecycle plugin needs to re-record the load change. The optimal scheduling effect can be achieved only after several periods of statistics are collected.
Gang	Consider a group of pods as a whole for resource allocation. This plugin checks whether the number of scheduled pods in a job meets the minimum requirements for running the job. If yes, all pods in the job will be scheduled. If no, the pods will not be scheduled. NOTE: If a gang scheduling policy is used and the remaining resources in the cluster are greater than or equal to half of the minimum required resources but less than the minimum required resources, Autoscaler scale-outs will not be triggered.	enablePreemptable: true: Preemption is enabled. false: Preemption is not enabled. enableJobStarving: true: Resources are preempted based on the minAvailable setting of jobs. false: Resources are preempted based on job replicas. NOTE: The default value of minAvailable for Kubernetes-native workloads (such as Deployments) is 1. It is a good practice to set enableJobStarving to false. In AI and big data scenarios, you can specify the minAvailable value when creating a vcjob. It is a good practice to set enableJobStarving to true. In Volcano versions earlier than 1.11.5, enableJobStarving is set to true by default. In Volcano versions later than 1.11.5, enableJobStarving is set to false by default.	- plugins: - name: priority - name: gang enablePreemptable: false enableJobStarving: false - name: conformance
priority	Schedule based on custom load priorities.	None	- plugins: - name: priority - name: gang enablePreemptable: false - name: conformance
overcommit	Cluster resources are multiplied by a specific factor before scheduling, enhancing the efficiency of workload queues. If all workloads are Deployments, remove this plugin or set the overcommit factor to 2.0. NOTE: This plugin is supported in Volcano 1.6.5 and later versions.	arguments: overcommit-factor: overcommit factor, which defaults to 1.2.	- plugins: - name: overcommit arguments: overcommit-factor: 2.0
drf	The Dominant Resource Fairness (DRF) scheduling algorithm, which schedules jobs based on their dominant resource share. Jobs with a smaller resource share will be scheduled with a higher priority.	None	- plugins: - name: 'drf' - name: 'predicates' - name: 'nodeorder'
predicates	A common algorithm for node selection. This algorithm includes a series of basic algorithms, such as node affinity, pod affinity, taints and tolerations, node port conflict checks, volume limits, and volume zone matching.	None	- plugins: - name: 'drf' - name: 'predicates' - name: 'nodeorder'
nodeorder	A common algorithm for selecting nodes. Nodes are scored in simulated resource allocation to find the most suitable node for the current job.	Scoring parameters: nodeaffinity.weight: Pods are scheduled based on node affinity. This parameter defaults to 2. podaffinity.weight: Pods are scheduled based on pod affinity. This parameter defaults to 2. leastrequested.weight: Pods are scheduled to the node with the least requested resources. This parameter defaults to 1. balancedresource.weight: Pods are scheduled to the node with balanced resource allocation. This parameter defaults to 1. mostrequested.weight: Pods are scheduled to the node with the most requested resources. This parameter defaults to 0. tainttoleration.weight: Pods are scheduled to the node with a high taint tolerance. This parameter defaults to 3. imagelocality.weight: Pods are scheduled to the node where the required images exist. This parameter defaults to 1. podtopologyspread.weight: Pods are scheduled based on the pod topology. This parameter defaults to 2.	- plugins: - name: nodeorder arguments: leastrequested.weight: 1 mostrequested.weight: 0 nodeaffinity.weight: 2 podaffinity.weight: 2 balancedresource.weight: 1 tainttoleration.weight: 3 imagelocality.weight: 1 podtopologyspread.weight: 2
cce-gpu-topology-predicate	GPU-topology scheduling preselection algorithm	None	- plugins: - name: 'cce-gpu-topology-predicate' - name: 'cce-gpu-topology-priority' - name: 'xgpu'
cce-gpu-topology-priority	GPU-topology scheduling priority algorithm	None	- plugins: - name: 'cce-gpu-topology-predicate' - name: 'cce-gpu-topology-priority' - name: 'xgpu'
cce-gpu	GPU resource allocation that supports decimal GPU configurations by working with the CCE AI Suite (NVIDIA GPU) add-on. NOTE: The add-on of version 1.10.5 or later does not support this configuration. Use xGPU instead. The prerequisite for configuring decimal GPUs is that the GPU resources on the GPU nodes can be shared by pods. For details about how to check whether GPU sharing is disabled in the cluster, see the enable-gpu-share parameter in Modifying Cluster Configurations.	None	- plugins: - name: 'cce-gpu-topology-predicate' - name: 'cce-gpu-topology-priority' - name: 'cce-gpu'
xgpu	Allocate virtual GPUs.	None	- plugins: - name: 'cce-gpu-topology-predicate' - name: 'cce-gpu-topology-priority' - name: 'xgpu'
numa-aware	NUMA affinity scheduling. For details, see NUMA Affinity Scheduling.	arguments: weight: weight of the numa-aware plugin	- plugins: - name: 'nodelocalvolume' - name: 'nodeemptydirvolume' - name: 'nodeCSIscheduling' - name: 'networkresource' arguments: NetworkType: vpc-router - name: numa-aware arguments: weight: 10
networkresource	Filter out nodes that require network interfaces. The parameters are transferred by CCE and do not need to be manually configured.	arguments: NetworkType: network type (eni or vpc-router)	- plugins: - name: 'nodelocalvolume' - name: 'nodeemptydirvolume' - name: 'nodeCSIscheduling' - name: 'networkresource' arguments: NetworkType: vpc-router
nodelocalvolume	Filter out nodes that do not meet local volume requirements.	None	- plugins: - name: 'nodelocalvolume' - name: 'nodeemptydirvolume' - name: 'nodeCSIscheduling' - name: 'networkresource'
nodeemptydirvolume	Filter out nodes that do not meet the emptyDir requirements.	None	- plugins: - name: 'nodelocalvolume' - name: 'nodeemptydirvolume' - name: 'nodeCSIscheduling' - name: 'networkresource'
nodeCSIscheduling	Filter out nodes with malfunctional Everest.	None	- plugins: - name: 'nodelocalvolume' - name: 'nodeemptydirvolume' - name: 'nodeCSIscheduling' - name: 'networkresource'

Components

**Table 6** Add-on components
Component	Description	Resource Type
volcano-scheduler	Schedule pods.	Deployment
volcano-controller	Synchronize CRDs.	Deployment
volcano-admission	Webhook server, which verifies and modifies resources such as pods and jobs.	Deployment
volcano-agent	Cloud native hybrid agent, which is used for node QoS assurance, CPU burst, and dynamic resource oversubscription.	DaemonSet
resource-exporter	Report the NUMA topology information of nodes.	DaemonSet
volcano-descheduler	Reschedule pods in a cluster. After descheduling is enabled, pods will be automatically deployed on nodes.	Deployment
volcano-recommender	Generate recommendations for CPU and memory requests based on the historical CPU and memory usage of a container.	Deployment
volcano-recommender-prometheus-adapter	Collect historical CPU and memory metrics of containers from Prometheus.	Deployment
kthena-controller-manager	Deploy and manage foundation model inference workloads in production environments.	Deployment

Modifying the volcano-scheduler Configurations Using the Console

volcano-scheduler is a pod scheduling component, which consists of a series of actions and plugins. Actions should be executed in every step. Plugins provide the action algorithm details in different scenarios. volcano-scheduler is highly scalable. You can specify actions and plugins as needed.

After the add-on is installed, you can choose Settings in the navigation pane, switch to the Scheduling tab, and configure the basic scheduling capabilities. You can also use the expert mode of Volcano Scheduler to customize advanced scheduling policies based on service scenarios.

This section describes how to configure volcano-scheduler.

Only Volcano Scheduler of 1.7.1 and later support this function.

Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Settings and click the Scheduling tab. In the Default Cluster Scheduler area, select Volcano, find the expert mode, and click Try Now.

Click to enlarge

Using resource_exporter:

...
    "default_scheduler_conf": {
        "actions": "allocate, backfill, preempt",
        "tiers": [
            {
                "plugins": [
                    {
                        "name": "priority"
                    },
                    {
                        "name": "gang"
                    },
                    {
                        "name": "conformance"
                    }
                ]
            },
            {
                "plugins": [
                    {
                        "name": "drf"
                    },
                    {
                        "name": "predicates"
                    },
                    {
                        "name": "nodeorder"
                    }
                ]
            },
            {
                "plugins": [
                    {
                        "name": "cce-gpu-topology-predicate"
                    },
                    {
                        "name": "cce-gpu-topology-priority"
                    },
                    {
                        "name": "cce-gpu"
                    },
                    {
                        "name": "numa-aware" # add this also enable resource_exporter
                    }
                ]
            },
            {
                "plugins": [
                    {
                        "name": "nodelocalvolume"
                    },
                    {
                        "name": "nodeemptydirvolume"
                    },
                    {
                        "name": "nodeCSIscheduling"
                    },
                    {
                        "name": "networkresource"
                    }
                ]
            }
        ]
    },
...

After this function is enabled, you can use the functions of both numa-aware and resource_exporter.

Collecting Prometheus Metrics

volcano-scheduler exposes Prometheus metrics through port 8080. You can build a Prometheus collector to identify and obtain volcano-scheduler scheduling metrics from http://{{volcano-schedulerPodIP}}:{{volcano-schedulerPodPort}}/metrics.

Prometheus metrics can be exposed only by the Volcano Scheduler add-on of version 1.8.5 or later.

**Table 7** Key metrics
Metric	Type	Description	Label
e2e_scheduling_latency_milliseconds	Histogram	E2E scheduling latency (ms) (scheduling algorithm + binding)	None
e2e_job_scheduling_latency_milliseconds	Histogram	E2E job scheduling latency (ms)	None
e2e_job_scheduling_duration	Gauge	E2E job scheduling duration	labels=["job_name", "queue", "job_namespace"]
plugin_scheduling_latency_microseconds	Histogram	Add-on scheduling latency (µs)	labels=["plugin", "OnSession"]
action_scheduling_latency_microseconds	Histogram	Action scheduling latency (µs)	labels=["action"]
task_scheduling_latency_milliseconds	Histogram	Task scheduling latency (ms)	None
schedule_attempts_total	Counter	Number of pod scheduling attempts. unschedulable indicates that the pods cannot be scheduled, and error indicates that the internal scheduler is faulty.	labels=["result"]
pod_preemption_victims	Gauge	Number of selected preemption victims	None
total_preemption_attempts	Counter	Total number of preemption attempts in a cluster	None
unschedule_task_count	Gauge	Number of unschedulable tasks	labels=["job_id"]
unschedule_job_count	Gauge	Number of unschedulable jobs	None
job_retry_counts	Counter	Number of job retries	labels=["job_id"]

Uninstalling the Volcano Scheduler Add-on

After the add-on is uninstalled, all custom Volcano resources (Table 8) will be deleted, including the created resources. Reinstalling the add-on will not inherit or restore the tasks before the uninstallation. It is a good practice to uninstall the Volcano Scheduler add-on only when no custom Volcano resources are being used in the cluster.

**Table 8** Custom Volcano resources
Item	API Group	API Version	Resource Level
Command	bus.volcano.sh	v1alpha1	Namespace
Job	batch.volcano.sh	v1alpha1	Namespace
Numatopology	nodeinfo.volcano.sh	v1alpha1	Cluster
PodGroup	scheduling.volcano.sh	v1beta1	Namespace
Queue	scheduling.volcano.sh	v1beta1	Cluster
BalancerPolicyTemplate	autoscaling.volcano.sh	v1alpha1	Cluster
Balancer	autoscaling.volcano.sh	v1alpha1	Cluster

BalancerPolicyTemplate and Balancer resources are created only after the application scaling priority policies are enabled. For details, see Application Scaling Priority Policies.

Related Operations

Release History

It is a good practice to upgrade Volcano Scheduler to the latest version that is supported by the cluster.

**Table 9** Volcano Scheduler updates
Add-on Version	Supported Cluster Version	New Feature
1.22.11	v1.29 v1.30 v1.31 v1.32 v1.33 v1.34 v1.35 v1.36	Supported Kubernetes v1.36.
1.22.5	v1.29 v1.30 v1.31 v1.32 v1.33 v1.34 v1.35	Enhanced multi-tier network topology-aware affinity scheduling. Enhanced multi-dimensional gang scheduling. Optimized online/offline hybrid deployment.
1.21.25	v1.29 v1.30 v1.31 v1.32 v1.33 v1.34 v1.35	Supported Kubernetes v1.35.
1.21.7	v1.28 v1.29 v1.30 v1.31 v1.32 v1.33 v1.34	Optimized the multi-level hypernode bin packing capability.
1.20.20	v1.28 v1.29 v1.30 v1.31 v1.32 v1.33 v1.34	Fixed some issues.
1.20.6	v1.28 v1.29 v1.30 v1.31 v1.32 v1.33 v1.34	Supported multi-dimensional parallelism and collaborative scheduling across multi-level network topologies for training tasks. Supported for dynamic resource sharing and reclamation between queues. Supported hierarchical queue management.
1.19.23	v1.28 v1.29 v1.30 v1.31 v1.32 v1.33 v1.34	Supported CCE clusters v1.34.
1.19.13	v1.28 v1.29 v1.30 v1.31 v1.32 v1.33	Added support for pod compaction in Ascend dual-die NPU topology scenarios. Added support for two-level group network topology-aware scheduling in training jobs.
1.19.6	v1.27 v1.28 v1.29 v1.30 v1.31 v1.32 v1.33	Added support for scalable scheduling of logical NPU pools. Added support for two-level group network topology-aware scheduling in training jobs.
1.18.15	v1.27 v1.28 v1.29 v1.30 v1.31 v1.32 v1.33	Supported CCE clusters v1.33.
1.18.3	v1.27 v1.28 v1.29 v1.30 v1.31 v1.32	Supported pod compaction based on NPU resources. Supported multi-xGPU preemption.
1.17.11	v1.25 v1.27 v1.28 v1.29 v1.30 v1.31 v1.32	Supported CCE clusters v1.32.
1.16.17	v1.25 v1.27 v1.28 v1.29 v1.30 v1.31	Supported even scheduling in virtual GPUs.
1.16.8	v1.25 v1.27 v1.28 v1.29 v1.30 v1.31	Supported CCE clusters v1.31. Optimized the resource scheduling capability of hypernodes.
1.15.11	v1.23 v1.25 v1.27 v1.28 v1.29 v1.30	Fixed some issues.
1.15.10	v1.23 v1.25 v1.27 v1.28 v1.29 v1.30	Supported tor packing and scheduling. Optimized the NPU dual-die topology scheduling.
1.15.8	v1.23 v1.25 v1.27 v1.28 v1.29 v1.30	Supported the NPU dual-die affinity scheduling.
1.15.6	v1.23 v1.25 v1.27 v1.28 v1.29 v1.30	Resources can be oversubscribed based on pod profiling.
1.14.11	v1.23 v1.25 v1.27 v1.28 v1.29 v1.30	The hypernode resource scheduling model (HyperJob) is available. Supported hypernode affinity scheduling. Supported Kubernetes v1.30.
1.13.7	v1.21 v1.23 v1.25 v1.27 v1.28 v1.29	Network interfaces can be pre-bound for scheduling. The resource oversubscription ratio can be customized.
1.13.3	v1.21 v1.23 v1.25 v1.27 v1.28 v1.29	Supported scale-in of custom resources based on node priorities. Optimized the association between preemption and node scale-out.
1.13.1	v1.21 v1.23 v1.25 v1.27 v1.28 v1.29	Optimized scheduler memory usage.
1.12.18	v1.21 v1.23 v1.25 v1.27 v1.28 v1.29	Supported CCE clusters v1.29. The preemption function is enabled by default.
1.12.1	v1.19.16 v1.21 v1.23 v1.25 v1.27 v1.28	Optimized application auto scaling performance.
1.11.21	v1.19.16 v1.21 v1.23 v1.25 v1.27 v1.28	Supported Kubernetes v1.28. Supported load-aware scheduling. Changed the image OS to Huawei Cloud EulerOS 2.0. Optimized CSI resource preemption. Optimized load-aware rescheduling. Optimized preemption in hybrid deployment scenarios.
1.11.6	v1.19.16 v1.21 v1.23 v1.25 v1.27	Supported Kubernetes v1.27. Supported rescheduling. Supported affinity scheduling of nodes in node pools. Optimized the scheduling performance.
1.10.7	v1.19.16 v1.21 v1.23 v1.25	Fixed the issue where the local PV add-on fails to calculate the number of pods pre-bound to nodes.
1.10.5	v1.19.16 v1.21 v1.23 v1.25	The volcano agent supports resource oversubscription. Added the verification admission for GPUs. The value of nvidia.com/gpu must be less than 1 or a positive integer, and the value of volcano.sh/gpu-core.percentage must be a multiple of 5 and less than 100. Fixed the issue where pod scheduling is slow when PVC binding fails. Fixed the issue where newly added pods cannot run when there are terminating pods on a node for a long time. Fixed the issue where Volcano restarts during concurrent PVC creation or mounting to pods.
1.9.1	v1.19.16 v1.21 v1.23 v1.25	Fixed the issue where the counting pipeline pod of the networkresource plugin occupies supplementary network interfaces. Fixed the issue where the binpack plugin scores nodes with insufficient resources. Fixed the issue of processing resources in the pod with an unknown end status. Optimized event output. Supported HA deployment by default.
1.7.2	v1.19.16 v1.21 v1.23 v1.25	Supported clusters v1.25. Improved the scheduling performance of Volcano.
1.7.1	v1.19.16 v1.21 v1.23 v1.25	Supported clusters v1.25.
1.4.7	v1.15 v1.17 v1.19 v1.21	Deleted the pod status Undetermined to adapt to cluster Autoscaler.
1.4.5	v1.17 v1.19 v1.21	Changed the deployment mode of volcano-scheduler from StatefulSet to Deployment, and fixed the issue where pods cannot be automatically migrated when the node is abnormal.
1.4.2	v1.15 v1.17 v1.19 v1.21	Resolved the issue that cross-GPU allocation fails in certain scenarios. Supported the updated EAS API.
1.3.7	v1.15 v1.17 v1.19 v1.21	Supported hybrid deployment of online and offline jobs and resource oversubscription. Optimized the scheduling throughput for clusters. Fixed the issue where the scheduler panics in certain scenarios. Fixed the issue where the volumes.secret verification of the volcano job in the CCE clusters v1.15 fails. Fixed the issue where jobs fail to be scheduled when volumes are mounted.
1.3.3	v1.15 v1.17 v1.19 v1.21	Fixed the scheduler crash due to a GPU error and a privileged init container admission failure.
1.3.1	v1.15 v1.17 v1.19	Upgraded the Volcano framework to the latest version. Supported Kubernetes v1.19. Added the numa-aware plugin. Fixed the Deployment scaling issue in the multi-queue scenario. Adjusted the algorithm plugin that is enabled by default.
1.2.5	v1.15 v1.17 v1.19	Fixed the OutOfcpu issue in some scenarios. Fixed the issue where pods cannot be scheduled when some capabilities are set for a queue. Made the log time of the volcano component consistent with the system time. Fixed the issue of preemption between multiple queues. Fixed the issue where the ioaware plugin cannot behave as expected in some scenarios. Supported hybrid clusters.
1.2.3	v1.15 v1.17 v1.19	Fixed the training task OOM issue caused by inaccuracy. Fixed the GPU scheduling issue in CCE clusters v1.15 and later versions. Rolling upgrades of CCE versions during task distribution are not supported. Fixed the issue where the queue status is unknown in certain scenarios. Fixed the issue where a panic occurred when a PVC is mounted to a job in a specific scenario. Fixed the issue where decimals cannot be configured for GPU jobs. Added the ioaware plugin. Added ring controller.