Volcano Scheduler
Introduction
Volcano is a batch processing platform based on Kubernetes. It provides a series of features required by machine learning, deep learning, bioinformatics, genomics, and other big data applications, as a powerful supplement to Kubernetes capabilities.
Volcano provides general computing capabilities such as high-performance job scheduling, heterogeneous chip management, and job running management. It interfaces with computing frameworks used in industries such as AI, big data, genomics, and rendering, and schedules up to 1000 pods per second for end users, greatly improving scheduling efficiency and resource utilization.
Volcano provides job scheduling, job management, and queue management for computing applications. Its main features are as follows:
- Diverse computing frameworks, such as TensorFlow, MPI, and Spark, can run on Kubernetes in containers. Common APIs for batch computing jobs through CRD, various plugins, and advanced job lifecycle management are provided.
- Advanced scheduling capabilities are provided for batch computing and high-performance computing scenarios, including group scheduling, preemptive priority scheduling, packing, resource reservation, and task topology.
- Queues can be effectively managed for scheduling jobs. Complex job scheduling capabilities such as queue priority and multi-level queues are supported.
Volcano is open source and available on GitHub at https://github.com/volcano-sh/volcano.
Install and configure the Volcano add-on in CCE clusters. For details, see Volcano Scheduling.
When using Volcano as a scheduler, use it to schedule all workloads in the cluster. This prevents resource scheduling conflicts caused by multiple schedulers working at the same time.
Installing the Add-on
- Log in to the CCE console and click the cluster name to access the cluster console. Choose Add-ons in the navigation pane, locate Volcano Scheduler on the right, and click Install.
- On the Install Add-on page, configure the specifications.
Table 1 Add-on configuration Parameter
Description
Add-on Specifications
Select Standalone, Custom, or HA for Add-on Specifications.
Pods
Number of pods that will be created to match the selected add-on specifications.
If you selected Custom, you can adjust the number of pods as needed.
High availability is not possible with a single pod. If an error occurs on the node where the add-on instance runs, the add-on will fail.
Containers
CPU and memory quotas of the container allowed for the selected add-on specifications.
If you select Custom, the recommended values for volcano-controller and volcano-scheduler are as follows:
- If the number of nodes is less than 100, retain the default configuration. The requested vCPUs are 500m, and the limit is 2000m. The requested memory is 500 MiB, and the limit is 2000 MiB.
- If the number of nodes is greater than 100, increase the requested vCPUs by 500m and the requested memory by 1000 MiB each time 100 nodes (10,000 pods) are added. Increase the vCPU limit by 1500m and the memory limit by 1000 MiB.
NOTE:
Recommended formula for calculating the requested value:
- Requested vCPUs: Multiply the number of target nodes by the number of target pods, find the closest value of (number of nodes in the cluster x number of pods) in Table 2 by interpolation, and round up to the request and limit values of the nearest specification.
For example, for 2000 nodes and 20,000 pods, Number of target nodes x Number of target pods = 40 million, which is close to the specification of 700/70,000 (Number of cluster nodes x Number of pods = 49 million). According to the following table, set the requested vCPUs to 4000m and the limit value to 5500m.
- Requested memory: It is recommended that 2.4 GiB memory be allocated to every 1000 nodes and 1 GiB memory be allocated to every 10,000 pods. The requested memory is the sum of these two values. (The obtained value may be different from the recommended value in Table 2. You can use either of them.)
Requested memory = Number of target nodes/1000 x 2.4 GiB + Number of target pods/10,000 x 1 GiB
For example, for 2000 nodes and 20,000 pods, the requested memory is 6.8 GiB (2000/1000 x 2.4 GiB + 20,000/10,000 x 1 GiB).
Table 2 Recommended values for volcano-controller and volcano-scheduler
Nodes/Pods in a Cluster | Requested vCPUs (m) | vCPU Limit (m) | Requested Memory (MiB) | Memory Limit (MiB) |
---|---|---|---|---|
50/5000 | 500 | 2000 | 500 | 2000 |
100/10,000 | 1000 | 2500 | 1500 | 2500 |
200/20,000 | 1500 | 3000 | 2500 | 3500 |
300/30,000 | 2000 | 3500 | 3500 | 4500 |
400/40,000 | 2500 | 4000 | 4500 | 5500 |
500/50,000 | 3000 | 4500 | 5500 | 6500 |
600/60,000 | 3500 | 5000 | 6500 | 7500 |
700/70,000 | 4000 | 5500 | 7500 | 8500 |
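The sizing rules above can be sketched in a few lines of Python. This is only an illustration, not part of the add-on: the function names are hypothetical, "round up" is assumed to mean the first Table 2 row whose nodes x pods product is at least the target, and targets larger than the last row are assumed to fall back to the last row (the source does not say what to do in that case).

```python
from bisect import bisect_left

# Rows of Table 2: (nodes, pods, requested vCPUs in m, vCPU limit in m).
TABLE_2 = [
    (50, 5_000, 500, 2000),
    (100, 10_000, 1000, 2500),
    (200, 20_000, 1500, 3000),
    (300, 30_000, 2000, 3500),
    (400, 40_000, 2500, 4000),
    (500, 50_000, 3000, 4500),
    (600, 60_000, 3500, 5000),
    (700, 70_000, 4000, 5500),
]


def recommend_vcpu(target_nodes, target_pods):
    """Return (requested vCPUs, vCPU limit) in millicores.

    Assumption: pick the first Table 2 row whose nodes x pods product is
    greater than or equal to the target product; beyond the table, reuse
    the last row.
    """
    target = target_nodes * target_pods
    products = [nodes * pods for nodes, pods, _, _ in TABLE_2]
    index = min(bisect_left(products, target), len(TABLE_2) - 1)
    _, _, request_m, limit_m = TABLE_2[index]
    return request_m, limit_m


def recommend_memory_gib(target_nodes, target_pods):
    """Requested memory: 2.4 GiB per 1000 nodes plus 1 GiB per 10,000 pods."""
    return target_nodes / 1000 * 2.4 + target_pods / 10_000 * 1.0


# Worked example from the text: 2000 nodes and 20,000 pods.
cpu_request, cpu_limit = recommend_vcpu(2000, 20_000)
memory_gib = recommend_memory_gib(2000, 20_000)
print(cpu_request, cpu_limit, round(memory_gib, 1))
```

For the 2000-node, 20,000-pod example, the lookup lands on the 700/70,000 row and the memory formula gives about 6.8 GiB, matching the values worked out in the note.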
- Configure the add-on parameters.
- Application Scaling Priority Policy: After this function is enabled, application scale-in is performed based on the default priority policy and customized policies. If application scale-out is required, you need to set the default scheduler of the cluster to volcano. For details, see Application Scaling Priority Policies.
- Advanced Settings: You can configure the default scheduler parameters. For details, see Table 4.
Example:
colocation_enable: ''
default_scheduler_conf:
  actions: 'allocate, backfill, preempt'
  tiers:
  - plugins:
    - name: 'priority'
    - name: 'gang'
    - name: 'conformance'
    - name: 'lifecycle'
      arguments:
        lifecycle.MaxGrade: 10
        lifecycle.MaxScore: 200.0
        lifecycle.SaturatedTresh: 1.0
        lifecycle.WindowSize: 10
  - plugins:
    - name: 'drf'
    - name: 'predicates'
    - name: 'nodeorder'
  - plugins:
    - name: 'cce-gpu-topology-predicate'
    - name: 'cce-gpu-topology-priority'
    - name: 'cce-gpu'
  - plugins:
    - name: 'nodelocalvolume'
    - name: 'nodeemptydirvolume'
    - name: 'nodeCSIscheduling'
    - name: 'networkresource'
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 60
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 60
Table 3 Advanced Volcano configuration parameters Plugin
Function
Description
Demonstration
colocation_enable
Whether to enable hybrid deployment.
Value:
- true: hybrid enabled
- false: hybrid disabled
None
default_scheduler_conf
Used to schedule pods. It consists of a series of actions and plugins and features high scalability. You can specify and implement actions and plugins based on your requirements.
It consists of actions and tiers.
- actions: defines the types and sequence of actions to be executed by the scheduler.
- tiers: configures the plugin list.
None
actions
Actions to be executed in each scheduling phase. The configured action sequence is the scheduler execution sequence. For details, see Actions.
The scheduler traverses all jobs to be scheduled and performs actions such as enqueue, allocate, preempt, and backfill in the configured sequence to find the most appropriate node for each job.
The following options are supported:
- enqueue: uses a series of filtering algorithms to filter out tasks to be scheduled and sends them to the queue to wait for scheduling. After this action, the task status changes from pending to inqueue.
- allocate: selects the most suitable node based on a series of pre-selection and selection algorithms.
- preempt: performs preemption scheduling for tasks with higher priorities in the same queue based on priority rules.
- backfill: schedules pending tasks as much as possible to maximize the utilization of node resources.
actions: 'allocate, backfill, preempt'
NOTE: When configuring actions, use either preempt or enqueue.
plugins
Implementation details of algorithms in actions based on different scenarios. For details, see Plugins.
For details, see Table 4.
None
tolerations
Tolerance of the add-on to node taints.
By default, the add-on can run on nodes that have the node.kubernetes.io/not-ready or node.kubernetes.io/unreachable taint with the NoExecute effect, but it will be evicted after 60 seconds.
tolerations:
- effect: NoExecute
  key: node.kubernetes.io/not-ready
  operator: Exists
  tolerationSeconds: 60
- effect: NoExecute
  key: node.kubernetes.io/unreachable
  operator: Exists
  tolerationSeconds: 60
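The rules for the actions parameter in Table 3 can be checked before the value is pasted into the console. The following is a minimal sketch under stated assumptions: validate_actions is a hypothetical helper, not a Volcano API, and it enforces only the two rules given in the table (the four supported options, and using either preempt or enqueue but not both).

```python
from typing import List

# Options listed for the actions parameter in Table 3.
SUPPORTED_ACTIONS = ("enqueue", "allocate", "preempt", "backfill")


def validate_actions(actions: str) -> List[str]:
    """Split a comma-separated actions string and check it against Table 3.

    Returns the action names in their configured order, which is also the
    order in which the scheduler executes them.
    """
    names = [name.strip() for name in actions.split(",") if name.strip()]
    unknown = [name for name in names if name not in SUPPORTED_ACTIONS]
    if unknown:
        raise ValueError(f"unsupported actions: {unknown}")
    if "preempt" in names and "enqueue" in names:
        raise ValueError("use either preempt or enqueue, not both")
    return names


print(validate_actions("allocate, backfill, preempt"))
```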
Table 4 Supported plugins Plugin
Function
Description
Demonstration
binpack
Schedule pods to nodes with high resource usage (not allocating pods to light-loaded nodes) to reduce resource fragments.
arguments:
- binpack.weight: weight of the binpack plugin.
- binpack.cpu: ratio of CPUs to all resources. The parameter value defaults to 1.
- binpack.memory: ratio of memory resources to all resources. The parameter value defaults to 1.
- binpack.resources: other custom resource types requested by the pod, for example, nvidia.com/gpu. Multiple types can be configured and be separated by commas (,).
- binpack.resources.<your_resource>: weight of your custom resource in all resources. Multiple types of resources can be added. <your_resource> indicates the resource type defined in binpack.resources, for example, binpack.resources.nvidia.com/gpu.
- plugins:
  - name: binpack
    arguments:
      binpack.weight: 10
      binpack.cpu: 1
      binpack.memory: 1
      binpack.resources: nvidia.com/gpu, example.com/foo
      binpack.resources.nvidia.com/gpu: 2
      binpack.resources.example.com/foo: 3
conformance
Prevent key pods, such as the pods in the kube-system namespace, from being preempted.
None
- plugins:
  - name: 'priority'
  - name: 'gang'
    enablePreemptable: false
  - name: 'conformance'
lifecycle
By collecting statistics on service scaling rules, pods with similar lifecycles are preferentially scheduled to the same node. With the horizontal scaling capability of the Autoscaler, resources can be quickly scaled in and released, reducing costs and improving resource utilization.
1. Collects statistics on the lifecycle of pods in the service load and schedules pods with similar lifecycles to the same node.
2. For a cluster configured with an automatic scaling policy, adjust the scale-in annotation of the node to preferentially scale in the node with low usage.
arguments:
- lifecycle.WindowSize: The value is an integer greater than or equal to 1 and defaults to 10.
Record the number of times that the number of replicas changes. If the load changes regularly and periodically, decrease the value. If the load changes irregularly and the number of replicas changes frequently, increase the value. If the value is too large, the learning period is prolonged and too many events are recorded.
- lifecycle.MaxGrade: The value is an integer greater than or equal to 3 and defaults to 3.
It indicates levels of replicas. For example, if the value is set to 3, the replicas are classified into three levels. If the load changes regularly and periodically, decrease the value. If the load changes irregularly, increase the value. Setting an excessively small value may result in inaccurate lifecycle forecasts.
- lifecycle.MaxScore: float64 floating point number. The value must be greater than or equal to 50.0. The default value is 200.0.
Maximum score (equivalent to the weight) of the lifecycle plugin.
- lifecycle.SaturatedTresh: float64 floating point number. If the value is less than 0.5, use 0.5. If the value is greater than 1, use 1. The default value is 0.8.
Threshold for determining whether the node usage is too high. If the node usage exceeds the threshold, the scheduler preferentially schedules jobs to other nodes.
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
  - name: conformance
  - name: lifecycle
    arguments:
      lifecycle.MaxGrade: 3
      lifecycle.MaxScore: 200.0
      lifecycle.SaturatedTresh: 0.8
      lifecycle.WindowSize: 10
NOTE:
- For nodes that you do not want to be scaled in, manually mark them as long-period nodes by adding the annotation volcano.sh/long-lifecycle-node: true to them. For an unmarked node, the lifecycle plugin automatically marks the node based on the lifecycle of the load on the node.
- The default value of MaxScore is 200.0, which is twice the weight of other plugins. When the lifecycle plugin does not have obvious effect or conflicts with other plugins, disable other plugins or increase the value of MaxScore.
- After the scheduler is restarted, the lifecycle plugin needs to re-record the load change. The optimal scheduling effect can be achieved only after several periods of statistics are collected.
gang
Consider a group of pods as a whole for resource allocation. This plugin checks whether the number of scheduled pods in a job meets the minimum requirements for running the job. If yes, all pods in the job will be scheduled. If no, the pods will not be scheduled.
NOTE: When a gang scheduling policy is used, Autoscaler scale-out is not triggered if the remaining resources in the cluster are greater than or equal to half of the minimum resources required to run a job but less than that minimum.
- enablePreemptable:
- true: Preemption enabled
- false: Preemption not enabled
- enableJobStarving:
- true: Resources are preempted based on the minAvailable setting of jobs.
- false: Resources are preempted based on job replicas.
NOTE:
- The default value of minAvailable for Kubernetes-native workloads (such as Deployments) is 1. It is a good practice to set enableJobStarving to false.
- In AI and big data scenarios, you can specify the minAvailable value when creating a vcjob. It is a good practice to set enableJobStarving to true.
- In Volcano versions earlier than v1.11.5, enableJobStarving is set to true by default. In Volcano versions later than v1.11.5, enableJobStarving is set to false by default.
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
    enableJobStarving: false
  - name: conformance
priority
Schedule based on custom load priorities.
None
- plugins:
  - name: priority
  - name: gang
    enablePreemptable: false
  - name: conformance
overcommit
Cluster resources are multiplied by an inflation factor before scheduling to improve workload enqueuing efficiency. If all workloads are Deployments, remove this plugin or set the inflation factor to 2.0.
NOTE: This plugin is supported in Volcano 1.6.5 and later versions.
arguments:
- overcommit-factor: inflation factor, which defaults to 1.2.
- plugins:
  - name: overcommit
    arguments:
      overcommit-factor: 2.0
drf
The Dominant Resource Fairness (DRF) scheduling algorithm, which schedules jobs based on their dominant resource share. Jobs with a smaller resource share will be scheduled with a higher priority.
None
- plugins:
  - name: 'drf'
  - name: 'predicates'
  - name: 'nodeorder'
predicates
Determine whether a task can be bound to a node by using a series of evaluation algorithms, such as node/pod affinity, taint tolerance, node repetition, volume limits, and volume zone matching.
None
- plugins:
  - name: 'drf'
  - name: 'predicates'
  - name: 'nodeorder'
nodeorder
A common algorithm for selecting nodes. Nodes are scored in simulated resource allocation to find the most suitable node for the current job.
Scoring parameters:
- nodeaffinity.weight: Pods are scheduled based on node affinity. This parameter defaults to 2.
- podaffinity.weight: Pods are scheduled based on pod affinity. This parameter defaults to 2.
- leastrequested.weight: Pods are scheduled to the node with the least requested resources. This parameter defaults to 1.
- balancedresource.weight: Pods are scheduled to the node with balanced resource allocation. This parameter defaults to 1.
- mostrequested.weight: Pods are scheduled to the node with the most requested resources. This parameter defaults to 0.
- tainttoleration.weight: Pods are scheduled to the node with a high taint tolerance. This parameter defaults to 3.
- imagelocality.weight: Pods are scheduled to the node where the required images exist. This parameter defaults to 1.
- podtopologyspread.weight: Pods are scheduled based on the pod topology. This parameter defaults to 2.
- plugins:
  - name: nodeorder
    arguments:
      leastrequested.weight: 1
      mostrequested.weight: 0
      nodeaffinity.weight: 2
      podaffinity.weight: 2
      balancedresource.weight: 1
      tainttoleration.weight: 3
      imagelocality.weight: 1
      podtopologyspread.weight: 2
cce-gpu-topology-predicate
GPU-topology scheduling preselection algorithm
None
- plugins:
  - name: 'cce-gpu-topology-predicate'
  - name: 'cce-gpu-topology-priority'
  - name: 'cce-gpu'
cce-gpu-topology-priority
GPU-topology scheduling priority algorithm
None
- plugins:
  - name: 'cce-gpu-topology-predicate'
  - name: 'cce-gpu-topology-priority'
  - name: 'cce-gpu'
cce-gpu
GPU resource allocation that supports decimal GPU configurations by working with the gpu add-on.
NOTE:
- The Volcano add-on of version 1.10.5 or later does not support this plugin. Use xGPU instead.
- The prerequisite for configuring decimal GPUs is that the GPU nodes in the cluster are in shared mode. For details about how to check whether GPU sharing is disabled in the cluster, see the enable-gpu-share parameter in Modifying Cluster Configurations.
None
- plugins:
  - name: 'cce-gpu-topology-predicate'
  - name: 'cce-gpu-topology-priority'
  - name: 'cce-gpu'
numa-aware
NUMA affinity scheduling. For details, see NUMA Affinity Scheduling.
arguments:
- weight: weight of the numa-aware plugin
- plugins:
  - name: 'nodelocalvolume'
  - name: 'nodeemptydirvolume'
  - name: 'nodeCSIscheduling'
  - name: 'networkresource'
    arguments:
      NetworkType: vpc-router
  - name: numa-aware
    arguments:
      weight: 10
networkresource
Preselect and filter nodes based on the ENI requirements of pods. The parameters are passed in by CCE and do not need to be configured manually.
arguments:
- NetworkType: network type (eni or vpc-router)
- plugins:
  - name: 'nodelocalvolume'
  - name: 'nodeemptydirvolume'
  - name: 'nodeCSIscheduling'
  - name: 'networkresource'
    arguments:
      NetworkType: vpc-router
nodelocalvolume
Filter out nodes that do not meet local volume requirements.
None
- plugins:
  - name: 'nodelocalvolume'
  - name: 'nodeemptydirvolume'
  - name: 'nodeCSIscheduling'
  - name: 'networkresource'
nodeemptydirvolume
Filter out nodes that do not meet the emptyDir requirements.
None
- plugins:
  - name: 'nodelocalvolume'
  - name: 'nodeemptydirvolume'
  - name: 'nodeCSIscheduling'
  - name: 'networkresource'
nodeCSIscheduling
Filter out nodes on which Everest is malfunctioning.
None
- plugins:
  - name: 'nodelocalvolume'
  - name: 'nodeemptydirvolume'
  - name: 'nodeCSIscheduling'
  - name: 'networkresource'
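As an illustration of the value ranges that Table 4 gives for the lifecycle plugin arguments, the following Python sketch fills in the documented defaults and applies the documented clamping of lifecycle.SaturatedTresh. The helper name is hypothetical, and raising an error for an out-of-range WindowSize, MaxGrade, or MaxScore is an assumption; the table states the valid ranges but not how the scheduler itself reacts to invalid values.

```python
def normalize_lifecycle_args(arguments=None):
    """Return lifecycle plugin arguments with defaults and range checks applied."""
    arguments = dict(arguments or {})
    window_size = arguments.get("lifecycle.WindowSize", 10)
    max_grade = arguments.get("lifecycle.MaxGrade", 3)
    max_score = float(arguments.get("lifecycle.MaxScore", 200.0))
    saturated = float(arguments.get("lifecycle.SaturatedTresh", 0.8))

    # Ranges from Table 4; rejecting bad values is an assumption of this sketch.
    if not isinstance(window_size, int) or window_size < 1:
        raise ValueError("lifecycle.WindowSize must be an integer >= 1")
    if not isinstance(max_grade, int) or max_grade < 3:
        raise ValueError("lifecycle.MaxGrade must be an integer >= 3")
    if max_score < 50.0:
        raise ValueError("lifecycle.MaxScore must be >= 50.0")

    # Documented behavior: values below 0.5 become 0.5, values above 1 become 1.
    saturated = min(max(saturated, 0.5), 1.0)

    return {
        "lifecycle.WindowSize": window_size,
        "lifecycle.MaxGrade": max_grade,
        "lifecycle.MaxScore": max_score,
        "lifecycle.SaturatedTresh": saturated,
    }


print(normalize_lifecycle_args({"lifecycle.SaturatedTresh": 0.2}))
```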
- Configure scheduling policies for the add-on.
- Scheduling policies do not take effect on add-on instances of the DaemonSet type.
- When configuring multi-AZ deployment or node affinity, ensure that there are nodes meeting the scheduling policy and that resources are sufficient in the cluster. Otherwise, the add-on cannot run.
Table 5 Configurations for add-on scheduling Parameter
Description
Multi AZ
- Preferred: Deployment pods of the add-on will be preferentially scheduled to nodes in different AZs. If all the nodes in the cluster are deployed in the same AZ, the pods will be scheduled to different nodes in that AZ.
- Required: Deployment pods of the add-on are forcibly scheduled to nodes in different AZs. There can be at most one pod in each AZ. If nodes in a cluster are not in different AZs, some add-on pods cannot run properly. If a node is faulty, add-on pods on it may fail to be migrated.
Node Affinity
- Not configured: Node affinity is disabled for the add-on.
- Node Affinity: Specify the nodes where the add-on is deployed. If you do not specify the nodes, the add-on will be randomly scheduled based on the default cluster scheduling policy.
- Specified Node Pool Scheduling: Specify the node pool where the add-on is deployed. If you do not specify the node pool, the add-on will be randomly scheduled based on the default cluster scheduling policy.
- Custom Policies: Enter the labels of the nodes where the add-on is to be deployed for more flexible scheduling policies. If you do not specify node labels, the add-on will be randomly scheduled based on the default cluster scheduling policy.
If multiple custom affinity policies are configured, ensure that there are nodes that meet all the affinity policies in the cluster. Otherwise, the add-on cannot run.
Toleration
Tolerations work together with taints to allow (but not force) the add-on Deployment to be scheduled to nodes with matching taints. They also control the eviction policy of the Deployment after the node where it runs is tainted.
The add-on adds the default tolerance policy for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints, respectively. The tolerance time window is 60s.
For details, see Configuring Tolerance Policies.
- Click Install.
Components
Component | Description | Resource Type |
---|---|---|
volcano-scheduler | Schedule pods. | Deployment |
volcano-controller | Synchronize CRDs. | Deployment |
volcano-admission | Webhook server, which verifies and modifies resources such as pods and jobs | Deployment |
volcano-agent | Cloud native hybrid agent, which is used for node QoS assurance, CPU burst, and dynamic resource oversubscription | DaemonSet |
resource-exporter | Report the NUMA topology information of nodes. | DaemonSet |
volcano-descheduler | Reschedule pods in a cluster. After the rescheduling capability is enabled, pods will be automatically deployed on nodes. | Deployment |
Modifying the volcano-scheduler Configurations Using the Console
volcano-scheduler is the component responsible for pod scheduling. It consists of a series of actions and plugins. Actions define what is executed in each scheduling step, and plugins provide the algorithm details of those actions for different scenarios. volcano-scheduler is highly scalable. You can specify and implement actions and plugins based on your requirements.
Volcano allows you to configure the scheduler during installation, upgrade, and editing. The configuration will be synchronized to volcano-scheduler-configmap.
This section describes how to configure volcano-scheduler.
Only Volcano v1.7.1 and later versions support this function. On the new add-on page, options such as resource_exporter_enable are replaced by default_scheduler_conf.
Log in to the CCE console and click the cluster name to access the cluster console. Choose Add-ons in the navigation pane. On the right of the page, locate Volcano Scheduler and click Install or Upgrade. In the Parameters area, configure the Volcano parameters.
- Using resource_exporter:
...
"default_scheduler_conf": {
    "actions": "allocate, backfill, preempt",
    "tiers": [
        {
            "plugins": [
                { "name": "priority" },
                { "name": "gang" },
                { "name": "conformance" }
            ]
        },
        {
            "plugins": [
                { "name": "drf" },
                { "name": "predicates" },
                { "name": "nodeorder" }
            ]
        },
        {
            "plugins": [
                { "name": "cce-gpu-topology-predicate" },
                { "name": "cce-gpu-topology-priority" },
                { "name": "cce-gpu" },
                {
                    "name": "numa-aware" # adding this plugin also enables resource_exporter
                }
            ]
        },
        {
            "plugins": [
                { "name": "nodelocalvolume" },
                { "name": "nodeemptydirvolume" },
                { "name": "nodeCSIscheduling" },
                { "name": "networkresource" }
            ]
        }
    ]
},
...
After this function is enabled, you can use the functions of both numa-aware and resource_exporter.
Retaining the Original volcano-scheduler-configmap Configurations
If you want to use the original configuration after the plugin is upgraded, perform the following steps:
- Check and back up the original volcano-scheduler-configmap configuration.
Example:
# kubectl edit cm volcano-scheduler-configmap -n kube-system
apiVersion: v1
data:
  default-scheduler.conf: |-
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: nodeorder
      - name: binpack
        arguments:
          binpack.cpu: 100
          binpack.weight: 10
          binpack.resources: nvidia.com/gpu
          binpack.resources.nvidia.com/gpu: 10000
    - plugins:
      - name: cce-gpu-topology-predicate
      - name: cce-gpu-topology-priority
      - name: cce-gpu
    - plugins:
      - name: nodelocalvolume
      - name: nodeemptydirvolume
      - name: nodeCSIscheduling
      - name: networkresource
- Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Add-ons. On the right of the page, locate Volcano Scheduler and click Install or Edit. In the Parameters area, modify the advanced settings.
- Enter the customized settings:
...
"default_scheduler_conf": {
    "actions": "enqueue, allocate, backfill",
    "tiers": [
        {
            "plugins": [
                { "name": "priority" },
                { "name": "gang" },
                { "name": "conformance" }
            ]
        },
        {
            "plugins": [
                { "name": "drf" },
                { "name": "predicates" },
                { "name": "nodeorder" },
                {
                    "name": "binpack",
                    "arguments": {
                        "binpack.cpu": 100,
                        "binpack.weight": 10,
                        "binpack.resources": "nvidia.com/gpu",
                        "binpack.resources.nvidia.com/gpu": 10000
                    }
                }
            ]
        },
        {
            "plugins": [
                { "name": "cce-gpu-topology-predicate" },
                { "name": "cce-gpu-topology-priority" },
                { "name": "cce-gpu" }
            ]
        },
        {
            "plugins": [
                { "name": "nodelocalvolume" },
                { "name": "nodeemptydirvolume" },
                { "name": "nodeCSIscheduling" },
                { "name": "networkresource" }
            ]
        }
    ]
},
...
When this function is used, the original content in volcano-scheduler-configmap will be overwritten. Therefore, you must check whether volcano-scheduler-configmap has been modified during the upgrade. If yes, synchronize the modification to the upgrade page.
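Carrying a backed-up ConfigMap over to the console means rewriting the same actions and tiers as JSON. The sketch below shows one way to build that JSON with Python's standard json module. The to_console_conf helper and its input shape (a list of tiers, each a list of plugin names or (name, arguments) pairs) are assumptions made for this illustration; the tiers are typed in by hand from the backup rather than parsed from YAML.

```python
import json


def to_console_conf(actions, tiers):
    """Build the default_scheduler_conf JSON structure from actions and tiers.

    Each tier is a list whose entries are either a plugin name or a
    (name, arguments) pair for plugins that take arguments.
    """
    json_tiers = []
    for tier in tiers:
        plugins = []
        for entry in tier:
            if isinstance(entry, str):
                plugins.append({"name": entry})
            else:
                name, arguments = entry
                plugins.append({"name": name, "arguments": arguments})
        json_tiers.append({"plugins": plugins})
    return {"default_scheduler_conf": {"actions": actions, "tiers": json_tiers}}


# Values copied from the backed-up ConfigMap example above.
binpack_arguments = {
    "binpack.cpu": 100,
    "binpack.weight": 10,
    "binpack.resources": "nvidia.com/gpu",
    "binpack.resources.nvidia.com/gpu": 10000,
}
conf = to_console_conf(
    "enqueue, allocate, backfill",
    [
        ["priority", "gang", "conformance"],
        ["drf", "predicates", "nodeorder", ("binpack", binpack_arguments)],
        ["cce-gpu-topology-predicate", "cce-gpu-topology-priority", "cce-gpu"],
        ["nodelocalvolume", "nodeemptydirvolume", "nodeCSIscheduling", "networkresource"],
    ],
)
print(json.dumps(conf, indent=4))
```

Generating the JSON this way keeps the plugin order within each tier identical to the backup, which matters because the configured sequence is the execution sequence.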
Collecting Prometheus Metrics
volcano-scheduler exposes Prometheus metrics through port 8080. You can build a Prometheus collector to identify and obtain volcano-scheduler scheduling metrics from http://{{volcano-schedulerPodIP}}:{{volcano-schedulerPodPort}}/metrics.
Prometheus metrics can be exposed only by the Volcano add-on of version 1.8.5 or later.
Metric | Type | Description | Label |
---|---|---|---|
e2e_scheduling_latency_milliseconds | Histogram | E2E scheduling latency (ms) (scheduling algorithm + binding) | None |
e2e_job_scheduling_latency_milliseconds | Histogram | E2E job scheduling latency (ms) | None |
e2e_job_scheduling_duration | Gauge | E2E job scheduling duration | labels=["job_name", "queue", "job_namespace"] |
plugin_scheduling_latency_microseconds | Histogram | Plugin scheduling latency (µs) | labels=["plugin", "OnSession"] |
action_scheduling_latency_microseconds | Histogram | Action scheduling latency (µs) | labels=["action"] |
task_scheduling_latency_milliseconds | Histogram | Task scheduling latency (ms) | None |
schedule_attempts_total | Counter | Number of pod scheduling attempts. unschedulable indicates that the pods cannot be scheduled, and error indicates that the internal scheduler is faulty. | labels=["result"] |
pod_preemption_victims | Gauge | Number of selected preemption victims | None |
total_preemption_attempts | Counter | Total number of preemption attempts in a cluster | None |
unschedule_task_count | Gauge | Number of unschedulable tasks | labels=["job_id"] |
unschedule_job_count | Gauge | Number of unschedulable jobs | None |
job_retry_counts | Counter | Number of job retries | labels=["job_id"] |
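For a quick look at these metrics without a full Prometheus setup, the endpoint can be fetched and parsed with the Python standard library. This is a minimal sketch under stated assumptions: fetch_metrics and parse_samples are hypothetical helpers, the parser assumes sample lines carry no trailing timestamp, and the SAMPLE text is made-up data used only to exercise the parser (real output may use different names or prefixes than the table above).

```python
import urllib.request


def fetch_metrics(pod_ip, port=8080, timeout=5.0):
    """Fetch the raw metrics text from a volcano-scheduler pod."""
    url = f"http://{pod_ip}:{port}/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_samples(text):
    """Map each series (metric name plus labels) to its float value.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    Assumption: no timestamp follows the value on a sample line.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Made-up sample text for illustration only.
SAMPLE = """# TYPE schedule_attempts_total counter
schedule_attempts_total{result="unschedulable"} 3
schedule_attempts_total{result="error"} 0
# TYPE unschedule_job_count gauge
unschedule_job_count 2
"""

for series, value in parse_samples(SAMPLE).items():
    print(series, value)
```

Against a live cluster, parse_samples(fetch_metrics(pod_ip)) would be called with the IP address of a volcano-scheduler pod that is reachable from where the script runs.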
Uninstalling the Volcano Add-on
After the add-on is uninstalled, all custom Volcano resources (Table 8) will be deleted, including the created resources. Reinstalling the add-on will not inherit or restore the tasks before the uninstallation. It is a good practice to uninstall the Volcano add-on only when no custom Volcano resources are being used in the cluster.
Item | API Group | API Version | Resource Level |
---|---|---|---|
Command | bus.volcano.sh | v1alpha1 | Namespaced |
Job | batch.volcano.sh | v1alpha1 | Namespaced |
Numatopology | nodeinfo.volcano.sh | v1alpha1 | Cluster |
PodGroup | scheduling.volcano.sh | v1beta1 | Namespaced |
Queue | scheduling.volcano.sh | v1beta1 | Cluster |
BalancerPolicyTemplate | autoscaling.volcano.sh | v1alpha1 | Cluster |
Balancer | autoscaling.volcano.sh | v1alpha1 | Cluster |
BalancerPolicyTemplate and Balancer resources are created only after the application scaling priority policies are enabled. For details, see Application Scaling Priority Policies.
Related Operations
Change History
It is a good practice to upgrade Volcano to the latest version that is supported by the cluster.
Add-on Version | Supported Cluster Version | New Feature |
---|---|---|
1.13.7 | v1.21 v1.23 v1.25 v1.27 v1.28 v1.29 | |
1.13.3 | v1.21 v1.23 v1.25 v1.27 v1.28 v1.29 | |
1.13.1 | v1.21 v1.23 v1.25 v1.27 v1.28 v1.29 | Optimized scheduler memory usage. |
1.12.18 | v1.21 v1.23 v1.25 v1.27 v1.28 v1.29 | |
1.12.1 | v1.19.16 v1.21 v1.23 v1.25 v1.27 v1.28 | Optimized application auto scaling performance. |
1.11.21 | v1.19.16 v1.21 v1.23 v1.25 v1.27 v1.28 | |
1.11.6 | v1.19.16 v1.21 v1.23 v1.25 v1.27 | |
1.10.7 | v1.19.16 v1.21 v1.23 v1.25 | Fixes the issue that the local PV add-on fails to calculate the number of pods pre-bound to the node. |
1.10.5 | v1.19.16 v1.21 v1.23 v1.25 | |
1.9.1 | v1.19.16 v1.21 v1.23 v1.25 | |
1.7.2 | v1.19.16 v1.21 v1.23 v1.25 | |
1.7.1 | v1.19.16 v1.21 v1.23 v1.25 | Adapts to clusters 1.25. |
1.4.7 | v1.15 v1.17 v1.19 v1.21 | Deletes the pod status Undetermined to adapt to cluster Autoscaler. |
1.4.5 | v1.17 v1.19 v1.21 | Changes the deployment mode of volcano-scheduler from statefulset to deployment, and fixes the issue that pods cannot be automatically migrated when the node is abnormal. |
1.4.2 | v1.15 v1.17 v1.19 v1.21 | |
1.3.7 | v1.15 v1.17 v1.19 v1.21 | |
1.3.3 | v1.15 v1.17 v1.19 v1.21 | Fixes the scheduler crash caused by GPU exceptions and the privileged init container admission failure. |
1.3.1 | v1.15 v1.17 v1.19 | |
1.2.5 | v1.15 v1.17 v1.19 | |
1.2.3 | v1.15 v1.17 v1.19 | |