Updated on 2025-08-18 GMT+08:00

Volcano Scheduler

Description

Volcano is a batch scheduling platform based on Kubernetes. It provides a series of features required by machine learning, deep learning, bioinformatics, genomics, and other big data applications, as a powerful supplement to Kubernetes capabilities.

Volcano provides general computing capabilities such as high-performance job scheduling, heterogeneous chip management, and job running management. It integrates with computing frameworks used in industries such as AI, big data, genomics, and rendering, and can schedule up to 1,000 pods per second for end users, greatly improving scheduling efficiency and resource utilization.

Volcano provides job scheduling, job management, and queue management for computing applications. Its main features are as follows:

  • Diverse computing frameworks: Custom resource definitions (CRDs) provide common APIs for batch computing tasks. With various plug-ins and advanced job lifecycle management, computing frameworks such as TensorFlow, MPI, and Spark can run on Kubernetes in containers.
  • Advanced scheduling: Advanced scheduling capabilities are provided for batch computing and high-performance computing scenarios, including group scheduling, priority preemption, packing, resource reservation, and task topology.
  • Queue management: Jobs are scheduled through queues. Complex job scheduling can be implemented based on queue priorities or through multi-level queues.
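The features above can be sketched as a pair of Kubernetes manifests. The following Python snippet is a minimal illustration, not a reference configuration: it builds a Volcano Queue and a Volcano Job as plain dictionaries and prints them as JSON. The API groups (`scheduling.volcano.sh/v1beta1`, `batch.volcano.sh/v1alpha1`) and the `minAvailable`, `queue`, and `schedulerName` fields come from the open-source Volcano project. The queue name, job name, image, and command are hypothetical placeholders, and the versions served by a given plug-in release should be checked before use.

```python
import json


def build_queue(name: str, weight: int) -> dict:
    """Return a Volcano Queue manifest; weight sets its relative share of resources."""
    return {
        "apiVersion": "scheduling.volcano.sh/v1beta1",
        "kind": "Queue",
        "metadata": {"name": name},
        "spec": {"weight": weight},
    }


def build_gang_job(name: str, queue: str, image: str, replicas: int, command: list) -> dict:
    """Return a Volcano Job manifest whose pods are scheduled as one group."""
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",  # hand the pods to Volcano, not the default scheduler
            "queue": queue,              # queue the job is submitted to
            "minAvailable": replicas,    # group scheduling: start only if all pods fit
            "tasks": [
                {
                    "name": "worker",
                    "replicas": replicas,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [
                                {"name": "worker", "image": image, "command": command}
                            ],
                        }
                    },
                }
            ],
        },
    }


# Hypothetical names and image, chosen only for illustration.
queue = build_queue("training-queue", weight=2)
job = build_gang_job(
    name="demo-training-job",
    queue="training-queue",
    image="busybox:1.36",
    replicas=4,
    command=["sh", "-c", "echo training step"],
)

# Wrap both objects in a v1 List so they can be handled as one document.
manifest = {"apiVersion": "v1", "kind": "List", "items": [queue, job]}
print(json.dumps(manifest, indent=2))
```

Setting `minAvailable` equal to the number of replicas asks the scheduler to place either all four workers or none, which is the usual choice for distributed training where a partial set of workers cannot make progress.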

Volcano is open source and available on GitHub at https://github.com/volcano-sh/volcano.

Constraints

Exercise caution when rolling the plug-in back from a later version to an earlier version, as this may cause job scheduling failures.
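A version change can be checked for being a downgrade before it is applied. The helper below is a minimal sketch that assumes plug-in versions are dot-separated integers, as in the release history (for example, 1.17.11). The function names are hypothetical and not part of any Volcano or platform API.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Split a dotted version string such as '1.17.11' into integers."""
    return tuple(int(part) for part in version.split("."))


def is_downgrade(current: str, target: str) -> bool:
    """Return True if moving from current to target lowers the version."""
    return parse_version(target) < parse_version(current)


print(is_downgrade("1.17.11", "1.16.8"))  # True
print(is_downgrade("1.15.6", "1.15.8"))   # False
print(is_downgrade("1.10.7", "1.7.1"))    # True
```

Comparing integer tuples rather than raw strings matters for the last case: as strings, "1.7.1" sorts after "1.10.7", which would hide the downgrade.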

Installing a Plug-in

This plug-in is installed automatically when you create a dedicated resource pool with Job Type set to Training Jobs.

Components

Table 1 Plug-in components

| Component | Description | Resource Type |
|---|---|---|
| volcano-scheduler | Schedules pods. | Deployment |
| volcano-controller | Synchronizes CRDs. | Deployment |
| volcano-admission | Webhook server that verifies and modifies resources such as pods and jobs. | Deployment |

Change History

Table 2 Release history

Plug-in Version

New Feature

1.17.11

  • Optimized the cabinet affinity and packing capabilities.
  • Optimized the Ascend NPU preemption capability.
  • Supported Kubernetes v1.32.
  • Supported topology affinity scheduling of Ascend high-density models.

1.16.8

  • Optimized the resource scheduling capability of supernodes.
  • Supported Kubernetes v1.31.

1.15.8

Supported Ascend NPU dual-die affinity scheduling.

1.15.6

Supported resource oversubscription based on pod profiling.

1.13.5

  • Supported scale-in of customized resources based on node priorities.
  • Optimized the association between preemption and node scale-out.

1.12.18

  • Adapted to CCE 1.29 clusters.
  • Enabled the preemption function by default.

1.12.1

Optimized application auto scaling performance.

1.11.9

  • Optimized the sorting capability of the NPU rank table.
  • Supported priority-based scheduling in autoscaling scenarios.

1.10.10

Fixed the issue that the local PV plug-in fails to calculate the number of pods pre-bound to the node.

1.10.7

Fixed the issue that the local PV plug-in fails to calculate the number of pods pre-bound to the node.

1.7.1

Supported v1.25 clusters.