AI Workload Scheduling
This section describes the key functions of Volcano Scheduler in AI workload scheduling, including auto scheduling, task scheduling, heterogeneous resource scheduling, and queue scheduling. Volcano Scheduler offers general computing capabilities, including a high-performance task scheduling engine, efficient management of heterogeneous chips, and advanced task execution management. These features enhance the scheduling efficiency and execution performance of AI workloads.
Auto Scheduling
Volcano Scheduler supports priority-based scheduling specifically designed to optimize application scaling.
Feature | Description
---|---
Application scaling priority | With application scaling priority policies, you can manage resources more efficiently by customizing the scaling order of pods across different types of nodes. By default, yearly/monthly nodes are prioritized over pay-per-use nodes: during a scale-out, Volcano Scheduler schedules pods to yearly/monthly nodes first, and during a scale-in, it deletes pods from pay-per-use nodes before those on yearly/monthly nodes.
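As an illustration only (this is not Volcano code), the default scale-in victim order described above can be modeled as a simple sort in which pods hosted on pay-per-use nodes are removed before pods on yearly/monthly nodes. The pod names and billing-mode labels below are hypothetical:

```python
# Minimal sketch of the default scale-in victim order: pods on pay-per-use
# nodes are deleted before pods on yearly/monthly nodes.

def scale_in_order(pods):
    """Sort pods into deletion order.

    `pods` is a list of (pod_name, node_billing_mode) tuples, where the
    billing mode is either 'pay-per-use' or 'yearly/monthly' (illustrative
    labels, not Volcano API values).
    """
    priority = {"pay-per-use": 0, "yearly/monthly": 1}
    return sorted(pods, key=lambda p: priority[p[1]])

pods = [("a", "yearly/monthly"), ("b", "pay-per-use"), ("c", "yearly/monthly")]
# The pod on the pay-per-use node comes first in the deletion order.
print(scale_in_order(pods))
```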
Task Scheduling
Volcano Scheduler provides Dominant Resource Fairness (DRF) and gang scheduling for batch computing tasks.
Feature | Description
---|---
Fair scheduling (DRF) | Volcano Scheduler supports Dominant Resource Fairness (DRF), which is built on the max-min fairness algorithm. DRF ensures equitable resource allocation among multiple users by evaluating key resources such as CPUs, memory, and storage and allocating them fairly according to each user's requirements during scheduling. DRF maximizes cluster service throughput, shortens overall execution time, and enhances training performance, making it an ideal scheduling approach for workloads like batch AI training and big data processing.
Gang scheduling | Volcano Scheduler supports gang scheduling, an "all-or-nothing" approach that prevents the resource wastage caused by arbitrary pod scheduling. It checks whether the number of pods that can be scheduled for a job meets the minimum required for execution. If the threshold is met, all pods are scheduled simultaneously; otherwise, none are. Gang scheduling reduces resource busy-waiting and deadlocks in distributed training, thereby enhancing cluster resource utilization.
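In a Volcano Job, the gang threshold is expressed through the `minAvailable` field: the job's pods are scheduled only when at least that many can start together. The sketch below shows a minimal manifest; the job name, image, and resource requests are placeholders, not values from this document:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-demo            # placeholder name
spec:
  schedulerName: volcano     # use Volcano instead of the default scheduler
  minAvailable: 4            # gang threshold: schedule all 4 workers or none
  tasks:
    - name: worker
      replicas: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: example/trainer:latest   # placeholder image
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
```

Because `minAvailable` equals the replica count here, a partial placement (for example, only 3 of 4 workers fitting) leaves the whole job pending rather than holding resources idle.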
Heterogeneous Resource Scheduling
Volcano Scheduler provides GPU sharing scheduling, NUMA-aware scheduling, and NPU topology scheduling for heterogeneous resources such as CPUs, GPUs, and NPUs.
Feature | Description
---|---
GPU virtualization scheduling | Volcano Scheduler facilitates GPU virtualization scheduling and isolation on GPU nodes and provides scheduling policies for managing GPU virtualization workloads.
NUMA affinity scheduling | Volcano Scheduler includes NUMA affinity scheduling, which assigns pods to worker nodes with minimal cross-NUMA-node access. This reduces data transmission overhead, optimizes resource utilization, and enhances overall system performance.
NPU topology-aware scheduling | Volcano Scheduler provides intra-node NPU topology-aware scheduling, which leverages the hardware topology of Ascend AI processors. By optimizing resource allocation and network path selection, it reduces compute resource fragmentation and minimizes network congestion, maximizing NPU compute utilization and significantly enhancing the efficiency of AI training and inference tasks. This ensures the efficient scheduling and management of Ascend compute resources.
Hypernode topology affinity scheduling | Volcano Scheduler supports hypernode topology affinity scheduling. A hypernode is composed of 48 nodes whose NPUs are interconnected through a dedicated hyperplane network, which enables significantly faster data transmission than traditional setups. Hypernode topology affinity scheduling assigns highly interdependent pods to the same hypernode, minimizing cross-node communication, reducing network latency, and boosting data transmission speeds.
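NUMA affinity scheduling is commonly requested per pod. The sketch below assumes the Volcano `volcano.sh/numa-topology-policy` pod annotation, with policy values such as `best-effort`, `restricted`, and `single-numa-node` mirroring the kubelet topology manager; verify the annotation key and supported values against your Volcano version before relying on them. The pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: numa-demo            # placeholder name
  annotations:
    # Assumption: Volcano's NUMA-aware plugin reads this annotation;
    # single-numa-node keeps all of the pod's resources on one NUMA node.
    volcano.sh/numa-topology-policy: single-numa-node
spec:
  schedulerName: volcano
  containers:
    - name: app
      image: example/app:latest   # placeholder image
      resources:
        requests:                 # integer CPU counts with requests == limits
          cpu: "4"                # (Guaranteed QoS) are typically required
          memory: 8Gi             # for CPU/NUMA pinning
        limits:
          cpu: "4"
          memory: 8Gi
```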
Queue Scheduling
Volcano Scheduler supports queue scheduling to effectively manage AI and batch computing tasks.
Feature | Description
---|---
Queue scheduling | The queue is a core concept in Volcano, designed to support resource allocation and task scheduling in multi-tenant scenarios. With queues, you can implement multi-tenant resource allocation, task priority control, and resource preemption and reclamation, all of which significantly improve cluster resource utilization and task scheduling efficiency.
Hierarchical queues | In real-world applications, different queues typically belong to different departments, which often have hierarchical relationships. This structure creates more complex, refined requirements for resource allocation and preemption that traditional peer queues cannot meet. To address this, Volcano Scheduler introduces hierarchical queues, which enable resource allocation, sharing, and preemption across levels. With hierarchical queues, you can manage resource quotas at a finer granularity and build a more efficient, unified scheduling platform.
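Queues are cluster-scoped Volcano objects, and a workload is bound to one by name. The sketch below shows a queue with a scheduling weight and a resource cap, plus a Volcano Job submitted into it; the names and quota values are placeholders chosen for illustration:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a               # placeholder queue name
spec:
  weight: 2                  # relative share when resources are contended
  capability:                # upper bound on what this queue may consume
    cpu: "16"
    memory: 64Gi
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: team-a-train         # placeholder job name
spec:
  schedulerName: volcano
  queue: team-a              # bind the job to the queue above
  minAvailable: 1
  tasks:
    - name: main
      replicas: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: example/trainer:latest   # placeholder image
```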