Updated on 2024-02-01 GMT+08:00

Volcano

Introduction

Volcano is a batch processing platform based on Kubernetes. It provides a series of features required by machine learning, deep learning, bioinformatics, genomics, and other big data applications, as a powerful supplement to Kubernetes capabilities.

Volcano provides general-purpose, high-performance computing capabilities, such as a job scheduling engine, heterogeneous chip management, and job running management, serving end users through computing frameworks for different industries, such as AI, big data, gene sequencing, and rendering. (Volcano has been open-sourced on GitHub.)

Volcano provides job scheduling, job management, and queue management for computing applications. Its main features are as follows:

  • Diverse computing frameworks, such as TensorFlow, MPI, and Spark, can run on Kubernetes in containers. Volcano provides common APIs for batch computing jobs through CRDs, various add-ons, and advanced job lifecycle management.
  • Advanced scheduling capabilities are provided for batch computing and high-performance computing scenarios, including group scheduling, preemptive priority scheduling, packing, resource reservation, and task topology.
  • Queues can be effectively managed for scheduling jobs. Complex job scheduling capabilities such as queue priority and multi-level queues are supported. (See the example queue and job manifests below.)

Open source community: https://github.com/volcano-sh/volcano
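
The following is a minimal, illustrative example of a Volcano Queue and a gang-scheduled Volcano Job submitted to it. The queue name, job name, image, and replica counts are placeholders; the scheduling.volcano.sh/v1beta1 and batch.volcano.sh/v1alpha1 APIs are provided by the CRDs installed with the add-on.

    # Illustrative queue with a scheduling weight.
    apiVersion: scheduling.volcano.sh/v1beta1
    kind: Queue
    metadata:
      name: demo-queue
    spec:
      weight: 1
    ---
    # Gang-scheduled job: no pod starts until all 2 replicas can be scheduled together.
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: demo-job
    spec:
      schedulerName: volcano     # schedule this job with Volcano
      queue: demo-queue          # submit the job to the queue defined above
      minAvailable: 2            # gang scheduling threshold
      tasks:
        - replicas: 2
          name: worker
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: worker
                  image: busybox
                  command: ["sh", "-c", "echo hello from volcano && sleep 30"]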

Installing the Add-on

  1. Log in to the UCS console and click the cluster name to go to its details page. In the navigation pane, choose Add-ons. Locate Volcano and click Install.
  2. Select Standalone, Custom, or HA for Add-on Specifications.

    If you select Custom, the following requests and limits are recommended for volcano-controller and volcano-scheduler:

    • If the number of nodes is less than 100, retain the default configuration. The requested CPU is 500m, and the limit is 2000m. The requested memory is 500 Mi, and the limit is 2000 Mi.
    • If the number of nodes is greater than 100, increase the requested CPU by 500m and the requested memory by 1000 Mi each time 100 nodes (10,000 pods) are added. Set the CPU limit to 1500m more than the CPU request and the memory limit to 1000 Mi more than the memory request.

      Formulas for calculating the requests and limits:

      • CPU: Multiply the number of nodes by the number of pods, locate this product among the Nodes x Pods values of the specifications in Table 1, and round up to the nearest specification to obtain the recommended request and limit.

        For example, for 2,000 nodes and 20,000 pods, Number of target nodes x Number of target pods = 40 million, which rounds up to the 700/70,000 specification (Number of nodes x Number of pods = 49 million). You are advised to set the CPU request to 4000m and the limit to 5500m.

      • Memory: Allocate 2.4 GiB of memory to every 1,000 nodes and 1 GiB of memory to every 10,000 pods. The memory request is the sum of the two values. (The obtained value may be different from the recommended value in Table 1. You can use either of them.)

        Memory request = Number of nodes/1000 x 2.4 GiB + Number of pods/10000 x 1 GiB

        For example, for 2,000 nodes and 20,000 pods, the memory request value is 6.8 GiB (2000/1000 x 2.4 GiB + 20000/10000 x 1 GiB).

      Table 1 Recommended requests and limits for volcano-controller and volcano-scheduler

      Nodes/Pods in a Cluster | CPU Request (m) | CPU Limit (m) | Memory Request (Mi) | Memory Limit (Mi)
      ----------------------- | --------------- | ------------- | ------------------- | -----------------
      50/5,000                | 500             | 2,000         | 500                 | 2,000
      100/10,000              | 1,000           | 2,500         | 1,500               | 2,500
      200/20,000              | 1,500           | 3,000         | 2,500               | 3,500
      300/30,000              | 2,000           | 3,500         | 3,500               | 4,500
      400/40,000              | 2,500           | 4,000         | 4,500               | 5,500
      500/50,000              | 3,000           | 4,500         | 5,500               | 6,500
      600/60,000              | 3,500           | 5,000         | 6,500               | 7,500
      700/70,000              | 4,000           | 5,500         | 7,500               | 8,500
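
      For reference, with the 2,000-node/20,000-pod example above, the Custom specification for volcano-scheduler would correspond to resource settings similar to the following sketch. The CPU values come from the example above; the memory limit is an assumption based on the Table 1 pattern (limit = request + 1000 Mi). Adjust the values to your own cluster size.

      resources:
        requests:
          cpu: 4000m         # from the CPU formula, rounded up to the 700/70,000 specification
          memory: 6963Mi     # about 6.8 GiB from the memory formula
        limits:
          cpu: 5500m         # from the CPU formula example
          memory: 7963Mi     # assumption: request + 1000 Mi, matching the Table 1 pattern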

  3. Configure parameters of the default volcano scheduler. For details, see Table 2.

    colocation_enable: ''
    default_scheduler_conf:
      actions: 'allocate, backfill'
      tiers:
        - plugins:
            - name: 'priority'
            - name: 'gang'
            - name: 'conformance'
        - plugins:
            - name: 'drf'
            - name: 'predicates'
            - name: 'nodeorder'
        - plugins:
            - name: 'cce-gpu-topology-predicate'
            - name: 'cce-gpu-topology-priority'
            - name: 'cce-gpu'
        - plugins:
            - name: 'nodelocalvolume'
            - name: 'nodeemptydirvolume'
            - name: 'nodeCSIscheduling'
            - name: 'networkresource'
    Table 2 Volcano add-ons

    binpack

      Function: Schedules pods to nodes with high resource utilization to reduce resource fragments.

      Description:
      • binpack.weight: weight of the binpack add-on.
      • binpack.cpu: CPU weight. The default value is 1.
      • binpack.memory: memory weight. The default value is 1.
      • binpack.resources: other resource types to be considered, such as nvidia.com/gpu.

      Demonstration:
      - plugins:
        - name: binpack
          arguments:
            binpack.weight: 10
            binpack.cpu: 1
            binpack.memory: 1
            binpack.resources: nvidia.com/gpu, example.com/foo
            binpack.resources.nvidia.com/gpu: 2
            binpack.resources.example.com/foo: 3

    conformance

      Function: Prevents key pods, such as those in the kube-system namespace, from being preempted.

    gang

      Function: Treats a group of pods as a whole when allocating resources (gang scheduling).

    priority

      Function: Schedules pods based on custom workload priorities.

    overcommit

      Function: Overcommits cluster resources by a certain factor when jobs are enqueued to improve workload enqueuing efficiency. If all workloads are Deployments, remove this add-on or set the raising factor to 2.0.

      Description: overcommit-factor: raising factor. The default value is 1.2.

      Demonstration:
      - plugins:
        - name: overcommit
          arguments:
            overcommit-factor: 2.0

    drf

      Function: Schedules resources based on each job's dominant resource (Dominant Resource Fairness). The job with the smallest dominant resource share is scheduled first.

    predicates

      Function: Determines whether a task can be bound to a node using a series of evaluation algorithms, such as node/pod affinity, taint toleration, node port conflicts, volume limits, and volume zone matching.

    nodeorder

      Function: Scores all nodes for a task using a series of scoring algorithms.

      Description:
      • nodeaffinity.weight: scores nodes based on node affinity. The default value is 1.
      • podaffinity.weight: scores nodes based on pod affinity. The default value is 1.
      • leastrequested.weight: prefers the node with the least requested resources. The default value is 1.
      • balancedresource.weight: prefers the node with balanced resource usage. The default value is 1.
      • mostrequested.weight: prefers the node with the most requested resources. The default value is 0.
      • tainttoleration.weight: prefers nodes whose taints are tolerated by the pod. The default value is 1.
      • imagelocality.weight: prefers nodes where the required images already exist. The default value is 1.
      • selectorspread.weight: spreads pods evenly across different nodes. The default value is 0.
      • volumebinding.weight: scores nodes based on the local PV delayed binding policy. The default value is 1.
      • podtopologyspread.weight: scores nodes based on pod topology spread constraints. The default value is 2.

      Demonstration:
      - plugins:
        - name: nodeorder
          arguments:
            leastrequested.weight: 1
            mostrequested.weight: 0
            nodeaffinity.weight: 1
            podaffinity.weight: 1
            balancedresource.weight: 1
            tainttoleration.weight: 1
            imagelocality.weight: 1
            volumebinding.weight: 1
            podtopologyspread.weight: 2

    cce-gpu-topology-predicate

      Function: Preselection algorithm for GPU topology scheduling.

    cce-gpu-topology-priority

      Function: Priority algorithm for GPU topology scheduling.

    cce-gpu

      Function: Allocates GPU resources and supports decimal GPU configurations by working with the gpu add-on.

    numaaware

      Function: NUMA topology scheduling.

      Description: weight: weight of the numa-aware add-on.

    networkresource

      Function: Preselects and filters nodes based on their ENI requirements. The parameters are passed by CCE and do not need to be manually configured.

      Description: NetworkType: network type (eni or vpc-router).

    nodelocalvolume

      Function: Filters out nodes that do not meet local volume requirements.

    nodeemptydirvolume

      Function: Filters out nodes that do not meet emptyDir requirements.

    nodeCSIscheduling

      Function: Filters out nodes where the everest component is abnormal.

  4. Click Install.
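
    After the installation is complete, you can check that the Volcano components are running in the cluster. The following is a generic check (pod names vary by add-on version):

    # kubectl get pods -n kube-system | grep volcano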

Modifying the volcano-scheduler Configurations Using the Console

Volcano allows you to configure the scheduler during installation, upgrade, and editing. The configuration will be synchronized to volcano-scheduler-configmap.

This section describes how to configure volcano-scheduler.

Only Volcano 1.7.1 and later versions support this function. On the new add-on page, options such as plugins.eas_service and resource_exporter_enable are replaced by default_scheduler_conf.

Log in to the CCE console and access the cluster console. Choose Add-ons in the navigation pane. On the right of the page, locate Volcano and click Install or Upgrade. In the Parameters area, configure the volcano-scheduler parameters.

  • Using resource_exporter:
    {
        "ca_cert": "",
        "default_scheduler_conf": {
            "actions": "allocate, backfill",
            "tiers": [
                {
                    "plugins": [
                        {
                            "name": "priority"
                        },
                        {
                            "name": "gang"
                        },
                        {
                            "name": "conformance"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "drf"
                        },
                        {
                            "name": "predicates"
                        },
                        {
                            "name": "nodeorder"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "cce-gpu-topology-predicate"
                        },
                        {
                            "name": "cce-gpu-topology-priority"
                        },
                        {
                            "name": "cce-gpu"
                        },
                        {
                            "name": "numa-aware" # add this also enable resource_exporter
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "nodelocalvolume"
                        },
                        {
                            "name": "nodeemptydirvolume"
                        },
                        {
                            "name": "nodeCSIscheduling"
                        },
                        {
                            "name": "networkresource"
                        }
                    ]
                }
            ]
        },
        "server_cert": "",
        "server_key": ""
    }

    After the parameters are configured, you can use the functions of the numa-aware add-on and resource_exporter at the same time.

  • Using eas_service:
    {
        "ca_cert": "",
        "default_scheduler_conf": {
            "actions": "allocate, backfill",
            "tiers": [
                {
                    "plugins": [
                        {
                            "name": "priority"
                        },
                        {
                            "name": "gang"
                        },
                        {
                            "name": "conformance"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "drf"
                        },
                        {
                            "name": "predicates"
                        },
                        {
                            "name": "nodeorder"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "cce-gpu-topology-predicate"
                        },
                        {
                            "name": "cce-gpu-topology-priority"
                        },
                        {
                            "name": "cce-gpu"
                        },
                        {
                            "name": "eas",
                            "custom": {
                                "availability_zone_id": "",
                                "driver_id": "",
                                "endpoint": "",
                                "flavor_id": "",
                                "network_type": "",
                                "network_virtual_subnet_id": "",
                                "pool_id": "",
                                "project_id": "",
                                "secret_name": "eas-service-secret"
                            }
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "nodelocalvolume"
                        },
                        {
                            "name": "nodeemptydirvolume"
                        },
                        {
                            "name": "nodeCSIscheduling"
                        },
                        {
                            "name": "networkresource"
                        }
                    ]
                }
            ]
        },
        "server_cert": "",
        "server_key": ""
    }
  • Using ief:
    {
        "ca_cert": "",
        "default_scheduler_conf": {
            "actions": "allocate, backfill",
            "tiers": [
                {
                    "plugins": [
                        {
                            "name": "priority"
                        },
                        {
                            "name": "gang"
                        },
                        {
                            "name": "conformance"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "drf"
                        },
                        {
                            "name": "predicates"
                        },
                        {
                            "name": "nodeorder"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "cce-gpu-topology-predicate"
                        },
                        {
                            "name": "cce-gpu-topology-priority"
                        },
                        {
                            "name": "cce-gpu"
                        },
                        {
                            "name": "ief",
                            "enableBestNode": true
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "nodelocalvolume"
                        },
                        {
                            "name": "nodeemptydirvolume"
                        },
                        {
                            "name": "nodeCSIscheduling"
                        },
                        {
                            "name": "networkresource"
                        }
                    ]
                }
            ]
        },
        "server_cert": "",
        "server_key": ""
    }

Retaining the Original Configurations of volcano-scheduler-configmap

If you want to use the original configurations after the add-on is upgraded, perform the following steps:

  1. Check and back up the original volcano-scheduler-configmap configuration.

    Example:
    # kubectl edit cm volcano-scheduler-configmap -n kube-system
    apiVersion: v1
    data:
      default-scheduler.conf: |-
        actions: "enqueue, allocate, backfill"
        tiers:
        - plugins:
          - name: priority
          - name: gang
          - name: conformance
        - plugins:
          - name: drf
          - name: predicates
          - name: nodeorder
          - name: binpack
            arguments:
              binpack.cpu: 100
              binpack.weight: 10
              binpack.resources: nvidia.com/gpu
              binpack.resources.nvidia.com/gpu: 10000
        - plugins:
          - name: cce-gpu-topology-predicate
          - name: cce-gpu-topology-priority
          - name: cce-gpu
        - plugins:
          - name: nodelocalvolume
          - name: nodeemptydirvolume
          - name: nodeCSIscheduling
          - name: networkresource
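
    To keep a copy of this configuration before the upgrade, you can also export the ConfigMap to a local file (the backup file name here is only an example):

    # kubectl get cm volcano-scheduler-configmap -n kube-system -o yaml > volcano-scheduler-configmap-backup.yaml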

  2. Enter the customized content in the Parameters area on the console.

    {
        "ca_cert": "",
        "default_scheduler_conf": {
            "actions": "enqueue, allocate, backfill",
            "tiers": [
                {
                    "plugins": [
                        {
                            "name": "priority"
                        },
                        {
                            "name": "gang"
                        },
                        {
                            "name": "conformance"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "drf"
                        },
                        {
                            "name": "predicates"
                        },
                        {
                            "name": "nodeorder"
                        },
                        {
                            "name": "binpack",
                            "arguments": {
                                "binpack.cpu": 100,
                                "binpack.weight": 10,
                                "binpack.resources": "nvidia.com/gpu",
                                "binpack.resources.nvidia.com/gpu": 10000
                            }
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "cce-gpu-topology-predicate"
                        },
                        {
                            "name": "cce-gpu-topology-priority"
                        },
                        {
                            "name": "cce-gpu"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "nodelocalvolume"
                        },
                        {
                            "name": "nodeemptydirvolume"
                        },
                        {
                            "name": "nodeCSIscheduling"
                        },
                        {
                            "name": "networkresource"
                        }
                    ]
                }
            ]
        },
        "server_cert": "",
        "server_key": ""
    }

    After the parameters are configured, the original content of volcano-scheduler-configmap will be overwritten. Therefore, during the upgrade, check whether volcano-scheduler-configmap has been modified. If it has, synchronize the modifications to the Parameters area on the upgrade page.
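
    One way to check is to export the current ConfigMap and compare it with the backup taken in step 1 (the file names are examples):

    # kubectl get cm volcano-scheduler-configmap -n kube-system -o yaml > volcano-scheduler-configmap-current.yaml
    # diff volcano-scheduler-configmap-backup.yaml volcano-scheduler-configmap-current.yaml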

Change History

You are advised to upgrade Volcano to the latest version that matches the cluster.

Table 3 Cluster version mapping

Cluster Version            | Add-on Version
-------------------------- | ---------------------------------------
v1.25                      | 1.7.1 and 1.7.2
v1.23                      | 1.7.1 and 1.7.2
v1.21                      | 1.7.1 and 1.7.2
v1.19.16                   | 1.3.7, 1.3.10, 1.4.5, 1.7.1, and 1.7.2
v1.19                      | 1.3.7, 1.3.10, and 1.4.5
v1.17 (End of maintenance) | 1.3.7, 1.3.10, and 1.4.5
v1.15 (End of maintenance) | 1.3.7, 1.3.10, and 1.4.5

Table 4 CCE add-on versions

Add-on version 1.9.1 (supported cluster versions: /v1.19.16.*|v1.21.*|v1.23.*|v1.25.*/)

  • Fixed the issue that the counting pipeline pod of the networkresource add-on occupies supplementary network interfaces (sub-ENIs).
  • Fixed the issue where the binpack add-on scores nodes with insufficient resources.
  • Fixed the issue of processing resources in pods with an unknown end status.
  • Optimized event output.
  • Supported HA deployment by default.

Add-on version 1.7.2 (supported cluster versions: /v1.19.16.*|v1.21.*|v1.23.*|v1.25.*/)

  • Supported Kubernetes 1.25.
  • Improved Volcano scheduling.

Add-on version 1.7.1 (supported cluster versions: /v1.19.16.*|v1.21.*|v1.23.*|v1.25.*/)

  • Supported Kubernetes 1.25.

Add-on version 1.6.5 (supported cluster versions: /v1.19.*|v1.21.*|v1.23.*/)

  • Served as the CCE default scheduler.
  • Supported unified scheduling in hybrid deployments.

Add-on version 1.4.5 (supported cluster versions: /v1.17.*|v1.19.*|v1.21.*/)

  • Changed the deployment mode of volcano-scheduler from StatefulSet to Deployment, and fixed the issue that pods cannot be automatically migrated when a node is abnormal.

Add-on version 1.4.2 (supported cluster versions: /v1.15.*|v1.17.*|v1.19.*|v1.21.*/)

  • Resolved the issue that cross-GPU allocation fails.
  • Supported the updated EAS API.

Add-on version 1.3.3 (supported cluster versions: /v1.15.*|v1.17.*|v1.19.*|v1.21.*/)

  • Fixed the scheduler crash issue caused by GPU exceptions and the admission failure issue for privileged init containers.

Add-on version 1.3.1 (supported cluster versions: /v1.15.*|v1.17.*|v1.19.*/)

  • Upgraded the RAID controller card firmware to the latest version.
  • Supported Kubernetes 1.19.
  • Added the numa-aware add-on.
  • Fixed the Deployment scaling issue in the multi-queue scenario.
  • Adjusted the algorithm add-ons enabled by default.

Add-on version 1.2.5 (supported cluster versions: /v1.15.*|v1.17.*|v1.19.*/)

  • Fixed the OutOfcpu issue in some scenarios.
  • Fixed the issue that pods cannot be scheduled when some capabilities are set for a queue.
  • Made the log time of the Volcano components consistent with the system time.
  • Fixed the issue of preemption between multiple queues.
  • Fixed the issue that the result of the ioaware add-on does not meet expectations in some extreme scenarios.
  • Supported hybrid clusters.

Add-on version 1.2.3 (supported cluster versions: /v1.15.*|v1.17.*|v1.19.*/)

  • Fixed the training task OOM issue caused by insufficient precision.
  • Fixed the GPU scheduling issue in CCE 1.15 and later versions. Rolling upgrade of CCE versions during task distribution is not supported.
  • Fixed the issue where the queue status is unknown in certain scenarios.
  • Fixed the issue where a panic occurs when a PVC is mounted to a job in a specific scenario.
  • Fixed the issue that decimals cannot be configured for GPU jobs.
  • Added the ioaware add-on.
  • Added the ring controller.