
Volcano Scheduler

Updated on 2024-01-26 GMT+08:00

Introduction

Volcano is a batch processing platform based on Kubernetes. It provides a series of features required by machine learning, deep learning, bioinformatics, genomics, and other big data applications, as a powerful supplement to Kubernetes capabilities.

Volcano provides general-purpose, high-performance computing capabilities, such as job scheduling, heterogeneous chip management, and job running management, serving end users through computing frameworks for different industries, such as AI, big data, gene sequencing, and rendering.

Volcano provides job scheduling, job management, and queue management for computing applications. Its main features are as follows:

  • Diverse computing frameworks, such as TensorFlow, MPI, and Spark, can run on Kubernetes in containers. Common APIs for batch computing jobs through CRD, various plug-ins, and advanced job lifecycle management are provided.
  • Advanced scheduling capabilities are provided for batch computing and high-performance computing scenarios, including group scheduling, preemptive priority scheduling, packing, resource reservation, and task topology.
  • Queues can be effectively managed for scheduling jobs. Complex job scheduling capabilities such as queue priority and multi-level queues are supported.

Volcano is open source and available on GitHub at https://github.com/volcano-sh/volcano.

Install and configure the Volcano add-on in CCE clusters. For details, see Volcano Scheduling.

NOTE:

When using Volcano as a scheduler, use it to schedule all workloads in the cluster. This prevents resource scheduling conflicts caused by multiple schedulers working at the same time.

Installing the Add-on

  1. Log in to the CCE console and click the cluster name to access the cluster console. Choose Add-ons in the navigation pane, locate Volcano Scheduler on the right, and click Install.
  2. On the Install Add-on page, configure the specifications.

    Table 1 Add-on configuration

    Parameter

    Description

    Add-on Specifications

    Select Standalone, HA, or Custom for Add-on Specifications.

    Pods

    Number of pods that will be created to match the selected add-on specifications.

    If you select Custom, you can adjust the number of pods as required.

    Multi-AZ

    • Preferred: Deployment pods of the add-on will be preferentially scheduled to nodes in different AZs. If all the nodes in the cluster are deployed in the same AZ, the pods will be scheduled to that AZ.
    • Required: Deployment pods of the add-on will be forcibly scheduled to nodes in different AZs. If there are fewer AZs than pods, the extra pods will fail to run.

    Containers

    CPU and memory quotas of the container allowed for the selected add-on specifications.

    If you select Custom, the recommended values for volcano-controller and volcano-scheduler are as follows:

    • If the cluster has fewer than 100 nodes, retain the default configuration: a CPU request of 500m and limit of 2000m, and a memory request of 500 MiB and limit of 2000 MiB.
    • If the cluster has 100 or more nodes, increase the CPU request by 500m and the memory request by 1000 MiB for every 100 nodes (10,000 pods) in the cluster. Set the CPU limit 1500m higher than the CPU request and the memory limit 1000 MiB higher than the memory request.
      NOTE:

      Recommended formula for calculating the request value:

      • CPU request value: Multiply the number of target nodes by the number of target pods, find the nearest specification in Table 2 whose Number of cluster nodes x Number of pods is not smaller than that product (that is, round up), and use the request and limit values of that specification.

        For example, for 2000 nodes and 20,000 pods, Number of target nodes x Number of target pods = 40 million. The nearest specification at or above this value is 700/70,000 (Number of cluster nodes x Number of pods = 49 million). According to Table 2, set the CPU request to 4000m and the CPU limit to 5500m.

      • Memory request value: It is recommended that 2.4 GiB memory be allocated to every 1000 nodes and 1 GiB memory be allocated to every 10,000 pods. The memory request value is the sum of these two values. (The obtained value may be different from the recommended value in Table 2. You can use either of them.)

        Memory request = Number of target nodes/1000 x 2.4 GiB + Number of target pods/10000 x 1 GiB

        For example, for 2000 nodes and 20,000 pods, the memory request value is 6.8 GiB, that is, 2000/1000 x 2.4 GiB + 20000/10000 x 1 GiB.

    Table 2 Recommended values for volcano-controller and volcano-scheduler

    Nodes/Pods in a Cluster   CPU Request (m)   CPU Limit (m)   Memory Request (MiB)   Memory Limit (MiB)
    50/5,000                  500               2000            500                    2000
    100/10,000                1000              2500            1500                   2500
    200/20,000                1500              3000            2500                   3500
    300/30,000                2000              3500            3500                   4500
    400/40,000                2500              4000            4500                   5500
    500/50,000                3000              4500            5500                   6500
    600/60,000                3500              5000            6500                   7500
    700/70,000                4000              5500            7500                   8500
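
The sizing guidance for volcano-controller and volcano-scheduler can be sketched in a few lines of Python. This is a minimal illustration, not part of the add-on: the function names are hypothetical, and "round up" is assumed to mean choosing the first Table 2 row whose nodes x pods product is at least the target product.

```python
# Rows of Table 2: (nodes, pods, cpu_request_m, cpu_limit_m, mem_request_mib, mem_limit_mib)
TABLE_2 = [
    (50, 5_000, 500, 2000, 500, 2000),
    (100, 10_000, 1000, 2500, 1500, 2500),
    (200, 20_000, 1500, 3000, 2500, 3500),
    (300, 30_000, 2000, 3500, 3500, 4500),
    (400, 40_000, 2500, 4000, 4500, 5500),
    (500, 50_000, 3000, 4500, 5500, 6500),
    (600, 60_000, 3500, 5000, 6500, 7500),
    (700, 70_000, 4000, 5500, 7500, 8500),
]


def recommend_cpu(target_nodes: int, target_pods: int) -> tuple[int, int]:
    """Return (cpu_request_m, cpu_limit_m) for the target cluster size.

    Assumption: pick the first row whose nodes x pods product is not
    smaller than the target product; fall back to the largest row.
    """
    product = target_nodes * target_pods
    for nodes, pods, cpu_request, cpu_limit, _, _ in TABLE_2:
        if nodes * pods >= product:
            return cpu_request, cpu_limit
    return TABLE_2[-1][2], TABLE_2[-1][3]


def recommend_memory_request_gib(target_nodes: int, target_pods: int) -> float:
    """Memory request = nodes/1000 x 2.4 GiB + pods/10000 x 1 GiB."""
    return target_nodes / 1000 * 2.4 + target_pods / 10000 * 1.0


if __name__ == "__main__":
    print(recommend_cpu(2000, 20_000))
    print(round(recommend_memory_request_gib(2000, 20_000), 1))
```

For the worked example of 2000 nodes and 20,000 pods, the lookup lands on the 700/70,000 row and the memory formula gives 6.8 GiB.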

  3. Configure the add-on parameters.

    Configure parameters of the default volcano scheduler. For details, see Table 4.
    colocation_enable: ''
    default_scheduler_conf:
      actions: 'allocate, backfill'
      tiers:
        - plugins:
            - name: 'priority'
            - name: 'gang'
            - name: 'conformance'
            - name: 'lifecycle'
              arguments:
                lifecycle.MaxGrade: 10
                lifecycle.MaxScore: 200.0
                lifecycle.SaturatedTresh: 1.0
                lifecycle.WindowSize: 10
        - plugins:
            - name: 'drf'
            - name: 'predicates'
            - name: 'nodeorder'
        - plugins:
            - name: 'cce-gpu-topology-predicate'
            - name: 'cce-gpu-topology-priority'
            - name: 'cce-gpu'
        - plugins:
            - name: 'nodelocalvolume'
            - name: 'nodeemptydirvolume'
            - name: 'nodeCSIscheduling'
            - name: 'networkresource'
    tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 60
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 60
    Table 3 Advanced Volcano configuration parameters

    Parameter

    Function

    Description

    Demonstration

    default_scheduler_conf

    Used to schedule pods. It consists of a series of actions and plug-ins and features high scalability. You can specify and implement actions and plug-ins based on your requirements.

    It consists of actions and tiers.

    • actions: defines the types and sequence of actions to be executed by the scheduler.
    • tiers: configures the plug-in list.

    None

    actions

    Actions to be executed in each scheduling phase. The configured action sequence is the scheduler execution sequence. For details, see Actions.

    The scheduler traverses all jobs to be scheduled and performs actions such as enqueue, allocate, preempt, and backfill in the configured sequence to find the most appropriate node for each job.

    The following options are supported:

    • enqueue: uses a series of filtering algorithms to select the tasks that are ready to be scheduled and sends them to the queue to wait for scheduling. After this action, the task status changes from pending to inqueue.
    • allocate: selects the most suitable node based on a series of pre-selection and selection algorithms.
    • preempt: performs preemption scheduling for tasks with higher priorities in the same queue based on priority rules.
    • backfill: schedules pending tasks as much as possible to maximize the utilization of node resources.
    actions: 'allocate, backfill'
    NOTE:

    When configuring actions, use either preempt or enqueue, not both.

    plugins

    Implementation details of algorithms in actions based on different scenarios. For details, see Plugins.

    For details, see Table 4.

    None

    tolerations

    Tolerance of the add-on to node taints.

    By default, the add-on can run on nodes that have the node.kubernetes.io/not-ready or node.kubernetes.io/unreachable taint with the NoExecute effect, but it is evicted from such nodes after 60 seconds.

    tolerations:
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 60
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 60
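
The rules for the actions parameter in Table 3 can be checked before a configuration is submitted. The following is a hedged sketch: parse_actions is a hypothetical helper, not an add-on API, and it only encodes what the table states (the four supported actions, order preserved, and preempt and enqueue not used together).

```python
SUPPORTED_ACTIONS = ("enqueue", "allocate", "preempt", "backfill")


def parse_actions(actions: str) -> list[str]:
    """Split an actions string such as 'allocate, backfill' into an ordered list.

    The configured order is kept because it is the scheduler execution order.
    """
    names = [name.strip() for name in actions.split(",") if name.strip()]
    unknown = [name for name in names if name not in SUPPORTED_ACTIONS]
    if unknown:
        raise ValueError(f"unsupported actions: {unknown}")
    if "preempt" in names and "enqueue" in names:
        raise ValueError("use either preempt or enqueue, not both")
    return names


if __name__ == "__main__":
    print(parse_actions("allocate, backfill"))
```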
    Table 4 Supported plug-ins

    Plug-in

    Function

    Description

    Demonstration

    binpack

    Schedule pods to nodes with high resource usage rather than to lightly loaded nodes, which reduces resource fragmentation.

    arguments:

    • binpack.weight: weight of the binpack plug-in.
    • binpack.cpu: ratio of CPUs to all resources. The parameter value defaults to 1.
    • binpack.memory: ratio of memory resources to all resources. The parameter value defaults to 1.
    • binpack.resources: other custom resource types requested by the pod, for example, nvidia.com/gpu. Multiple types can be configured and be separated by commas (,).
    • binpack.resources.<your_resource>: weight of your custom resource in all resources. Multiple types of resources can be added. <your_resource> indicates the resource type defined in binpack.resources, for example, binpack.resources.nvidia.com/gpu.
    - plugins:
      - name: binpack
        arguments:
          binpack.weight: 10
          binpack.cpu: 1
          binpack.memory: 1
          binpack.resources: nvidia.com/gpu, example.com/foo
          binpack.resources.nvidia.com/gpu: 2
          binpack.resources.example.com/foo: 3
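
To make the effect of the binpack weights concrete, here is a simplified scoring sketch. It is an illustration only and is not the add-on's implementation: the function name, the per-resource formula (weighted utilization after placing the pod), and the sample numbers are all assumptions.

```python
def binpack_score(requested, used, allocatable, resource_weights, plugin_weight=1):
    """Score a node higher the fuller it would be after placing the pod.

    requested, used, allocatable: dicts of resource name -> amount.
    resource_weights: per-resource weights (for example binpack.cpu, binpack.memory).
    plugin_weight: overall weight (binpack.weight).
    """
    total = 0.0
    weight_sum = 0
    for resource, request in requested.items():
        weight = resource_weights.get(resource, 0)
        if request == 0 or weight == 0:
            continue  # resource not requested or not weighted
        capacity = allocatable.get(resource, 0)
        after = used.get(resource, 0) + request
        if capacity == 0 or after > capacity:
            return 0.0  # the pod does not fit on this node
        total += weight * after / capacity
        weight_sum += weight
    if weight_sum == 0:
        return 0.0
    return plugin_weight * total / weight_sum


if __name__ == "__main__":
    pod = {"cpu": 1, "memory": 2}
    node = {"cpu": 8, "memory": 16}
    weights = {"cpu": 1, "memory": 1}
    busy = binpack_score(pod, {"cpu": 6, "memory": 12}, node, weights, plugin_weight=10)
    idle = binpack_score(pod, {"cpu": 0, "memory": 0}, node, weights, plugin_weight=10)
    print(busy, idle)
```

In this sketch the busier node receives the higher score, which is the packing behavior the plug-in description refers to.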

    conformance

    Prevent key pods, such as the pods in the kube-system namespace, from being preempted.

    None

    - plugins:
      - name: 'priority'
      - name: 'gang'
        enablePreemptable: false
      - name: 'conformance'

    lifecycle

    Collect statistics on how services scale and preferentially schedule pods with similar lifecycles to the same node. Combined with the horizontal scaling capability of the autoscaler, this allows resources to be scaled in and released quickly, reducing costs and improving resource utilization.

    1. Collects statistics on the lifecycle of pods in the service load and schedules pods with similar lifecycles to the same node.

    2. For a cluster configured with an automatic scaling policy, adjust the scale-in annotation of the node to preferentially scale in the node with low usage.

    arguments:
    • lifecycle.WindowSize: The value is an integer greater than or equal to 1 and defaults to 10.

      Number of replica count changes to record. If the load changes regularly and periodically, decrease the value. If the load changes irregularly and the number of replicas changes frequently, increase the value. If the value is too large, the learning period is prolonged and too many events are recorded.

    • lifecycle.MaxGrade: The value is an integer greater than or equal to 3 and defaults to 3.

      It indicates levels of replicas. For example, if the value is set to 3, the replicas are classified into three levels. If the load changes regularly and periodically, decrease the value. If the load changes irregularly, increase the value. Setting an excessively small value may result in inaccurate lifecycle forecasts.

    • lifecycle.MaxScore: float64 floating point number. The value must be greater than or equal to 50.0. The default value is 200.0.

      Maximum score (equivalent to the weight) of the lifecycle plugin.

    • lifecycle.SaturatedTresh: float64 floating point number. If the value is less than 0.5, use 0.5. If the value is greater than 1, use 1. The default value is 0.8.

      Threshold for determining whether the node usage is too high. If the node usage exceeds the threshold, the scheduler preferentially schedules jobs to other nodes.

    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance
      - name: lifecycle
        arguments:
          lifecycle.MaxGrade: 10
          lifecycle.MaxScore: 200.0
          lifecycle.SaturatedTresh: 1.0
          lifecycle.WindowSize: 10
    NOTE:
    • For nodes that you do not want scaled in, manually mark them as long-period nodes by adding the annotation volcano.sh/long-lifecycle-node: true. For an unmarked node, the lifecycle plugin automatically marks the node based on the lifecycle of the load on the node.
    • The default value of MaxScore is 200.0, which is twice the weight of other plugins. When the lifecycle plugin does not have obvious effect or conflicts with other plugins, disable other plugins or increase the value of MaxScore.
    • After the scheduler is restarted, the lifecycle plugin needs to re-record the load change. The optimal scheduling effect can be achieved only after several periods of statistics are collected.
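
The value ranges listed for the lifecycle arguments can be checked with a small helper before the configuration is applied. This is a sketch under assumptions: normalize_lifecycle_args is a hypothetical function, the defaults are the ones stated in the argument descriptions, and rejecting out-of-range values with an error is an assumption (only the clamping of lifecycle.SaturatedTresh to the range 0.5 to 1 is described above).

```python
def normalize_lifecycle_args(args: dict) -> dict:
    """Apply the documented defaults and ranges to lifecycle plug-in arguments."""
    window = int(args.get("lifecycle.WindowSize", 10))
    if window < 1:
        raise ValueError("lifecycle.WindowSize must be >= 1")

    grade = int(args.get("lifecycle.MaxGrade", 3))
    if grade < 3:
        raise ValueError("lifecycle.MaxGrade must be >= 3")

    score = float(args.get("lifecycle.MaxScore", 200.0))
    if score < 50.0:
        raise ValueError("lifecycle.MaxScore must be >= 50.0")

    # Values below 0.5 are treated as 0.5, values above 1 as 1.
    threshold = float(args.get("lifecycle.SaturatedTresh", 0.8))
    threshold = min(1.0, max(0.5, threshold))

    return {
        "lifecycle.WindowSize": window,
        "lifecycle.MaxGrade": grade,
        "lifecycle.MaxScore": score,
        "lifecycle.SaturatedTresh": threshold,
    }


if __name__ == "__main__":
    print(normalize_lifecycle_args({"lifecycle.MaxGrade": 10, "lifecycle.SaturatedTresh": 1.0}))
```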

    gang

    Consider a group of pods as a whole for resource allocation. This plug-in checks whether the number of scheduled pods in a job meets the minimum requirements for running the job. If yes, all pods in the job will be scheduled. If no, the pods will not be scheduled.

    NOTE:

    When a gang scheduling policy is used, autoscaler scale-outs are not triggered if the remaining resources in the cluster are at least half of, but still less than, the minimum resources required for running a job.

    • enablePreemptable:
      • true: Preemption enabled
      • false: Preemption not enabled
    • enableJobStarving:
      • true: Resources are preempted based on the minAvailable setting of jobs.
      • false: Resources are preempted based on job replicas.
      NOTE:
      • The default value of minAvailable for Kubernetes-native workloads (such as Deployments) is 1. It is a good practice to set enableJobStarving to false.
      • In AI and big data scenarios, you can specify the minAvailable value when creating a vcjob. It is a good practice to set enableJobStarving to true.
      • In Volcano versions earlier than v1.11.5, enableJobStarving is set to true by default. In Volcano versions later than v1.11.5, enableJobStarving is set to false by default.
    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
        enableJobStarving: false
      - name: conformance
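
The all-or-nothing behavior of gang scheduling can be illustrated with a toy allocator. This is a simplified sketch, not the plug-in's code: gang_allocate is a hypothetical function that considers CPU only and places pods first-fit, and it commits the placement only when at least min_available pods fit.

```python
def gang_allocate(pod_cpu_requests, node_free_cpu, min_available):
    """Return {pod: node} if at least min_available pods fit, else {}.

    pod_cpu_requests: dict of pod name -> requested CPU.
    node_free_cpu: dict of node name -> free CPU.
    """
    free = dict(node_free_cpu)  # work on a copy so a failed attempt changes nothing
    placement = {}
    for pod, cpu in pod_cpu_requests.items():
        for node, available in free.items():
            if available >= cpu:
                free[node] = available - cpu
                placement[pod] = node
                break
    if len(placement) >= min_available:
        return placement
    return {}  # minimum not met: schedule none of the pods


if __name__ == "__main__":
    pods = {"worker-0": 2, "worker-1": 2, "worker-2": 2}
    print(gang_allocate(pods, {"node-1": 4}, min_available=3))
    print(gang_allocate(pods, {"node-1": 4, "node-2": 2}, min_available=3))
```

With only one 4-CPU node, two of the three workers would fit, so nothing is scheduled. Once a second node provides room for the third worker, the whole group is placed.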

    priority

    Schedule based on custom load priorities.

    None

    - plugins:
      - name: priority
      - name: gang
        enablePreemptable: false
      - name: conformance

    overcommit

    The cluster's resources are multiplied by an overcommit factor when workloads are enqueued, which improves the workload enqueuing efficiency. If all workloads are Deployments, remove this plugin or set the overcommit factor to 2.0.

    NOTE:

    This plug-in is supported in Volcano 1.6.5 and later versions.

    arguments:

    • overcommit-factor: inflation factor, which defaults to 1.2.
    - plugins:
      - name: overcommit
        arguments:
          overcommit-factor: 2.0

    drf

    The Dominant Resource Fairness (DRF) scheduling algorithm, which schedules jobs based on their dominant resource share. Jobs with a smaller resource share will be scheduled with a higher priority.

    None

    - plugins:
      - name: 'drf'
      - name: 'predicates'
      - name: 'nodeorder'
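
The idea behind DRF can be shown with a short calculation. The sketch below follows the general Dominant Resource Fairness algorithm rather than the plug-in's source code; the function names and the sample capacities are hypothetical.

```python
def dominant_share(allocated: dict, capacity: dict) -> float:
    """Largest fraction of any single cluster resource held by a job."""
    return max(allocated.get(resource, 0) / capacity[resource] for resource in capacity)


def next_job(jobs: dict, capacity: dict) -> str:
    """Pick the job with the smallest dominant share to schedule next."""
    return min(jobs, key=lambda name: dominant_share(jobs[name], capacity))


if __name__ == "__main__":
    capacity = {"cpu": 100, "memory": 400}
    jobs = {
        "job-a": {"cpu": 20, "memory": 40},   # shares: cpu 0.2, memory 0.1
        "job-b": {"cpu": 10, "memory": 120},  # shares: cpu 0.1, memory 0.3
    }
    print(next_job(jobs, capacity))
```

Here job-a has a dominant share of 0.2 (CPU) and job-b has 0.3 (memory), so job-a is served first.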

    predicates

    Determine whether a task can be bound to a node by using a series of evaluation algorithms, such as node/pod affinity, taint tolerance, node repetition, volume limits, and volume zone matching.

    None

    - plugins:
      - name: 'drf'
      - name: 'predicates'
      - name: 'nodeorder'

    nodeorder

    A common algorithm for selecting nodes. Nodes are scored in simulated resource allocation to find the most suitable node for the current job.

    Scoring parameters:

    • nodeaffinity.weight: Pods are scheduled based on node affinity. This parameter defaults to 1.
    • podaffinity.weight: Pods are scheduled based on pod affinity. This parameter defaults to 1.
    • leastrequested.weight: Pods are scheduled to the node with the least requested resources. This parameter defaults to 1.
    • balancedresource.weight: Pods are scheduled to the node with balanced resource allocation. This parameter defaults to 1.
    • mostrequested.weight: Pods are scheduled to the node with the most requested resources. This parameter defaults to 0.
    • tainttoleration.weight: Pods are scheduled to the node with a high taint tolerance. This parameter defaults to 1.
    • imagelocality.weight: Pods are scheduled to the node where the required images exist. This parameter defaults to 1.
    • selectorspread.weight: Pods are evenly scheduled to different nodes. This parameter defaults to 0.
    • podtopologyspread.weight: Pods are scheduled based on the pod topology. This parameter defaults to 2.
    - plugins:
      - name: nodeorder
        arguments:
          leastrequested.weight: 1
          mostrequested.weight: 0
          nodeaffinity.weight: 1
          podaffinity.weight: 1
          balancedresource.weight: 1
          tainttoleration.weight: 1
          imagelocality.weight: 1
          volumebinding.weight: 1
          podtopologyspread.weight: 2
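
How the nodeorder weights combine can be sketched as a weighted sum. This is an illustrative assumption about the aggregation, not the plug-in's implementation: the per-scorer scores are hypothetical inputs, and node_score and best_node are hypothetical helpers. The default weights are the ones listed in the scoring parameters.

```python
DEFAULT_WEIGHTS = {
    "nodeaffinity.weight": 1,
    "podaffinity.weight": 1,
    "leastrequested.weight": 1,
    "balancedresource.weight": 1,
    "mostrequested.weight": 0,
    "tainttoleration.weight": 1,
    "imagelocality.weight": 1,
    "selectorspread.weight": 0,
    "podtopologyspread.weight": 2,
}


def node_score(scores: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Sum each scorer's result multiplied by its configured weight.

    scores: dict of scorer name (for example 'leastrequested') -> score.
    """
    return sum(weights.get(f"{name}.weight", 0) * score for name, score in scores.items())


def best_node(per_node_scores: dict, weights: dict = DEFAULT_WEIGHTS) -> str:
    """Return the node with the highest weighted score."""
    return max(per_node_scores, key=lambda node: node_score(per_node_scores[node], weights))


if __name__ == "__main__":
    candidates = {
        "node-1": {"leastrequested": 50, "mostrequested": 80, "podtopologyspread": 10},
        "node-2": {"leastrequested": 30, "mostrequested": 95, "podtopologyspread": 25},
    }
    print(best_node(candidates))
```

With the default weights, mostrequested contributes nothing, so node-1 scores 70 and node-2 scores 80.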

    cce-gpu-topology-predicate

    GPU-topology scheduling preselection algorithm

    None

    - plugins:
      - name: 'cce-gpu-topology-predicate'
      - name: 'cce-gpu-topology-priority'
      - name: 'cce-gpu'

    cce-gpu-topology-priority

    GPU-topology scheduling priority algorithm

    None

    - plugins:
      - name: 'cce-gpu-topology-predicate'
      - name: 'cce-gpu-topology-priority'
      - name: 'cce-gpu'

    cce-gpu

    Allocate GPU resources. Working with the gpu add-on, this plug-in supports fractional GPU configurations.

    None

    - plugins:
      - name: 'cce-gpu-topology-predicate'
      - name: 'cce-gpu-topology-priority'
      - name: 'cce-gpu'

    numa-aware

    NUMA affinity scheduling.

    arguments:

    • weight: weight of the numa-aware plug-in
    - plugins:
      - name: 'nodelocalvolume'
      - name: 'nodeemptydirvolume'
      - name: 'nodeCSIscheduling'
      - name: 'networkresource'
        arguments:
          NetworkType: vpc-router
      - name: numa-aware
        arguments:
          weight: 10

    networkresource

    Preselect and filter nodes based on ENI requirements. The parameters are passed in by CCE and do not need to be manually configured.

    arguments:

    • NetworkType: network type (eni or vpc-router)
    - plugins:
      - name: 'nodelocalvolume'
      - name: 'nodeemptydirvolume'
      - name: 'nodeCSIscheduling'
      - name: 'networkresource'
        arguments:
          NetworkType: vpc-router

    nodelocalvolume

    Filter out nodes that do not meet local volume requirements.

    None

    - plugins:
      - name: 'nodelocalvolume'
      - name: 'nodeemptydirvolume'
      - name: 'nodeCSIscheduling'
      - name: 'networkresource'

    nodeemptydirvolume

    Filter out nodes that do not meet the emptyDir requirements.

    None

    - plugins:
      - name: 'nodelocalvolume'
      - name: 'nodeemptydirvolume'
      - name: 'nodeCSIscheduling'
      - name: 'networkresource'

    nodeCSIscheduling

    Filter out nodes on which everest is malfunctioning.

    None

    - plugins:
      - name: 'nodelocalvolume'
      - name: 'nodeemptydirvolume'
      - name: 'nodeCSIscheduling'
      - name: 'networkresource'

  4. Click Install.

Components

Table 5 Volcano components

Container Component   Resource Type   Description
volcano-scheduler     Deployment      Schedule pods.
volcano-controller    Deployment      Synchronize CRDs.
volcano-admission     Deployment      Webhook server, which verifies and modifies resources such as pods and jobs
volcano-agent         DaemonSet       Cloud native hybrid agent, which is used for node QoS assurance, CPU burst, and dynamic resource oversubscription
resource-exporter     DaemonSet       Report the NUMA topology information of nodes.

Modifying the volcano-scheduler Configurations Using the Console

volcano-scheduler is the component responsible for pod scheduling. It consists of a series of actions and plug-ins. Actions define what is executed in each scheduling step, and plug-ins provide the algorithm details of the actions in different scenarios. volcano-scheduler is highly scalable. You can specify and implement actions and plug-ins based on your requirements.

Volcano allows you to configure the scheduler during installation, upgrade, and editing. The configuration will be synchronized to volcano-scheduler-configmap.

This section describes how to configure volcano-scheduler.

NOTE:

Only Volcano v1.7.1 and later versions support this function. On the new plugin page, options such as plugins.eas_service and resource_exporter_enable are replaced by default_scheduler_conf.

Log in to the CCE console and access the cluster console. Choose Add-ons in the navigation pane. On the right of the page, locate volcano and click Install or Upgrade. In the Parameters area, configure the volcano-scheduler parameters.

  • Using resource_exporter:
    {
        "ca_cert": "",
        "default_scheduler_conf": {
            "actions": "allocate, backfill",
            "tiers": [
                {
                    "plugins": [
                        {
                            "name": "priority"
                        },
                        {
                            "name": "gang"
                        },
                        {
                            "name": "conformance"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "drf"
                        },
                        {
                            "name": "predicates"
                        },
                        {
                            "name": "nodeorder"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "cce-gpu-topology-predicate"
                        },
                        {
                            "name": "cce-gpu-topology-priority"
                        },
                        {
                            "name": "cce-gpu"
                        },
                        {
                            "name": "numa-aware"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "nodelocalvolume"
                        },
                        {
                            "name": "nodeemptydirvolume"
                        },
                        {
                            "name": "nodeCSIscheduling"
                        },
                        {
                            "name": "networkresource"
                        }
                    ]
                }
            ]
        },
        "server_cert": "",
        "server_key": ""
    }

    Adding the numa-aware plug-in also enables resource_exporter, so you can use the functions of the numa-aware plug-in and resource_exporter at the same time.

  • Using eas_service:
    {
        "ca_cert": "",
        "default_scheduler_conf": {
            "actions": "allocate, backfill",
            "tiers": [
                {
                    "plugins": [
                        {
                            "name": "priority"
                        },
                        {
                            "name": "gang"
                        },
                        {
                            "name": "conformance"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "drf"
                        },
                        {
                            "name": "predicates"
                        },
                        {
                            "name": "nodeorder"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "cce-gpu-topology-predicate"
                        },
                        {
                            "name": "cce-gpu-topology-priority"
                        },
                        {
                            "name": "cce-gpu"
                        },
                        {
                            "name": "eas",
                            "custom": {
                                "availability_zone_id": "",
                                "driver_id": "",
                                "endpoint": "",
                                "flavor_id": "",
                                "network_type": "",
                                "network_virtual_subnet_id": "",
                                "pool_id": "",
                                "project_id": "",
                                "secret_name": "eas-service-secret"
                            }
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "nodelocalvolume"
                        },
                        {
                            "name": "nodeemptydirvolume"
                        },
                        {
                            "name": "nodeCSIscheduling"
                        },
                        {
                            "name": "networkresource"
                        }
                    ]
                }
            ]
        },
        "server_cert": "",
        "server_key": ""
    }
  • Using ief:
    {
        "ca_cert": "",
        "default_scheduler_conf": {
            "actions": "allocate, backfill",
            "tiers": [
                {
                    "plugins": [
                        {
                            "name": "priority"
                        },
                        {
                            "name": "gang"
                        },
                        {
                            "name": "conformance"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "drf"
                        },
                        {
                            "name": "predicates"
                        },
                        {
                            "name": "nodeorder"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "cce-gpu-topology-predicate"
                        },
                        {
                            "name": "cce-gpu-topology-priority"
                        },
                        {
                            "name": "cce-gpu"
                        },
                        {
                            "name": "ief",
                            "enableBestNode": true
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "nodelocalvolume"
                        },
                        {
                            "name": "nodeemptydirvolume"
                        },
                        {
                            "name": "nodeCSIscheduling"
                        },
                        {
                            "name": "networkresource"
                        }
                    ]
                }
            ]
        },
        "server_cert": "",
        "server_key": ""
    }
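
Before pasting a parameter block like the ones above into the console, it can help to confirm that the text is well-formed JSON and that a given plugin sits in the tier you expect. The following is a minimal sketch of such a check, not part of the add-on or of any CCE tooling. The helper name find_plugin is hypothetical, and the sample text is a compacted copy of the ief example above.

```python
import json


def find_plugin(params: dict, plugin_name: str):
    """Return (tier_index, plugin_dict) for the first match, or None.

    find_plugin is a hypothetical helper written for this sketch.
    """
    tiers = params["default_scheduler_conf"]["tiers"]
    for tier_index, tier in enumerate(tiers):
        for plugin in tier["plugins"]:
            if plugin["name"] == plugin_name:
                return tier_index, plugin
    return None


# Compacted copy of the "Using ief" example parameters.
params_text = """
{
    "ca_cert": "",
    "default_scheduler_conf": {
        "actions": "allocate, backfill",
        "tiers": [
            {"plugins": [{"name": "priority"}, {"name": "gang"}, {"name": "conformance"}]},
            {"plugins": [{"name": "drf"}, {"name": "predicates"}, {"name": "nodeorder"}]},
            {"plugins": [
                {"name": "cce-gpu-topology-predicate"},
                {"name": "cce-gpu-topology-priority"},
                {"name": "cce-gpu"},
                {"name": "ief", "enableBestNode": true}
            ]},
            {"plugins": [
                {"name": "nodelocalvolume"},
                {"name": "nodeemptydirvolume"},
                {"name": "nodeCSIscheduling"},
                {"name": "networkresource"}
            ]}
        ]
    },
    "server_cert": "",
    "server_key": ""
}
"""

# json.loads raises json.JSONDecodeError if the text is not valid JSON.
params = json.loads(params_text)
result = find_plugin(params, "ief")
print(result)
```

The tier index is zero-based, so a plugin in the third tier is reported with index 2.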

Retaining the Original volcano-scheduler-configmap Configurations

To keep using your original volcano-scheduler-configmap configuration after the add-on is upgraded, perform the following steps:

  1. Check and back up the original volcano-scheduler-configmap configuration.

    Example:
    # kubectl edit cm volcano-scheduler-configmap -n kube-system
    apiVersion: v1
    data:
      default-scheduler.conf: |-
        actions: "enqueue, allocate, backfill"
        tiers:
        - plugins:
          - name: priority
          - name: gang
          - name: conformance
        - plugins:
          - name: drf
          - name: predicates
          - name: nodeorder
          - name: binpack
            arguments:
              binpack.cpu: 100
              binpack.weight: 10
              binpack.resources: nvidia.com/gpu
              binpack.resources.nvidia.com/gpu: 10000
        - plugins:
          - name: cce-gpu-topology-predicate
          - name: cce-gpu-topology-priority
          - name: cce-gpu
        - plugins:
          - name: nodelocalvolume
          - name: nodeemptydirvolume
          - name: nodeCSIscheduling
          - name: networkresource

  2. On the console, enter the customized content in the Parameters area in JSON format. The following example corresponds to the configuration backed up in step 1:

    {
        "ca_cert": "",
        "default_scheduler_conf": {
            "actions": "enqueue, allocate, backfill",
            "tiers": [
                {
                    "plugins": [
                        {
                            "name": "priority"
                        },
                        {
                            "name": "gang"
                        },
                        {
                            "name": "conformance"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "drf"
                        },
                        {
                            "name": "predicates"
                        },
                        {
                            "name": "nodeorder"
                        },
                        {
                            "name": "binpack",
                            "arguments": {
                                "binpack.cpu": 100,
                                "binpack.weight": 10,
                                "binpack.resources": "nvidia.com/gpu",
                                "binpack.resources.nvidia.com/gpu": 10000
                            }
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "cce-gpu-topology-predicate"
                        },
                        {
                            "name": "cce-gpu-topology-priority"
                        },
                        {
                            "name": "cce-gpu"
                        }
                    ]
                },
                {
                    "plugins": [
                        {
                            "name": "nodelocalvolume"
                        },
                        {
                            "name": "nodeemptydirvolume"
                        },
                        {
                            "name": "nodeCSIscheduling"
                        },
                        {
                            "name": "networkresource"
                        }
                    ]
                }
            ]
        },
        "server_cert": "",
        "server_key": ""
    }
    NOTE:

    Using this function overwrites the existing content of volcano-scheduler-configmap. When you perform an upgrade, check whether volcano-scheduler-configmap has been modified. If it has, synchronize the modifications to the parameters on the upgrade page.
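
The two steps above amount to wrapping the backed-up scheduler configuration in the parameter envelope shown in step 2. The following is a minimal sketch of that mapping, assuming the YAML content of default-scheduler.conf has already been loaded into a Python dictionary (for example with a YAML parser of your choice, which is not shown here). The function name to_addon_parameters is hypothetical and is not part of any CCE or Volcano tooling.

```python
import json


def to_addon_parameters(scheduler_conf: dict) -> str:
    """Wrap a scheduler configuration in the parameter envelope from step 2.

    to_addon_parameters is a hypothetical helper written for this sketch.
    The certificate fields are left empty, as in the examples in this section.
    """
    params = {
        "ca_cert": "",
        "default_scheduler_conf": scheduler_conf,
        "server_cert": "",
        "server_key": "",
    }
    return json.dumps(params, indent=4)


# Dictionary form of the default-scheduler.conf content backed up in step 1.
backed_up_conf = {
    "actions": "enqueue, allocate, backfill",
    "tiers": [
        {"plugins": [{"name": "priority"}, {"name": "gang"}, {"name": "conformance"}]},
        {"plugins": [
            {"name": "drf"},
            {"name": "predicates"},
            {"name": "nodeorder"},
            {"name": "binpack", "arguments": {
                "binpack.cpu": 100,
                "binpack.weight": 10,
                "binpack.resources": "nvidia.com/gpu",
                "binpack.resources.nvidia.com/gpu": 10000,
            }},
        ]},
        {"plugins": [
            {"name": "cce-gpu-topology-predicate"},
            {"name": "cce-gpu-topology-priority"},
            {"name": "cce-gpu"},
        ]},
        {"plugins": [
            {"name": "nodelocalvolume"},
            {"name": "nodeemptydirvolume"},
            {"name": "nodeCSIscheduling"},
            {"name": "networkresource"},
        ]},
    ],
}

parameters_json = to_addon_parameters(backed_up_conf)
print(parameters_json)
```

Comparing the parsed output with the parameters currently shown on the upgrade page is one way to spot modifications that still need to be synchronized.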

Uninstalling the Volcano Add-on

Uninstalling the add-on deletes all custom Volcano resources (Table 6), including any resources you have created. Reinstalling the add-on does not inherit or restore tasks that existed before the uninstallation. It is a good practice to uninstall the Volcano add-on only when no custom Volcano resources are in use in the cluster.

Table 6 Custom Volcano resources

Item           API Group               API Version   Resource Level
Command        bus.volcano.sh          v1alpha1      Namespaced
Job            batch.volcano.sh        v1alpha1      Namespaced
Numatopology   nodeinfo.volcano.sh     v1alpha1      Cluster
PodGroup       scheduling.volcano.sh   v1beta1       Namespaced
Queue          scheduling.volcano.sh   v1beta1       Cluster
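
To check whether any of the resources in Table 6 are still in use before uninstalling, you can list each resource type with kubectl. The following sketch only builds and prints the commands and does not run them. The plural resource names (commands, jobs, numatopologies, podgroups, queues) are assumptions based on the usual <plural>.<group> naming of custom resources, so confirm them against the resource definitions in your cluster first.

```python
# (plural name, API group, namespaced?) for each row of Table 6.
# The plural names are assumptions; the groups and scopes come from the table.
VOLCANO_RESOURCES = [
    ("commands", "bus.volcano.sh", True),
    ("jobs", "batch.volcano.sh", True),
    ("numatopologies", "nodeinfo.volcano.sh", False),
    ("podgroups", "scheduling.volcano.sh", True),
    ("queues", "scheduling.volcano.sh", False),
]


def build_list_commands():
    """Build one 'kubectl get' command (as an argument list) per resource type."""
    commands = []
    for plural, group, namespaced in VOLCANO_RESOURCES:
        cmd = ["kubectl", "get", f"{plural}.{group}"]
        if namespaced:
            # Namespaced resources are listed across all namespaces.
            cmd.append("--all-namespaces")
        commands.append(cmd)
    return commands


list_commands = build_list_commands()
for cmd in list_commands:
    print(" ".join(cmd))
```

Using the fully qualified <plural>.<group> form avoids ambiguity with built-in resources of the same name, such as Kubernetes batch jobs.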
