volcano
Introduction
Volcano is a batch processing platform based on Kubernetes. It provides a series of features required by machine learning, deep learning, bioinformatics, genomics, and other big data applications, serving as a powerful supplement to Kubernetes capabilities.
Volcano provides general-purpose, high-performance computing capabilities, such as a job scheduling engine, heterogeneous chip management, and job running management, and serves end users through computing frameworks for different industries, such as AI, big data, gene sequencing, and rendering. (Volcano has been open-sourced on GitHub.)
Volcano provides job scheduling, job management, and queue management for computing applications. Its main features are as follows:
- Diverse computing frameworks, such as TensorFlow, MPI, and Spark, can run on Kubernetes in containers. Common APIs for batch computing jobs are provided through CRDs, along with various add-ons and advanced job lifecycle management.
- Advanced scheduling capabilities are provided for batch computing and high-performance computing scenarios, including group scheduling, preemptive priority scheduling, packing, resource reservation, and task topology.
- Queues can be effectively managed for scheduling jobs. Complex job scheduling capabilities such as queue priority and multi-level queues are supported.
Open source community: https://github.com/volcano-sh/volcano
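As a quick illustration of these concepts, the sketch below defines a Queue and a Volcano Job that is gang-scheduled into it. The object names, image, and resource values are placeholders rather than values from this document; the API groups (scheduling.volcano.sh/v1beta1 and batch.volcano.sh/v1alpha1) are the ones used by open-source Volcano.

```yaml
# Illustrative names only; adjust the queue, job, and image to your workload.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: example-queue
spec:
  weight: 1            # relative share of cluster resources among queues
  reclaimable: true
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-example
spec:
  schedulerName: volcano     # schedule this job with the Volcano scheduler
  queue: example-queue       # submit the job to the queue defined above
  minAvailable: 2            # gang scheduling: start only when 2 pods can be placed
  tasks:
    - name: worker
      replicas: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: busybox:latest
              command: ["sh", "-c", "echo hello from volcano && sleep 30"]
              resources:
                requests:
                  cpu: "500m"
```

Because minAvailable is 2, the scheduler starts the job only when both worker pods can be placed at the same time.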
Installing the Add-on
- Log in to the UCS console and click the cluster name to go to its details page. In the navigation pane, choose Add-ons. Locate Volcano and click Install.
- Select Standalone, Custom, or HA for Add-on Specifications.
If you select Custom, the following requests and limits are recommended for volcano-controller and volcano-scheduler:
- If the number of nodes is less than 100, retain the default configuration. The requested CPU is 500m, and the limit is 2000m. The requested memory is 500 Mi, and the limit is 2000 Mi.
- If the number of nodes is greater than 100, increase the requested CPU by 500m and the requested memory by 1,000 Mi each time 100 nodes (10,000 pods) are added. Set the CPU limit to 1,500m more than the CPU request and the memory limit to 1,000 Mi more than the memory request.
Formulas for calculating the requests and limits:
- CPU: Multiply the number of nodes by the number of pods, find the closest value of Number of nodes x Number of pods in Table 1, and round up to the request and limit of that specification.
For example, for 2,000 nodes and 20,000 pods, Number of target nodes x Number of target pods = 40 million, which is closest to the 700/70,000 specification (Number of nodes x Number of pods = 49 million). You are advised to set the CPU request to 4000m and the limit to 5500m.
- Memory: Allocate 2.4 GiB of memory to every 1,000 nodes and 1 GiB of memory to every 10,000 pods. The memory request is the sum of the two values. (The obtained value may be different from the recommended value in Table 1. You can use either of them.)
Memory request = Number of nodes/1000 x 2.4 GiB + Number of pods/10000 x 1 GiB
For example, for 2,000 nodes and 20,000 pods, the memory request value is 6.8 GiB (2000/1000 x 2.4 GiB + 20000/10000 x 1 GiB).
Table 1 Recommended requests and limits for volcano-controller and volcano-scheduler

| Nodes/Pods in a Cluster | CPU Request (m) | CPU Limit (m) | Memory Request (Mi) | Memory Limit (Mi) |
|---|---|---|---|---|
| 50/5,000 | 500 | 2,000 | 500 | 2,000 |
| 100/10,000 | 1,000 | 2,500 | 1,500 | 2,500 |
| 200/20,000 | 1,500 | 3,000 | 2,500 | 3,500 |
| 300/30,000 | 2,000 | 3,500 | 3,500 | 4,500 |
| 400/40,000 | 2,500 | 4,000 | 4,500 | 5,500 |
| 500/50,000 | 3,000 | 4,500 | 5,500 | 6,500 |
| 600/60,000 | 3,500 | 5,000 | 6,500 | 7,500 |
| 700/70,000 | 4,000 | 5,500 | 7,500 | 8,500 |
- Configure parameters of the default volcano scheduler. For details, see Table 2.
```yaml
colocation_enable: ''
default_scheduler_conf:
  actions: 'allocate, backfill'
  tiers:
  - plugins:
    - name: 'priority'
    - name: 'gang'
    - name: 'conformance'
  - plugins:
    - name: 'drf'
    - name: 'predicates'
    - name: 'nodeorder'
  - plugins:
    - name: 'cce-gpu-topology-predicate'
    - name: 'cce-gpu-topology-priority'
    - name: 'cce-gpu'
  - plugins:
    - name: 'nodelocalvolume'
    - name: 'nodeemptydirvolume'
    - name: 'nodeCSIscheduling'
    - name: 'networkresource'
```
Table 2 Volcano add-ons

binpack
Function: Schedules pods to nodes with high resource utilization to reduce resource fragments.
Description:
- binpack.weight: weight of the binpack add-on.
- binpack.cpu: percentage of CPU. The default value is 1.
- binpack.memory: percentage of memory. The default value is 1.
- binpack.resources: resource types.
Demonstration:
```yaml
- plugins:
  - name: binpack
    arguments:
      binpack.weight: 10
      binpack.cpu: 1
      binpack.memory: 1
      binpack.resources: nvidia.com/gpu, example.com/foo
      binpack.resources.nvidia.com/gpu: 2
      binpack.resources.example.com/foo: 3
```

conformance
Function: Prevents key pods, such as those in the kube-system namespace, from being preempted.

gang
Function: Considers a group of pods as a whole when allocating resources.

priority
Function: Schedules pods based on custom workload priorities.

overcommit
Function: Cluster resources are amplified by a certain factor during scheduling to improve workload enqueuing efficiency. If all workloads are Deployments, remove this add-on or set the raising factor to 2.0.
Description:
- overcommit-factor: raising factor. The default value is 1.2.
Demonstration:
```yaml
- plugins:
  - name: overcommit
    arguments:
      overcommit-factor: 2.0
```

drf
Function: Schedules resources based on the dominant resource of each container group. The group with the smallest dominant resource share is scheduled first.

predicates
Function: Determines whether a task is bound to a node using a series of evaluation algorithms, such as node/pod affinity, taint toleration, node port conflicts, volume limits, and volume zone matching.

nodeorder
Function: Scores all nodes for a task by using a series of scoring algorithms.
Description:
- nodeaffinity.weight: pods are scheduled based on node affinity. The default value is 1.
- podaffinity.weight: pods are scheduled based on pod affinity. The default value is 1.
- leastrequested.weight: pods are scheduled to the node with the least requested resources. The default value is 1.
- balancedresource.weight: pods are scheduled to the node with balanced resources. The default value is 1.
- mostrequested.weight: pods are scheduled to the node with the most requested resources. The default value is 0.
- tainttoleration.weight: pods are scheduled to the node with a high taint tolerance. The default value is 1.
- imagelocality.weight: pods are scheduled to the node where the required images already exist. The default value is 1.
- selectorspread.weight: pods are evenly scheduled to different nodes. The default value is 0.
- volumebinding.weight: pods are scheduled to the node with the local PV delayed binding policy. The default value is 1.
- podtopologyspread.weight: pods are scheduled based on pod topology spread constraints. The default value is 2.
Demonstration:
```yaml
- plugins:
  - name: nodeorder
    arguments:
      leastrequested.weight: 1
      mostrequested.weight: 0
      nodeaffinity.weight: 1
      podaffinity.weight: 1
      balancedresource.weight: 1
      tainttoleration.weight: 1
      imagelocality.weight: 1
      volumebinding.weight: 1
      podtopologyspread.weight: 2
```

cce-gpu-topology-predicate
Function: GPU topology scheduling preselection algorithm.

cce-gpu-topology-priority
Function: GPU topology scheduling priority algorithm.

cce-gpu
Function: GPU resource allocation; works with the gpu add-on to support decimal GPU configurations.

numaaware
Function: NUMA topology scheduling.
Description:
- weight: weight of the numa-aware add-on.

networkresource
Function: Preselects and filters nodes based on ENI requirements. The parameters are transferred by CCE and do not need to be manually configured.
Description:
- NetworkType: network type (eni or vpc-router).

nodelocalvolume
Function: Filters out nodes that do not meet local volume requirements.

nodeemptydirvolume
Function: Filters out nodes that do not meet emptyDir requirements.

nodeCSIscheduling
Function: Filters out nodes with everest component exceptions.
- Click Install.
Modifying the volcano-scheduler Configurations Using the Console
Volcano allows you to configure the scheduler during installation, upgrade, and editing. The configuration will be synchronized to volcano-scheduler-configmap.
This section describes how to configure volcano-scheduler.
Only Volcano 1.7.1 and later versions support this function. On the new add-on page, options such as plugins.eas_service and resource_exporter_enable are replaced by default_scheduler_conf.
Log in to the CCE console and access the cluster console. Choose Add-ons in the navigation pane. On the right of the page, locate volcano and click Install or Upgrade. In the Parameters area, configure the volcano-scheduler parameters.
- Using resource_exporter:
{ "ca_cert": "", "default_scheduler_conf": { "actions": "allocate, backfill", "tiers": [ { "plugins": [ { "name": "priority" }, { "name": "gang" }, { "name": "conformance" } ] }, { "plugins": [ { "name": "drf" }, { "name": "predicates" }, { "name": "nodeorder" } ] }, { "plugins": [ { "name": "cce-gpu-topology-predicate" }, { "name": "cce-gpu-topology-priority" }, { "name": "cce-gpu" }, { "name": "numa-aware" # add this also enable resource_exporter } ] }, { "plugins": [ { "name": "nodelocalvolume" }, { "name": "nodeemptydirvolume" }, { "name": "nodeCSIscheduling" }, { "name": "networkresource" } ] } ] }, "server_cert": "", "server_key": "" }
After the parameters are configured, you can use the functions of the numa-aware add-on and resource_exporter at the same time.
- Using eas_service:
{ "ca_cert": "", "default_scheduler_conf": { "actions": "allocate, backfill", "tiers": [ { "plugins": [ { "name": "priority" }, { "name": "gang" }, { "name": "conformance" } ] }, { "plugins": [ { "name": "drf" }, { "name": "predicates" }, { "name": "nodeorder" } ] }, { "plugins": [ { "name": "cce-gpu-topology-predicate" }, { "name": "cce-gpu-topology-priority" }, { "name": "cce-gpu" }, { "name": "eas", "custom": { "availability_zone_id": "", "driver_id": "", "endpoint": "", "flavor_id": "", "network_type": "", "network_virtual_subnet_id": "", "pool_id": "", "project_id": "", "secret_name": "eas-service-secret" } } ] }, { "plugins": [ { "name": "nodelocalvolume" }, { "name": "nodeemptydirvolume" }, { "name": "nodeCSIscheduling" }, { "name": "networkresource" } ] } ] }, "server_cert": "", "server_key": "" }
- Using ief:
{ "ca_cert": "", "default_scheduler_conf": { "actions": "allocate, backfill", "tiers": [ { "plugins": [ { "name": "priority" }, { "name": "gang" }, { "name": "conformance" } ] }, { "plugins": [ { "name": "drf" }, { "name": "predicates" }, { "name": "nodeorder" } ] }, { "plugins": [ { "name": "cce-gpu-topology-predicate" }, { "name": "cce-gpu-topology-priority" }, { "name": "cce-gpu" }, { "name": "ief", "enableBestNode": true } ] }, { "plugins": [ { "name": "nodelocalvolume" }, { "name": "nodeemptydirvolume" }, { "name": "nodeCSIscheduling" }, { "name": "networkresource" } ] } ] }, "server_cert": "", "server_key": "" }
Retaining the Original Configurations of volcano-scheduler-configmap
If you want to use the original configurations after the add-on is upgraded, perform the following steps:
- Check and back up the original volcano-scheduler-configmap configuration.
Example:
```yaml
# kubectl edit cm volcano-scheduler-configmap -n kube-system
apiVersion: v1
data:
  default-scheduler.conf: |-
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: nodeorder
      - name: binpack
        arguments:
          binpack.cpu: 100
          binpack.weight: 10
          binpack.resources: nvidia.com/gpu
          binpack.resources.nvidia.com/gpu: 10000
    - plugins:
      - name: cce-gpu-topology-predicate
      - name: cce-gpu-topology-priority
      - name: cce-gpu
    - plugins:
      - name: nodelocalvolume
      - name: nodeemptydirvolume
      - name: nodeCSIscheduling
      - name: networkresource
```
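If you prefer a file copy of the backup, you can also export the ConfigMap before the upgrade (the output file name below is only an example):

```bash
kubectl get configmap volcano-scheduler-configmap -n kube-system -o yaml > volcano-scheduler-configmap-backup.yaml
```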
- Enter the customized content in the Parameters area on the console.
{ "ca_cert": "", "default_scheduler_conf": { "actions": "enqueue, allocate, backfill", "tiers": [ { "plugins": [ { "name": "priority" }, { "name": "gang" }, { "name": "conformance" } ] }, { "plugins": [ { "name": "drf" }, { "name": "predicates" }, { "name": "nodeorder" }, { "name": "binpack", "arguments": { "binpack.cpu": 100, "binpack.weight": 10, "binpack.resources": "nvidia.com/gpu", "binpack.resources.nvidia.com/gpu": 10000 } } ] }, { "plugins": [ { "name": "cce-gpu-topology-predicate" }, { "name": "cce-gpu-topology-priority" }, { "name": "cce-gpu" } ] }, { "plugins": [ { "name": "nodelocalvolume" }, { "name": "nodeemptydirvolume" }, { "name": "nodeCSIscheduling" }, { "name": "networkresource" } ] } ] }, "server_cert": "", "server_key": "" }
After the parameters are configured, the original content in volcano-scheduler-configmap will be overwritten. Therefore, you must check whether volcano-scheduler-configmap has been modified during the upgrade. If volcano-scheduler-configmap has been modified, synchronize the modification to the upgrade page.
Change History
You are advised to upgrade Volcano to the latest version that matches the cluster.
| Cluster Version | Add-on Version |
|---|---|
| v1.25 | 1.7.1 and 1.7.2 |
| v1.23 | 1.7.1 and 1.7.2 |
| v1.21 | 1.7.1 and 1.7.2 |
| v1.19.16 | 1.3.7, 1.3.10, 1.4.5, 1.7.1, and 1.7.2 |
| v1.19 | 1.3.7, 1.3.10, and 1.4.5 |
| v1.17 (End of maintenance) | 1.3.7, 1.3.10, and 1.4.5 |
| v1.15 (End of maintenance) | 1.3.7, 1.3.10, and 1.4.5 |
| Add-on Version | Supported Cluster Version | Updated Feature |
|---|---|---|
| 1.9.1 | /v1.19.16.*\|v1.21.*\|v1.23.*\|v1.25.*/ | - |
| 1.7.2 | /v1.19.16.*\|v1.21.*\|v1.23.*\|v1.25.*/ | - |
| 1.7.1 | /v1.19.16.*\|v1.21.*\|v1.23.*\|v1.25.*/ | Supported Kubernetes 1.25. |
| 1.6.5 | /v1.19.*\|v1.21.*\|v1.23.*/ | - |
| 1.4.5 | /v1.17.*\|v1.19.*\|v1.21.*/ | - |
| 1.4.2 | /v1.15.*\|v1.17.*\|v1.19.*\|v1.21.*/ | - |
| 1.3.3 | /v1.15.*\|v1.17.*\|v1.19.*\|v1.21.*/ | - |
| 1.3.1 | /v1.15.*\|v1.17.*\|v1.19.*/ | - |
| 1.2.5 | /v1.15.*\|v1.17.*\|v1.19.*/ | - |
| 1.2.3 | /v1.15.*\|v1.17.*\|v1.19.*/ | - |