Volcano Scheduler

Add-on Overview

Volcano is a batch scheduling platform based on Kubernetes. It provides a series of features required by machine learning, deep learning, bioinformatics, genomics, and other big data applications, as a powerful supplement to Kubernetes capabilities.

Add-on Parameters

**Table 1** Parameters
Parameter	Mandatory	Type	Description
basic	No	Table 2 object	Basic configuration parameters, which do not need to be specified
flavor	Yes	Table 3 object	Flavor parameters
custom	Yes	Table 4 object	Custom parameters

**Table 2** Configuration of basic
Parameter	Mandatory	Type	Description
swr_addr	Yes	String	Add-on download address, which does not need to be specified
swr_user	Yes	String	User who can download the add-on. This parameter does not need to be specified.
platform	Yes	String	Add-on platform, which does not need to be specified
ecsEndpoint	Yes	String	ECS address, which does not need to be specified
xccsEndpoint	Yes for add-on versions ≥ 1.16.11; No for add-on versions < 1.16.11	String	XCCS service address, which does not need to be specified

**Table 3** Configuration of flavor
Parameter	Mandatory	Type	Description
description	No	String	Add-on description
name	Yes	String	Add-on specification name. Add-on versions ≥ 1.14.7: Node50, Node200, Node1000, or custom-resources. Add-on versions earlier than 1.14.7: HA, Single, or custom-resources.
replicas	Yes	String	Number of pods. The default value is 2.
resources	Yes	Array of Table 5 objects	Container resource (CPU and memory) quotas, one entry per add-on component
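
The version-dependent values of name can be checked before a request is sent. The following is a minimal sketch, not part of the add-on API: the helper names parse_version, allowed_flavor_names, and check_flavor_name are hypothetical, and the sketch assumes that add-on versions are plain dotted integers such as 1.16.8.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted version such as '1.16.8' into a comparable tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def allowed_flavor_names(addon_version: str) -> list:
    """Return the flavor names listed in Table 3 for the given add-on version."""
    if parse_version(addon_version) >= parse_version("1.14.7"):
        return ["Node50", "Node200", "Node1000", "custom-resources"]
    return ["HA", "Single", "custom-resources"]


def check_flavor_name(addon_version: str, name: str) -> bool:
    """Report whether name is a listed flavor for this add-on version."""
    return name in allowed_flavor_names(addon_version)


print(check_flavor_name("1.16.8", "Node50"))
print(check_flavor_name("1.16.8", "HA"))
```

Comparing tuples of integers rather than raw strings keeps 1.14.10 ordered after 1.14.7.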

**Table 4** Configuration of custom
Parameter	Mandatory	Type	Description
multiAZEnabled	No	Bool	Whether to enable multi-AZ deployment for the add-on. true: Volcano pods are deployed in different AZs based on the hard anti-affinity policy. false: Volcano pods are deployed in multiple AZs based on the soft anti-affinity policy. The default value is false.
controller_kube_api_qps	No	int	API server QPS of the controller component. The default value is 200.
scheduler_kube_api_qps	No	int	API server QPS of the scheduler component. The default value is 200.
admission_kube_api_qps	No	int	API server QPS of the admission component. The default value is 200.
update_pod_status_qps	No	int	QPS for updating the pod status. The default value is 200.
admissions	No	string	Webhooks supported by Volcano
colocation_enable	No	string	Whether to enable hybrid deployment (colocation)
oversubscription_ratio	No	int	Dynamic oversubscription (overcommitment) ratio of node resources in the Volcano scheduling environment. The default value is 60.
oversubscription_method	No	string	Method of calculating oversubscribed resources. The options are nodeResource and podProfile. nodeResource is the default algorithm based on node resource usage, and podProfile is the algorithm based on pod profiling. By default, nodeResource is used.
oversubscription_profile_period	No	int	Interval for pod profiling, in seconds
workload_balancer_third_party_types	No	string	Character string consisting of group, version, and kind of a third-party workload
workload_balancer_score_annotation_key	No	string	Used to specify the score annotation key of a pod
node_match_expressions	No	Array of Table 7 objects	Node affinity expressions used to match the Volcano Scheduler pods to nodes
tolerations	No	Array of Table 6 objects	The format is the same as that of Kubernetes tolerations. It is used to add tolerations to Volcano Scheduler pods so that they can run on nodes with matching taints.
descheduler_enable	No	Bool	Whether to enable descheduling
enable_workload_balancer	No	Bool	Whether to enable the workload balancer
default_scheduler_conf	Yes	yaml	The format is the same as that of the YAML for Volcano. For details, see Volcano Scheduler.
deschedulerPolicy	No	yaml	The format is the same as that of the YAML for Volcano descheduling configuration. For details, see Descheduling.
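
The defaults stated in Table 4 can be collected in one place and overridden selectively. The following is a minimal sketch under stated assumptions: DEFAULTS, VALID_METHODS, and build_custom are hypothetical names, and only the parameters whose defaults appear in Table 4 are included. The mandatory default_scheduler_conf has no documented default and must still be supplied by the caller.

```python
# Defaults copied from the descriptions in Table 4.
DEFAULTS = {
    "multiAZEnabled": False,
    "controller_kube_api_qps": 200,
    "scheduler_kube_api_qps": 200,
    "admission_kube_api_qps": 200,
    "update_pod_status_qps": 200,
    "oversubscription_ratio": 60,
    "oversubscription_method": "nodeResource",
}

# The two calculation methods listed for oversubscription_method.
VALID_METHODS = ("nodeResource", "podProfile")


def build_custom(overrides=None):
    """Merge caller overrides over the documented defaults and validate them."""
    custom = dict(DEFAULTS)  # copy so the defaults are never mutated
    custom.update(overrides or {})
    if custom["oversubscription_method"] not in VALID_METHODS:
        raise ValueError(
            "oversubscription_method must be one of " + ", ".join(VALID_METHODS)
        )
    return custom


custom = build_custom({"update_pod_status_qps": 50})
print(custom["update_pod_status_qps"], custom["oversubscription_method"])
```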

**Table 5** Data structure of the resources field
Parameter	Mandatory	Type	Description
limitsCpu	Yes	String	CPU size limit (unit: m). The default values are differentiated by component. For details about key components, see Volcano Scheduler.
limitsMem	Yes	String	Memory size limit (unit: Mi). The default values are differentiated by component. For details about key components, see Volcano Scheduler.
name	Yes	String	Name of the add-on component, for example, volcano-scheduler
requestsCpu	Yes	String	Requested CPU size (unit: m). The default values are differentiated by component. For details about key components, see Volcano Scheduler.
requestsMem	Yes	String	Requested memory size (unit: Mi). The default values are differentiated by component. For details about key components, see Volcano Scheduler.
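
Because the quotas in Table 5 are strings with a unit suffix, a client may want to check that each request does not exceed its limit before submitting. The following is a minimal sketch: parse_cpu_millicores, parse_mem_mib, and requests_within_limits are hypothetical helpers, and the sketch assumes that CPU values always end in m and memory values always end in Mi, as in the units given above.

```python
def parse_cpu_millicores(value: str) -> int:
    """Parse a CPU quantity such as '2000m' into millicores."""
    if not value.endswith("m"):
        raise ValueError("expected a CPU value in millicores, got " + value)
    return int(value[:-1])


def parse_mem_mib(value: str) -> int:
    """Parse a memory quantity such as '500Mi' into MiB."""
    if not value.endswith("Mi"):
        raise ValueError("expected a memory value in Mi, got " + value)
    return int(value[:-2])


def requests_within_limits(resource: dict) -> bool:
    """Check that requested CPU and memory do not exceed the limits."""
    cpu_ok = parse_cpu_millicores(resource["requestsCpu"]) <= parse_cpu_millicores(
        resource["limitsCpu"]
    )
    mem_ok = parse_mem_mib(resource["requestsMem"]) <= parse_mem_mib(
        resource["limitsMem"]
    )
    return cpu_ok and mem_ok


scheduler = {
    "name": "volcano-scheduler",
    "limitsCpu": "2000m",
    "requestsCpu": "500m",
    "limitsMem": "2000Mi",
    "requestsMem": "500Mi",
}
print(requests_within_limits(scheduler))
```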

**Table 6** Taints and tolerations
Parameter	Mandatory	Type	Description
key	No	String	Taint key
effect	No	String	Taint effect
operator	No	String	Operator used to match the taint, for example, Exists
tolerationSeconds	No	Int	Toleration time window, in seconds
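
Entries for the tolerations list can be assembled from the four fields in Table 6. The following is a minimal sketch: make_toleration is a hypothetical helper, and defaulting operator to Exists is an assumption made for illustration, not a documented default.

```python
from typing import Optional


def make_toleration(
    key: str,
    effect: str,
    operator: str = "Exists",  # assumed default for this sketch only
    toleration_seconds: Optional[int] = None,
) -> dict:
    """Build one tolerations entry, omitting tolerationSeconds when unset."""
    toleration = {"key": key, "effect": effect, "operator": operator}
    if toleration_seconds is not None:
        toleration["tolerationSeconds"] = toleration_seconds
    return toleration


tolerations = [
    make_toleration("node.kubernetes.io/not-ready", "NoExecute", toleration_seconds=60),
    make_toleration("node.cilium.io/agent-not-ready", "NoSchedule"),
]
print(tolerations)
```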

**Table 7** nodeMatchExpression node affinity
Parameter	Mandatory	Type	Description
key	No	String	Node label key
values	No	List<String>	Node label values to match
operator	No	String	Operator

Example Request

{
	"kind": "Addon",
	"apiVersion": "v3",
	"metadata": {
		"annotations": {
			"addon.install/type": "install"
		}
	},
	"spec": {
		"clusterID": "ad24dc34-******-0255ac100030",
		"version": "1.16.8",
		"addonTemplateName": "volcano",
		"values": {
			"basic": {
				"ecsEndpoint": "x.x.x.x",
				"platform": "linux-amd64",
				"swr_addr": "swr.cn-north-7.myhuaweicloud.com",
				"swr_user": "hwofficial"
			},
			"flavor": {
				"description": "For 50 nodes, 5000 pods in cluster",
				"name": "Node50",
				"resources": [
					{
						"name": "volcano-scheduler",
						"limitsCpu": "2000m",
						"requestsCpu": "500m",
						"replicas": 2,
						"limitsMem": "2000Mi",
						"requestsMem": "500Mi"
					},
					{
						"name": "volcano-controller",
						"limitsCpu": "2000m",
						"requestsCpu": "500m",
						"replicas": 2,
						"limitsMem": "2000Mi",
						"requestsMem": "500Mi"
					},
					{
						"name": "volcano-admission",
						"limitsCpu": "500m",
						"requestsCpu": "200m",
						"replicas": 2,
						"limitsMem": "500Mi",
						"requestsMem": "500Mi"
					},
					{
						"limitsCpu": "200m",
						"limitsMem": "200Mi",
						"name": "volcano-agent",
						"requestsCpu": "100m",
						"requestsMem": "150Mi"
					},
					{
						"limitsCpu": "100m",
						"limitsMem": "100Mi",
						"name": "resource-exporter",
						"requestsCpu": "50m",
						"requestsMem": "50Mi"
					},
					{
						"limitsCpu": "1000m",
						"limitsMem": "512Mi",
						"name": "volcano-descheduler",
						"replicas": 2,
						"requestsCpu": "500m",
						"requestsMem": "256Mi"
					},
					{
						"limitsCpu": "500m",
						"limitsMem": "1000Mi",
						"name": "volcano-recommender",
						"replicas": 2,
						"requestsCpu": "300m",
						"requestsMem": "500Mi"
					},
					{
						"limitsCpu": "300m",
						"limitsMem": "300Mi",
						"name": "volcano-recommender-prometheus-adapter",
						"replicas": 2,
						"requestsCpu": "200m",
						"requestsMem": "200Mi"
					}
				],
				"size": "small",
				"category": [
					"CCE",
					"Turbo"
				]
			},
			"custom": {
				"admission_kube_api_qps": 200,
				"admissions": "/jobs/mutate,/jobs/validate,/podgroups/mutate,/pods/validate,/pods/mutate,/queues/mutate,/queues/validate,/eas/pods/mutate,/eas/pods/validate,/npu/jobs/validate,/resource/validate,/resource/mutate,/workloadbalancer/balancer/validate,/workloadbalancer/balancerpolicytemplate/validate",
				"colocation_enable": "false",
				"controller_kube_api_qps": 200,
				"default_scheduler_conf": {
					"actions": "allocate, backfill, preempt",
					"metrics": {
						"interval": "30s",
						"type": ""
					},
					"tiers": [
						{
							"plugins": [
								{
									"name": "priority"
								},
								{
									"enableJobStarving": false,
									"enablePreemptable": false,
									"name": "gang"
								},
								{
									"name": "conformance"
								}
							]
						},
						{
							"plugins": [
								{
									"enablePreemptable": false,
									"name": "drf"
								},
								{
									"name": "predicates"
								},
								{
									"name": "nodeorder"
								}
							]
						},
						{
							"plugins": [
								{
									"name": "cce-gpu-topology-predicate"
								},
								{
									"name": "cce-gpu-topology-priority"
								},
								{
									"name": "xgpu"
								}
							]
						},
						{
							"plugins": [
								{
									"name": "nodelocalvolume"
								},
								{
									"name": "nodeemptydirvolume"
								},
								{
									"name": "nodeCSIscheduling"
								},
								{
									"name": "networkresource"
								}
							]
						}
					]
				},
				"deschedulerPolicy": {
					"profiles": [
						{
							"name": "ProfileName",
							"pluginConfig": [
								{
									"args": {
										"nodeFit": true
									},
									"name": "DefaultEvictor"
								},
								{
									"args": {
										"evictableNamespaces": {
											"exclude": [
												"kube-system"
											]
										},
										"thresholds": {
											"cpu": 20,
											"memory": 20
										}
									},
									"name": "HighNodeUtilization"
								},
								{
									"args": {
										"evictableNamespaces": {
											"exclude": [
												"kube-system"
											]
										},
										"metrics": {
											"type": "prometheus_adaptor"
										},
										"nodeFit": true,
										"targetThresholds": {
											"cpu": 80,
											"memory": 85
										},
										"thresholds": {
											"cpu": 30,
											"memory": 30
										}
									},
									"name": "LoadAware"
								}
							],
							"plugins": {
								"balance": {
									"enabled": null
								}
							}
						}
					]
				},
				"descheduler_enable": "false",
				"deschedulingInterval": "10m",
				"enable_workload_balancer": false,
				"multiAZEnabled": false,
				"node_match_expressions": [],
				"oversubscription_method": "nodeResource",
				"oversubscription_profile_period": 300,
				"oversubscription_ratio": 60,
				"scheduler_kube_api_qps": 200,
				"tolerations": [
					{
						"effect": "NoExecute",
						"key": "node.kubernetes.io/not-ready",
						"operator": "Exists",
						"tolerationSeconds": 60
					},
					{
						"effect": "NoExecute",
						"key": "node.kubernetes.io/unreachable",
						"operator": "Exists",
						"tolerationSeconds": 60
					},
					{
						"effect": "NoSchedule",
						"key": "node.cilium.io/agent-not-ready",
						"operator": "Exists"
					}
				],
				"update_pod_status_qps": 50,
				"workload_balancer_score_annotation_key": "",
				"workload_balancer_third_party_types": "",
				"multiAZBalance": false
			}
		}
	}
}
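
The outer structure of the request body above can also be generated programmatically. The following is a minimal sketch: build_install_request is a hypothetical helper, the cluster ID and the contents of values are placeholders to be replaced with real settings, and the call that sends the body to the API is intentionally omitted.

```python
import json


def build_install_request(cluster_id: str, version: str, values: dict) -> dict:
    """Wrap add-on values in the Addon envelope used by the example request."""
    return {
        "kind": "Addon",
        "apiVersion": "v3",
        "metadata": {"annotations": {"addon.install/type": "install"}},
        "spec": {
            "clusterID": cluster_id,
            "version": version,
            "addonTemplateName": "volcano",
            "values": values,
        },
    }


# Placeholder values only; a real request needs the full flavor and custom objects.
body = build_install_request(
    "<cluster-id>",
    "1.16.8",
    {"flavor": {"name": "Node50"}, "custom": {}},
)
print(json.dumps(body, indent=2))
```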
