Updated: 2024-08-17 GMT+08:00

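Priority-Based Scheduling and Preemption

Priority indicates how important a job is relative to other jobs. Volcano is compatible with the Kubernetes pod priority definition (PriorityClass). With this capability enabled, the scheduler schedules high-priority services first. When cluster resources are insufficient, the scheduler proactively evicts low-priority services so that high-priority services can be scheduled.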

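Prerequisites

Introduction to Priority-Based Scheduling and Preemption

Clusters run many kinds of services, including core and non-core services and online and offline services. You can assign a priority to each service type according to its importance and SLA requirements. For example, giving core and online services a high priority ensures that they obtain cluster resources first. When non-core services occupy the cluster and overall resources run short, a newly submitted core service can preempt resources: some non-core services are evicted, and the released resources are used to schedule and run the core service.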

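Table 1 lists the priority-based scheduling types supported by CCE clusters.

Table 1 Service priority assurance scheduling

Scheduling type | Description
Priority-based scheduling | The scheduler runs high-priority services first but does not proactively evict low-priority services that are already running. This setting is enabled by default and cannot be disabled.
Priority-based preemption scheduling | When cluster resources are insufficient, the scheduler proactively evicts low-priority services so that high-priority services can be scheduled.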

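Configuring Priority-Based Scheduling and Preemption Policies

After Volcano is installed, you can enable or disable priority-based preemption scheduling on the "Configuration Center > Scheduling Configuration" page.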

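  1. Log in to the CCE console.
  2. Click the cluster name to open the cluster. In the navigation pane on the left, choose "Configuration Center", then click the "Scheduling Configuration" tab on the right.
  3. In the "Service priority assurance scheduling" area, configure priority-based scheduling.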

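    • Priority-based scheduling: The scheduler runs high-priority services first but does not proactively evict low-priority services that are already running. This setting is enabled by default and cannot be disabled.
    • Priority-based preemption scheduling: Supported when the Volcano scheduler is set as the default cluster scheduler. When cluster resources are insufficient, the scheduler proactively evicts low-priority services so that high-priority services can be scheduled.
      • When priority-based preemption scheduling is enabled, delayed pod creation cannot be used.
      • Priority-based preemption does not currently support preemption of eni/sub-eni custom resources or hostPort ports.
    Figure 1 Service priority assurance scheduling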

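  4. After making the changes, click "Confirm Configuration".
  5. After the configuration is complete, you can use a priority definition (PriorityClass) in a workload or Volcano Job for priority-based scheduling.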

    1. Create one or more priority definitions (PriorityClass).
      apiVersion: scheduling.k8s.io/v1
      kind: PriorityClass
      metadata:
        name: high-priority
      value: 1000000
      globalDefault: false
      description: ""
    2. Create a workload or Volcano Job and specify priorityClassName.
      • Workload
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: high-test
          labels:
            app: high-test
        spec:
          replicas: 5
          selector:
            matchLabels:
              app: test
          template:
            metadata:
              labels:
                app: test
            spec:
              priorityClassName: high-priority
              schedulerName: volcano
              containers:
              - name: test
                image: busybox
                imagePullPolicy: IfNotPresent
                command: ['sh', '-c', 'echo "Hello, Kubernetes!" && sleep 3600']
                resources:
                  requests:
                    cpu: 500m
                  limits:
                    cpu: 500m
      • Volcano Job
        apiVersion: batch.volcano.sh/v1alpha1
        kind: Job
        metadata:
          name: vcjob
        spec:
          schedulerName: volcano
          minAvailable: 4
          priorityClassName: high-priority
          tasks:
            - replicas: 4
              name: "test"
              template:
                spec:
                  containers:
                    - image: alpine
                      command: ["/bin/sh", "-c", "sleep 1000"]
                      imagePullPolicy: IfNotPresent
                      name: running
                      resources:
                        requests:
                          cpu: "1"
                  restartPolicy: OnFailure
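
When several priority levels are needed, the PriorityClass manifests can also be generated rather than written by hand. The following is a minimal sketch, not part of CCE or Volcano: the helper names `priority_class` and `priority_class_list` and the level values are illustrative assumptions. It emits JSON, which kubectl apply -f accepts in the same way as YAML, and uses only the fields shown in the PriorityClass example above.

```python
import json


def priority_class(name, value, description=""):
    """Return a PriorityClass manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": value,
        "globalDefault": False,
        "description": description,
    }


def priority_class_list(levels):
    """Wrap one manifest per level in a v1 List so a single file holds them all."""
    return {
        "apiVersion": "v1",
        "kind": "List",
        "items": [priority_class(name, value) for name, value in levels.items()],
    }


if __name__ == "__main__":
    # Illustrative levels; choose values that suit your own services.
    levels = {"high-priority": 100, "med-priority": 50, "low-priority": 10}
    print(json.dumps(priority_class_list(levels), indent=2))
```

Redirect the output to a file (for example, a hypothetical priority.json) and apply it with kubectl apply -f priority.json.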

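Priority-Based Scheduling Example

Assume the cluster has two idle nodes and workloads at three priority levels: high-priority, med-priority, and low-priority. First run the high-priority workload so that it uses all cluster resources, then submit the med-priority and low-priority workloads. Because the higher-priority workload holds all cluster resources, the med-priority and low-priority workloads stay in the Pending state. When the high-priority workload finishes, the med-priority workload is scheduled first, in line with priority-based scheduling.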

  1. Use priority.yaml to create three priority definitions (PriorityClass): high-priority, med-priority, and low-priority.

    The content of priority.yaml is as follows:
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: high-priority
    value: 100
    globalDefault: false
    description: "This priority class should be used for volcano job only."
    ---
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: med-priority
    value: 50
    globalDefault: false
    description: "This priority class should be used for volcano job only."
    ---
    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: low-priority
    value: 10
    globalDefault: false
    description: "This priority class should be used for volcano job only."
    Create the PriorityClass objects:
    kubectl apply -f priority.yaml

  2. View the priority definitions.

    kubectl get PriorityClass
    Command output:
    NAME                      VALUE        GLOBAL-DEFAULT   AGE
    high-priority             100          false            97s
    low-priority              10           false            97s
    med-priority              50           false            97s
    system-cluster-critical   2000000000   false            6d6h
    system-node-critical      2000001000   false            6d6h

  3. Create the high-priority workload (Volcano Job priority-high) so that it uses all cluster resources.

    high_priority_job.yaml

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: priority-high
    spec:
      schedulerName: volcano
      minAvailable: 4
      priorityClassName: high-priority
      tasks:
        - replicas: 4
          name: "test"
          template:
            spec:
              containers:
                - image: alpine
                  command: ["/bin/sh", "-c", "sleep 1000"]
                  imagePullPolicy: IfNotPresent
                  name: running
                  resources:
                    requests:
                      cpu: "1"
              restartPolicy: OnFailure

    Run the following command to submit the job:

    kubectl apply -f high_priority_job.yaml

    Run kubectl get pod to view the pods:

    NAME                   READY   STATUS    RESTARTS   AGE
    priority-high-test-0   1/1     Running   0          3s
    priority-high-test-1   1/1     Running   0          3s
    priority-high-test-2   1/1     Running   0          3s
    priority-high-test-3   1/1     Running   0          3s

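    All node resources in the cluster are now in use.

  4. Create the medium-priority workload (Volcano Job priority-medium) and the low-priority workload (Volcano Job priority-low).

    med_priority_job.yaml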

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: priority-medium
    spec:
      schedulerName: volcano
      minAvailable: 4
      priorityClassName: med-priority
      tasks:
        - replicas: 4
          name: "test"
          template:
            spec:
              containers:
                - image: alpine
                  command: ["/bin/sh", "-c", "sleep 1000"]
                  imagePullPolicy: IfNotPresent
                  name: running
                  resources:
                    requests:
                      cpu: "1"
              restartPolicy: OnFailure

    low_priority_job.yaml

    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: priority-low
    spec:
      schedulerName: volcano
      minAvailable: 4
      priorityClassName: low-priority
      tasks:
        - replicas: 4
          name: "test"
          template:
            spec:
              containers:
                - image: alpine
                  command: ["/bin/sh", "-c", "sleep 1000"]
                  imagePullPolicy: IfNotPresent
                  name: running
                  resources:
                    requests:
                      cpu: "1"
              restartPolicy: OnFailure

    Run the following commands to submit the jobs:

    kubectl apply -f med_priority_job.yaml
    kubectl apply -f low_priority_job.yaml

    Run kubectl get pod to view the pods. Because cluster resources are insufficient, the new pods are in the Pending state:

    NAME                     READY   STATUS    RESTARTS   AGE
    priority-high-test-0     1/1     Running   0          3m29s
    priority-high-test-1     1/1     Running   0          3m29s
    priority-high-test-2     1/1     Running   0          3m29s
    priority-high-test-3     1/1     Running   0          3m29s
    priority-low-test-0      0/1     Pending   0          2m26s
    priority-low-test-1      0/1     Pending   0          2m26s
    priority-low-test-2      0/1     Pending   0          2m26s
    priority-low-test-3      0/1     Pending   0          2m26s
    priority-medium-test-0   0/1     Pending   0          2m36s
    priority-medium-test-1   0/1     Pending   0          2m36s
    priority-medium-test-2   0/1     Pending   0          2m36s
    priority-medium-test-3   0/1     Pending   0          2m36s

  5. Delete the high_priority_job workload to release cluster resources. med_priority_job is scheduled first.

    Run kubectl delete -f high_priority_job.yaml to release cluster resources, then view the pod scheduling status:

    NAME                     READY   STATUS    RESTARTS   AGE
    priority-low-test-0      0/1     Pending   0          5m18s
    priority-low-test-1      0/1     Pending   0          5m18s
    priority-low-test-2      0/1     Pending   0          5m18s
    priority-low-test-3      0/1     Pending   0          5m18s
    priority-medium-test-0   1/1     Running   0          5m28s
    priority-medium-test-1   1/1     Running   0          5m28s
    priority-medium-test-2   1/1     Running   0          5m28s
    priority-medium-test-3   1/1     Running   0          5m28s
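
The ordering seen in this example can be reproduced with a toy model. The sketch below is an illustration under stated assumptions, not Volcano's implementation: it assumes the cluster offers 4 allocatable CPUs in total, that each job needs all of its pods at once (as with minAvailable equal to replicas), and that jobs are admitted strictly in descending PriorityClass value without evicting anything. The names `Job` and `schedule_pending` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int        # PriorityClass value
    pods: int            # replicas; all are required before the job starts
    cpu_per_pod: int = 1


def schedule_pending(pending, free_cpu):
    """Admit pending jobs in descending priority order; never evict anything.

    A job is admitted only if all of its pods fit into the free CPU.
    Returns (names of admitted jobs, remaining free CPU).
    """
    admitted = []
    for job in sorted(pending, key=lambda j: j.priority, reverse=True):
        need = job.pods * job.cpu_per_pod
        if need <= free_cpu:
            admitted.append(job.name)
            free_cpu -= need
    return admitted, free_cpu


if __name__ == "__main__":
    med = Job("priority-medium", 50, 4)
    low = Job("priority-low", 10, 4)
    # While priority-high holds all 4 CPUs, nothing can be admitted.
    print(schedule_pending([low, med], free_cpu=0))   # ([], 0)
    # After priority-high is deleted, 4 CPUs are free and the medium job wins.
    print(schedule_pending([low, med], free_cpu=4))   # (['priority-medium'], 0)
```

The submission order of the pending jobs does not matter in this model; only the priority value decides which job receives the released resources.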

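Priority-Based Preemption Scheduling Example

  1. Log in to the CCE console and go to the "Configuration Center > Scheduling Configuration" page.
  2. Change the following settings and confirm.

    1. Default cluster scheduler: select "Volcano scheduler".
    2. Service priority assurance scheduling: enable "Priority-based preemption scheduling".

  3. Continuing from the priority-based scheduling example above, submit the high_priority_job workload again. The scheduler evicts the med_priority_job workload so that high_priority_job can be scheduled.

    Run kubectl apply -f high_priority_job.yaml. After the job is submitted, view the pod status: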

    NAME                     READY   STATUS        RESTARTS   AGE
    priority-high-test-0     0/1     Pending       0          2s
    priority-high-test-1     0/1     Pending       0          2s
    priority-high-test-2     0/1     Pending       0          2s
    priority-high-test-3     0/1     Pending       0          2s
    priority-low-test-0      0/1     Pending       0          14s
    priority-low-test-1      0/1     Pending       0          14s
    priority-low-test-2      0/1     Pending       0          14s
    priority-low-test-3      0/1     Pending       0          14s
    priority-medium-test-0   1/1     Terminating   0          21s
    priority-medium-test-1   1/1     Terminating   0          21s
    priority-medium-test-2   1/1     Terminating   0          21s
    priority-medium-test-3   1/1     Terminating   0          21s

    After the med_priority_job resources are released, high_priority_job is scheduled successfully:

    NAME                     READY   STATUS    RESTARTS   AGE
    priority-high-test-0     1/1     Running   0          70s
    priority-high-test-1     1/1     Running   0          70s
    priority-high-test-2     1/1     Running   0          70s
    priority-high-test-3     1/1     Running   0          70s
    priority-low-test-0      0/1     Pending   0          82s
    priority-low-test-1      0/1     Pending   0          82s
    priority-low-test-2      0/1     Pending   0          82s
    priority-low-test-3      0/1     Pending   0          82s
    priority-medium-test-0   0/1     Pending   0          37s
    priority-medium-test-1   0/1     Pending   0          36s
    priority-medium-test-2   0/1     Pending   0          37s
    priority-medium-test-3   0/1     Pending   0          37s

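    When node resources cannot satisfy high_priority_job, the priority-based preemption mechanism of volcano-scheduler takes effect: it evicts med_priority_job and deploys high_priority_job on the nodes. After Cluster Autoscaler adds new nodes, volcano-scheduler schedules med_priority_job onto them.

    As these results show, when you enable priority-based preemption scheduling, you are advised to also enable node auto scaling so that cluster resources are supplied on demand and application SLAs are met.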
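
The eviction decision in this example can be sketched as a simple victim-selection routine. This is a simplified model under stated assumptions, not the algorithm used by volcano-scheduler: it considers CPU only, treats each job as a single unit, and evicts the lowest-priority running jobs first. The names `Job` and `preempt` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int   # PriorityClass value
    cpu: int        # total CPU requested by all pods of the job


def preempt(incoming, running, capacity):
    """Pick victims so that `incoming` fits, evicting lowest priority first.

    Only jobs with a strictly lower priority than `incoming` may be evicted.
    Returns the list of victim names (empty if no eviction is needed),
    or None if eviction cannot free enough CPU.
    """
    free = capacity - sum(job.cpu for job in running)
    victims = []
    for job in sorted(running, key=lambda j: j.priority):
        if free >= incoming.cpu:
            break  # enough room already
        if job.priority >= incoming.priority:
            break  # every remaining job has equal or higher priority
        victims.append(job.name)
        free += job.cpu
    return victims if free >= incoming.cpu else None


if __name__ == "__main__":
    high = Job("priority-high", 100, 4)
    med = Job("priority-medium", 50, 4)
    low = Job("priority-low", 10, 4)
    print(preempt(high, [med], capacity=4))  # ['priority-medium']
    print(preempt(low, [med], capacity=4))   # None
```

In this model the low-priority job can never displace the medium-priority job, which matches the output above, where the priority-low pods stay Pending throughout.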

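Affinity and Anti-Affinity Examples for Priority-Based Preemption Scheduling

In inter-pod affinity scenarios, do not make a pod affine to pods of lower priority. If a pending pod has inter-pod affinity with one or more lower-priority pods on a node, preempting those lower-priority pods breaks the affinity rule, so the preemption rule and the affinity rule conflict. In this case, the scheduler cannot guarantee that the pending pod is scheduled. The recommended solution is to set inter-pod affinity only toward pods of equal or higher priority. For details, see "Inter-Pod affinity on lower-priority Pods".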
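
The conflict described above can be made concrete with a small check. The sketch below is a deliberately simplified model, assuming a single required affinity term that is satisfied when at least one matching pod remains on the node. The function name `affinity_holds_after_preemption` and the pod names are hypothetical.

```python
def affinity_holds_after_preemption(affinity_targets, node_pods, victims):
    """Return True if at least one affinity target stays on the node after eviction."""
    survivors = set(node_pods) - set(victims)
    return bool(survivors & set(affinity_targets))


if __name__ == "__main__":
    node_pods = {"low-a", "low-b"}
    # The pending pod is affine to low-a, but low-a is the pod that must be evicted:
    # the affinity rule can no longer be met on this node.
    print(affinity_holds_after_preemption({"low-a"}, node_pods, {"low-a"}))  # False
    # Evicting an unrelated lower-priority pod keeps the affinity rule intact.
    print(affinity_holds_after_preemption({"low-a"}, node_pods, {"low-b"}))  # True
```

Setting affinity only toward pods of equal or higher priority avoids the first case, because such pods are never chosen as preemption victims for the pending pod.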

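In an inter-pod affinity scenario with priority-based preemption enabled, when deploy1 is affine to the lower-priority deploy2, volcano-scheduler evicts deploy3 and schedules deploy1 onto the node, to preserve service self-operation. The evicted deploy3 is scheduled onto a new node once that node is ready.

Figure 2 Affinity with lower-priority pods

In an inter-pod anti-affinity scenario with priority-based preemption enabled, when deploy1 is anti-affine to deploy2 and deploy3, volcano-scheduler does not evict deploy2 or deploy3, to reduce the impact on other services. Instead, it schedules deploy1 onto a new node once that node is ready.

Figure 3 Anti-affinity with lower-priority pods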