GPU Scheduling

You can use GPUs in CCE containers.

Prerequisites

A GPU node has been ready for use. For details, see Buying a Node.
The gpu-beta add-on has been installed. During the installation, select the GPU driver on the node. For details, see gpu-beta.

Using GPUs

Create a workload and request GPUs. You can specify the number of GPUs as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      containers:
      - image: nginx:perl
        name: container-0
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            nvidia.com/gpu: 1   # Number of requested GPUs
          limits:
            cpu: 250m
            memory: 512Mi
            nvidia.com/gpu: 1   # Maximum number of GPUs that can be used
      imagePullSecrets:
      - name: default-secret

nvidia.com/gpu specifies the number of GPUs to be requested. The value can be smaller than 1. For example, nvidia.com/gpu: 0.5 indicates that multiple pods share a GPU.

After nvidia.com/gpu is specified, workloads will not be scheduled to nodes without GPUs. If GPUs are insufficient, a Kubernetes event similar to "0/2 nodes are available: 2 Insufficient nvidia.com/gpu." will be reported.

To use GPUs on the CCE console, select the GPU quota and specify the percentage of GPUs reserved for the container when creating a workload.

Figure 1 Using GPUs
Click to enlarge

GPU Node Labels

CCE will label GPU-enabled nodes that are ready to use. Different types of GPU-enabled nodes have different labels.

Figure 2 GPU node labels
Click to enlarge

When using GPUs, you can enable the affinity between pods and nodes based on labels so that the pods can be scheduled to the correct nodes.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-test
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-test
  template:
    metadata:
      labels:
        app: gpu-test
    spec:
      nodeSelector:
        accelerator: nvidia-t4
      containers:
      - image: nginx:perl
        name: container-0
        resources:
          requests:
            cpu: 250m
            memory: 512Mi
            nvidia.com/gpu: 1   # Number of requested GPUs
          limits:
            cpu: 250m
            memory: 512Mi
            nvidia.com/gpu: 1   # Maximum number of GPUs that can be used
      imagePullSecrets:
      - name: default-secret

Parent Topic: Workloads

Previous topic: Managing Pods

Next topic: NPU Scheduling