Updated on 2025-08-19 GMT+08:00

NPU Topology-aware Affinity Scheduling on a Single Node

NPU topology-aware affinity scheduling on a single node is an intelligent resource management technology based on the hardware topology of Ascend AI processors. This technology optimizes resource allocation and network path selection, reduces compute resource fragments and network congestion, and maximizes NPU compute utilization. It can significantly improve the execution efficiency of AI training and inference jobs and implement efficient scheduling and management of Ascend compute resources.

In a CCE standard or Turbo cluster, NPU topology-aware affinity scheduling on a single node requires the cooperation of multiple components. The process is detailed as follows:

  1. huawei-npu-device-plugin in the CCE AI Suite (Ascend NPU) add-on queries and reports the topology of NPUs on each node. (A quick way to check the reported NPU resources is shown after Figure 1.)
  2. After a job is submitted, Volcano Scheduler selects the optimal node and NPU allocation solution based on the job requirements and NPU topology.
  3. After the pods are scheduled onto the node, kubelet allocates NPUs based on the allocation solution.
    Figure 1 How NPU topology-aware affinity scheduling is implemented
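A quick way to confirm that the device plugin is reporting NPU resources is to inspect a node's extended resources. The following is a minimal sketch: replace <node-name> with an actual node name, and note that the resource name depends on the hardware model (for example, huawei.com/ascend-310 on the Snt3P IDUO2 node used in the Use Case).

    # List the NPU extended resources in the node's Capacity and Allocatable sections.
    kubectl describe node <node-name> | grep -i ascend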

Principles of NPU Topology-aware Affinity Scheduling

Due to the differences in the interconnection architectures, the type of NPU topology-aware affinity scheduling varies depending on the hardware models. However, all types of scheduling comply with the following principles:

  • Primary principle: Ensure high-speed network channels.
  • Secondary principle: Minimize resource fragments.

The interconnection technologies referenced in Table 1 are as follows:

  • Huawei Cache Coherent System (HCCS) is a high-speed bus used for interconnection between CPUs and NPUs.
  • Peripheral Component Interconnect Express (PCIe) is a high-speed serial computer expansion bus standard for connecting a computer's motherboard with peripherals.
  • Serial Input/Output (SIO) is a two-wire serial link (one data line and one clock line) used for master-slave communication between two devices.
Table 1 NPU topology-aware affinity scheduling on a single node

Snt9A

  • Interconnection mode: There are eight NPUs (NPUs 0 to 7) on a node. NPUs 0 to 3 form one HCCS ring, and NPUs 4 to 7 form another. Each HCCS ring is a small-network group, and the two groups are interconnected through PCIe. The network transmission speed within a group is higher than that between groups.
  • Affinity policy type: Only small-network group affinity scheduling is supported. The details are as follows:
    • When a pod requests four or fewer NPUs, all NPUs allocated to the pod must be in the same group; otherwise, the job fails to run. If multiple nodes meet the scheduling conditions, the system scores each node based on the NPU usage within the group, the overall node NPU usage, and the priority weight, and selects a node based on the scores. A higher weight means that a node with a higher NPU usage within the group is more likely to be selected.
    • When a pod requests eight NPUs, a node with eight available NPUs is selected.
  • Constraints:
    • A pod can request 1, 2, 4, or 8 NPUs.
    • The Volcano Scheduler add-on version must be v1.6.4 or later.

Snt9B

  • Interconnection mode: There are eight NPUs on a node. The NPUs use a star topology and are interconnected through HCCS.
  • Affinity policy type: Only basic affinity scheduling is supported. The node with higher resource usage is preferentially selected.
  • Constraints:
    • A pod can request 1 to 8 NPUs.
    • The Volcano Scheduler add-on version must be v1.15.8 or later.

Snt9C

  • Interconnection mode: There are eight training cards on a node. Each training card forms a die and has two NPUs. NPUs on a training card are interconnected through SIO, and training cards are interconnected through HCCS. The network transmission speed within a die is higher than that between dies.
  • Affinity policy type: Only die affinity scheduling is supported (a sample request fragment is provided after this table). There are two modes: hard affinity and no affinity.
    • Hard affinity (default configuration):
      • If only one NPU is requested but multiple nodes meet the scheduling conditions, the system scores each node based on the NPU usage within the die, the overall node NPU usage, and the priority weight, and selects a node based on the scores. A higher weight means that a node with a higher NPU usage within the die is more likely to be selected.
      • If 2, 4, or 8 NPUs are requested, scheduling is performed by training card. If multiple nodes meet the conditions, the node with higher resource usage is preferentially selected.
      • If 16 NPUs are requested, a node with 16 available NPUs is selected.
    • No affinity: basic affinity scheduling is used. The node with higher resource usage is preferentially selected.
  • Constraints:
    • Hard affinity:
      • A pod can request 1, 2, 4, 8, or 16 NPUs.
      • The Volcano Scheduler add-on version must be v1.15.8 or later.
    • No affinity:
      • A pod can request 1 to 16 NPUs.
      • The Volcano Scheduler add-on version must be v1.15.8 or later.

Snt3P

  • Interconnection mode: There are eight NPUs on a node. The NPUs use a star topology and are interconnected through PCIe.
  • Affinity policy type: Only basic affinity scheduling is supported. The node with higher resource usage is preferentially selected.
  • Constraints:
    • A pod can request 1 to 8 NPUs.
    • The Volcano Scheduler add-on version must be v1.15.8 or later.

Snt3P IDUO2

  • Interconnection mode: There are four inference cards on a node. Each inference card forms a die and has two NPUs. NPUs on an inference card are interconnected through HCCS, and inference cards are interconnected through PCIe. The network transmission speed within a die is higher than that between dies.
  • Affinity policy type: Only die affinity scheduling is supported. There are two modes: hard affinity and no affinity.
    • Hard affinity (default configuration):
      • If only one NPU is requested but multiple nodes meet the scheduling conditions, the system scores each node based on the NPU usage within the die, the overall node NPU usage, and the priority weight, and selects a node based on the scores. A higher weight means that a node with a higher NPU usage within the die is more likely to be selected.
      • If 2 or 4 NPUs are requested, scheduling is performed by inference card. If multiple nodes meet the conditions, the node with higher resource usage is preferentially selected.
      • If 8 NPUs are requested, a node with 8 available NPUs is selected.
    • No affinity: basic affinity scheduling is used. The node with higher resource usage is preferentially selected.
  • Constraints:
    • Hard affinity:
      • A pod can request 1, 2, 4, or 8 NPUs.
      • The Volcano Scheduler add-on version must be v1.15.8 or later.
    • No affinity:
      • A pod can request 1 to 8 NPUs.
      • The Volcano Scheduler add-on version must be v1.15.8 or later.
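To make the request counts in Table 1 concrete, the following fragment sketches the resources section of a container that requests eight NPUs on an Snt9C node, so that hard die affinity schedules the pod by training card. The resource name huawei.com/ascend-1980 is the one listed for Snt9C nodes in the Use Case notes; the CPU amount is illustrative only.

    resources:
      requests:
        cpu: 4
        "huawei.com/ascend-1980": 8    # 8 is an allowed request count for Snt9C hard affinity
      limits:
        cpu: 4
        "huawei.com/ascend-1980": 8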

Prerequisites

  • A CCE standard or Turbo cluster has been created. Different types of affinity scheduling have their own requirements on cluster versions:
    • Die affinity scheduling: The cluster version must be v1.23.18, v1.25.13, v1.27.10, v1.28.8, v1.29.4, v1.30.1, or later.
    • Small-network group affinity scheduling: The cluster version must be v1.23 or later.
  • There are nodes of the corresponding type in the cluster. Snt9 nodes cannot be purchased for CCE standard or Turbo clusters. You can purchase Ascend Snt9 nodes in ModelArts in advance. After the purchase, CCE automatically accepts and manages the nodes. For details, see Creating a Standard Dedicated Resource Pool.
  • CCE AI Suite (Ascend NPU) of v2.1.23 or later has been installed. For details about how to install the add-on, see CCE AI Suite (Ascend NPU).
  • The Volcano Scheduler add-on has been installed. For details about the add-on version requirements, see Table 1. For details about how to install the add-on, see Volcano Scheduler. A quick way to check that both add-ons are running is shown after this list.
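The following command is a sketch for confirming that both add-ons are running: pod names and namespaces can differ between add-on versions, so the filter is intentionally broad.

    # Pods of the Volcano Scheduler and huawei-npu-device-plugin should be in the Running state.
    kubectl get pods -A | grep -Ei "volcano|huawei-npu"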

Notes and Constraints

In a single pod, only one container can request NPU resources, and init containers cannot request NPU resources. Otherwise, the pod cannot be scheduled.
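The following pod template is a minimal sketch of this constraint: only the worker container requests NPU resources, while the sidecar does not. The pod name, the busybox image, and the resource name huawei.com/ascend-310 (which applies to the Snt3P IDUO2 example later in this section) are illustrative assumptions.

    apiVersion: v1
    kind: Pod
    metadata:
      name: npu-request-demo            # hypothetical name for illustration
    spec:
      schedulerName: volcano
      # Init containers, if any, must not request NPU resources.
      containers:
      - name: worker                    # the only container that requests NPUs
        image: busybox
        command: ["/bin/sh", "-c", "sleep 1000000"]
        resources:
          requests:
            "huawei.com/ascend-310": 2
          limits:
            "huawei.com/ascend-310": 2
      - name: sidecar                   # requests no NPU resources
        image: busybox
        command: ["/bin/sh", "-c", "sleep 1000000"]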

Enabling NPU Topology-aware Affinity Scheduling

The parameters vary depending on the types of affinity scheduling.

Only Snt9A nodes support small-network group affinity scheduling. Configure this function as needed.

  1. Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Settings, and then click the Scheduling tab.
  2. Set Default Cluster Scheduler to Volcano. After Volcano Scheduler is enabled, enable small-network group affinity scheduling.
  3. (Optional) In Expert mode, click Try Now and set gpuaffinitytopologyaware.weight for the cce-gpu-topology-priority plugin to specify the priority weight used when scoring nodes for small-network group affinity. Before configuring the weight, ensure that the cce-gpu-topology-priority plugin is present in the configuration.

    ...
    tiers:
        - plugins:
            - name: priority
            - enableJobStarving: false
              enablePreemptable: false
              name: gang
            - name: conformance
        - plugins:
            - enablePreemptable: false
              name: drf
            - name: predicates
            - name: nodeorder
        - plugins:
            - name: cce-gpu-topology-predicate
            - name: cce-gpu-topology-priority  # If this parameter exists, the cce-gpu-topology-priority plugin has been configured. If not, add this parameter.
              arguments:   # Configure the priority weights.
                gpuaffinitytopologyaware.weight: 10
            - name: xgpu
        - plugins:
            - name: nodelocalvolume
            - name: nodeemptydirvolume
            - name: nodeCSIscheduling
            - name: networkresource
    ...
    Table 2 Parameters

    Parameter: arguments.gpuaffinitytopologyaware.weight

    Example value: 10

    Description: Priority weight. The value ranges from 0 to 2147483647. A higher weight indicates that a node with a higher NPU usage in the group is more likely to be selected, which reduces resource fragments.

  4. In the lower right corner of the tab, click Confirm Settings. In the displayed dialog box, confirm the modification and click Save.
Snt9B and Snt3P nodes support basic affinity scheduling. Configure this function as needed.

  1. Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Settings, and then click the Scheduling tab.
  2. Set Default Cluster Scheduler to Volcano. After Volcano Scheduler is enabled, enable basic affinity scheduling; no other configuration is required.
  3. In the lower right corner of the tab, click Confirm Settings. In the displayed dialog box, confirm the modification and click Save.

Only Snt9C and Snt3P IDUO2 nodes support die affinity scheduling. Configure this function as needed.

  1. Log in to the CCE console and click the cluster name to access the cluster console. In the navigation pane, choose Settings, and then click the Scheduling tab.
  2. In the Volcano Scheduler configuration, die affinity scheduling is enabled by default. You can perform the following steps to verify and adjust it.

    1. Set Default Cluster Scheduler to Volcano and click Expert mode > Try Now.
      Figure 2 Expert mode > Try Now

    2. In the YAML file, check the parameters. Die affinity topology scheduling depends on the cce-gpu-topology-predicate and cce-gpu-topology-priority plugins, which are enabled by default in the Volcano Scheduler configuration. If the following parameters do not exist, manually configure them.
      ...
      tiers:
          - plugins:
              - name: priority
              - enableJobStarving: false
                enablePreemptable: false
                name: gang
              - name: conformance
          - plugins:
              - enablePreemptable: false
                name: drf
              - name: predicates
              - name: nodeorder
          - plugins:
              - name: cce-gpu-topology-predicate
              - name: cce-gpu-topology-priority
              - name: xgpu
          - plugins:
              - name: nodelocalvolume
              - name: nodeemptydirvolume
              - name: nodeCSIscheduling
              - name: networkresource
      ...
    3. After the cce-gpu-topology-predicate plugin is configured, hard affinity is used by default. The following parameters can be configured for the cce-gpu-topology-predicate and cce-gpu-topology-priority plugins. (A fragment that sets hard affinity explicitly is provided after this procedure.)
      ...
      tiers:
          - plugins:
              - name: priority
              - enableJobStarving: false
                enablePreemptable: false
                name: gang
              - name: conformance
          - plugins:
              - enablePreemptable: false
                name: drf
              - name: predicates
              - name: nodeorder
          - plugins:
              - name: cce-gpu-topology-predicate
                arguments:   # No affinity is configured
                  npu-die-affinity: none    
              - name: cce-gpu-topology-priority
                arguments:   # Configure the priority weight value, which is only applied in hard affinity.
                  gpuaffinitytopologyaware.weight: 10
              - name: xgpu
          - plugins:
              - name: nodelocalvolume
              - name: nodeemptydirvolume
              - name: nodeCSIscheduling
              - name: networkresource
      ...
      Table 3 Parameters

      Parameter: arguments.npu-die-affinity

      Example value: none

      Description: Die affinity scheduling mode. The options are as follows:

      • required: Hard affinity is used.
      • none: No affinity is configured.

      If this parameter is not specified, hard affinity is used by default.

      Parameter: arguments.gpuaffinitytopologyaware.weight

      Example value: 10

      Description: Priority weight, which only applies to hard affinity. The value ranges from 0 to 2147483647. A higher weight indicates that a node with a higher NPU usage in the die is more likely to be selected, which reduces resource fragments.

    4. Click Save in the lower right corner.

  3. In the lower right corner of the tab, click Confirm Settings. In the displayed dialog box, confirm the modification and click Save.
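As noted in the procedure, hard affinity is used when npu-die-affinity is not set. If you want to make the mode explicit, the following fragment shows the relevant plugin entries with the required value from Table 3. It is a sketch of only the affected plugins, not the full scheduler configuration.

    - plugins:
        - name: cce-gpu-topology-predicate
          arguments:
            npu-die-affinity: required    # hard affinity (also the default when this parameter is omitted)
        - name: cce-gpu-topology-priority
          arguments:
            gpuaffinitytopologyaware.weight: 10    # only applied in hard affinity
        - name: xgpu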

Use Case

The following uses die affinity scheduling as an example. Assume that an Snt3P IDUO2 node has four inference cards, each with two NPUs, and that three NPUs remain available on two of the cards: two on one card and one on the other. Create a Volcano job with one pod that requests two NPUs to verify that both NPUs are allocated from the same inference card.

  1. Create a Volcano job for executing tasks.

    1. Create a YAML file for the Volcano job.
      vim volcano-job.yaml
      The file content is as follows (only one container in a pod can request NPU resources, and init containers cannot request NPUs; otherwise, the pod cannot be scheduled):
      apiVersion: batch.volcano.sh/v1alpha1 
      kind: Job 
      metadata: 
        name: job-test
      spec:
        maxRetry: 10000    # Maximum number of retries when the job fails
        schedulerName: volcano
        tasks: 
        - replicas: 1 
          name: worker
          maxRetry: 10000 
          template: 
            metadata: 
            spec:  
              containers:  
              - image: busybox 
                command: ["/bin/sh", "-c", "sleep 1000000"]  
                imagePullPolicy: IfNotPresent    
                name: running       
                resources:     
                  requests:    
                    cpu: 1    
                    "huawei.com/ascend-310": 2    
                  limits:     
                    cpu: 1    
                    "huawei.com/ascend-310": 2  
              restartPolicy: OnFailure
      • The parameters in resources.requests are described as follows:
        • "huawei.com/ascend-1980": indicates the number of NPUs that can be requested on Snt9C nodes. The value can be 1, 2, 4, 8, or 16.
        • "huawei.com/ascend-310": indicates the NPU resources requested on an Snt3P IDUO2 node. The value can be 1, 2, 4, or 8.
    2. Create the Volcano job.
      kubectl apply -f volcano-job.yaml

      Information similar to the following is displayed:

      job.batch.volcano.sh/job-test created
    3. Check whether the pod is successfully scheduled.
      kubectl get pod

      If the following information is displayed, the Volcano job has been executed and the pod has been scheduled:

      NAME                         READY   STATUS    RESTARTS      AGE 
      job-test-worker-0            1/1     Running   0             20s

  2. Check the NPUs allocated to the pod.

    kubectl describe pod job-test-worker-0

    On the Snt3P IDUO2 node, the NPUs are numbered 0, 1, 2, and so on by default. NPU 0 and NPU 1 are on the same inference card, NPU 2 and NPU 3 are on another inference card, and so on. The command output shows that NPU 2 and NPU 3 were allocated to the pod. Both NPUs are on the same inference card, which meets the die affinity scheduling policy. (A node-level check is shown after the output.)

    Name:             job-test-worker-0 
    Namespace:        default 
    Priority:         0
    Service Account:  default
    Node:             192.168.147.31/192.168.147.31 
    Start Time:       Mon, 09 Sep 2024 21:23:01 +0800 
    Labels:           volcano.sh/job-name=job-test
                      volcano.sh/job-namespace=default  
                      volcano.sh/queue-name=default   
                      volcano.sh/task-index=0  
                      volcano.sh/task-spec=worker 
    Annotations:      huawei.com/AscendReal: Ascend310-2,Ascend310-3    
                      huawei.com/kltDev: Ascend310-2,Ascend310-3  
                      scheduling.cce.io/gpu-topology-placement: huawei.com/ascend-1980=0x0c
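To see the node-level effect of the allocation, you can also check the NPU resources recorded on the node. This is an optional sketch; the node address 192.168.147.31 is taken from the output above, and the resource name huawei.com/ascend-310 matches the Snt3P IDUO2 example.

    # Shows the huawei.com/ascend-310 entries in the Capacity, Allocatable, and Allocated resources sections.
    kubectl describe node 192.168.147.31 | grep -i ascend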