Updated: 2025-10-22 GMT+08:00

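Preparing the Training config.yaml File

The config.yaml file is used to schedule the Kubernetes (k8s) cluster when a training job is started. Create your own file based on the template below.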

apiVersion: batch.volcano.sh/v1alpha1
kind: HyperJob
metadata: 
  name: training-test2025
  labels:
    ring-controller.cce: ascend-1980
spec:
  plugins:
    configmap1980:
      - --rank-table-version=v2
  replicatedJobs:
   - replicas: 2 # Number of vcjobs; linked to the launch command and RANK described later
     name: vcjob-test # Name of the vcjob
     template:
       minAvailable: 1 # Minimum number available within the vcjob; for AI jobs this equals the number of replicas in each vcjob
       tasks:
        - replicas: 1 # Number of Pod replicas; linked to the launch command and RANK described later
          name: worker-containerd  # Name of the Pod
          policies:
          - event: PodEvicted
            action: RestartJob
          template:
            spec:
              volumes:
                - name: ascend-driver
                  hostPath: {path: /usr/local/Ascend/driver}
                - name: ascend-add-ons
                  hostPath: {path: /usr/local/Ascend/add-ons}
                - name: npu-smi
                  hostPath: {path: /usr/local/sbin/npu-smi}
                - name: localtime
                  hostPath: {path: /etc/localtime}
                - name: sfs
                  hostPath: {path: /mnt/sfs_turbo/}
                - name: ascend-install
                  hostPath: {path: /etc/ascend_install.inf}
                - name: dcmi
                  hostPath: {path: /usr/local/dcmi}
              hostNetwork: true
              dnsPolicy: ClusterFirstWithHostNet
              containers:
              - name:  ${container_name} # Container name
                image: ${image_name}  # Image address
                command: ["/bin/bash", "-c"]
                args:
                  - ${command}
                resources:
                  requests: 
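                    huawei.com/ascend-1980: "16"  # Number of NPUs requested; must match the limits value. The system uses this number to decide whether the Pod occupies the node exclusively
                  limits: 
                    huawei.com/ascend-1980: "16" # Modify as needed: number of NPUs required per node; keep the key unchanged.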
                ports:
                - containerPort: 29500
                  name: trainport
                  protocol: TCP
                volumeMounts:
                - name: ascend-driver # Driver mount; leave unchanged
                  mountPath: /usr/local/Ascend/driver
                - name: ascend-add-ons # Driver mount; leave unchanged
                  mountPath: /usr/local/Ascend/add-ons
                - name: npu-smi
                  mountPath: /usr/local/sbin/npu-smi
                - name: localtime
                  mountPath: /etc/localtime
                - name: sfs
                  mountPath: /mnt/sfs_turbo/
                - name: ascend-install
                  mountPath: /etc/ascend_install.inf
                - name: dcmi
                  mountPath: /usr/local/dcmi
                env:
                - name: HYPERJOB_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['volcano.sh/hyperjob-name']
                - name: HYPERJOB_REPLICATEDJOB_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['volcano.sh/hyperjob-replicatedjob-name']
                - name: TASK_SPEC
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['volcano.sh/task-spec']
                - name: HYPERJOB_REPLICATEDJOB_INDEX
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['volcano.sh/hyperjob-replicatedjob-index']
                - name: TASK_INDEX
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['volcano.sh/task-index']
                - name: VC_MAIN_HOSTS
                  value: "1"
                - name: ip
                  valueFrom:
                    fieldRef:
                      fieldPath: status.hostIP
                - name: RANK_TABLE_FILE
                  value: "/user/config/"
                - name: GLOO_SOCKET_IFNAME
                  value: "${ifname}"
                - name: TP_SOCKET_IFNAME
                  value: "${ifname}"
                - name: HCCL_SOCKET_IFNAME
                  value: "${ifname}"
              restartPolicy: OnFailure
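
The comments in the template note that both replicas values are linked to the launch command and RANK. As a rough sketch only, the Python below shows one way a launch script could derive a node rank, node count, and total process count from the injected environment variables HYPERJOB_REPLICATEDJOB_INDEX and TASK_INDEX. The function names node_rank and world_size, the constants, the vcjob-first ordering, and the one-process-per-NPU assumption are all hypothetical and introduced for illustration; the mapping used by your actual training command may differ.

```python
import os

# Assumed to mirror the template above: 2 vcjobs, 1 Pod per vcjob, 16 NPUs per Pod.
VCJOB_REPLICAS = 2
TASK_REPLICAS = 1
NPUS_PER_POD = 16


def node_rank(job_index: int, task_index: int, task_replicas: int = TASK_REPLICAS) -> int:
    """Hypothetical ordering: vcjob index first, then Pod index inside that vcjob."""
    if not 0 <= task_index < task_replicas:
        raise ValueError("task_index must be in [0, task_replicas)")
    return job_index * task_replicas + task_index


def world_size(vcjob_replicas: int = VCJOB_REPLICAS,
               task_replicas: int = TASK_REPLICAS,
               npus_per_pod: int = NPUS_PER_POD) -> int:
    """Total process count, assuming one training process per NPU."""
    return vcjob_replicas * task_replicas * npus_per_pod


if __name__ == "__main__":
    # Inside a Pod these variables are set by the env section of the template;
    # the "0" defaults only make the sketch runnable outside the cluster.
    job_index = int(os.environ.get("HYPERJOB_REPLICATEDJOB_INDEX", "0"))
    task_index = int(os.environ.get("TASK_INDEX", "0"))
    print("node_rank:", node_rank(job_index, task_index))
    print("nnodes:", VCJOB_REPLICAS * TASK_REPLICAS)
    print("world_size:", world_size())
```

If you change either replicas value or the NPU count in the template, the corresponding constants in such a script would need to change with them.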

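Parameter description:

  • ${container_name}: the container name. You can choose any name here, for example ascend-train.
  • ${image_name}: the address of the image uploaded to SWR in "Step 5: Modify the Image and Upload It to SWR".
  • ${command}: the command that runs automatically in the container after the Pod is created from config.yaml. It must be modified when you start a training job. For details, see "Step 1: Generate and Modify the Training Command".
  • ${ifname}: run the ifconfig command on the host to find the actual NIC name, then replace the corresponding NIC name in the YAML with it.

  • /mnt/sfs_turbo: the default working directory on the host where SFS Turbo is mounted. It stores the code, data, and other files required for training. /mnt/sfs_turbo can also be mapped into the container as the mount point for the host directory. The host and the container use different file systems, so the two paths can be identical for ease of access.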
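
The placeholders described above use the ${name} form, which happens to match the syntax of Python's standard string.Template. The following is a minimal sketch, not an official tool, of filling them in programmatically. The render function, the shortened FRAGMENT, and every example value (image address, command, NIC name) are hypothetical; in practice you would read the full config.yaml and supply your own values. The sketch does not handle YAML quoting, so a command containing characters such as ": " or "#" would need to be quoted or otherwise escaped separately.

```python
from string import Template

# A shortened fragment of the template; in practice read the full config.yaml.
FRAGMENT = """\
containers:
- name: ${container_name}
  image: ${image_name}
  args:
    - ${command}
  env:
  - name: HCCL_SOCKET_IFNAME
    value: "${ifname}"
"""


def render(template_text: str, values: dict) -> str:
    """Replace every ${...} placeholder; raises KeyError if a value is missing."""
    return Template(template_text).substitute(values)


if __name__ == "__main__":
    # All values below are illustrative placeholders, not real resources.
    example_values = {
        "container_name": "ascend-train",
        "image_name": "example-registry/example-org/example-image:tag",
        "command": "echo hello",
        "ifname": "eth0",
    }
    print(render(FRAGMENT, example_values))
```

Using substitute rather than safe_substitute is a deliberate choice here: a forgotten placeholder fails immediately instead of reaching the cluster as a literal ${...} string.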
