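Updated: 2025-10-22 GMT+08:00
Preparing the Training config.yaml File
The config.yaml file is used to schedule the Kubernetes (k8s) cluster when a training job is launched. Create your own file based on the template below.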
apiVersion: batch.volcano.sh/v1alpha1
kind: HyperJob
metadata:
  name: training-test2025
  labels:
    ring-controller.cce: ascend-1980
spec:
  plugins:
    configmap1980:
    - --rank-table-version=v2
  replicatedJobs:
  - replicas: 2                # Number of vcjobs; tied to the launch command and RANK settings used later
    name: vcjob-test           # vcjob name
    template:
      minAvailable: 1          # Minimum available Pods in the vcjob; for AI jobs this equals the number of replicas in each vcjob
      tasks:
      - replicas: 1            # Number of Pod replicas; tied to the launch command and RANK settings used later
        name: worker-containerd  # Pod name
        policies:
        - event: PodEvicted
          action: RestartJob
        template:
          spec:
            volumes:
            - name: ascend-driver
              hostPath: {path: /usr/local/Ascend/driver}
            - name: ascend-add-ons
              hostPath: {path: /usr/local/Ascend/add-ons}
            - name: npu-smi
              hostPath: {path: /usr/local/sbin/npu-smi}
            - name: localtime
              hostPath: {path: /etc/localtime}
            - name: sfs
              hostPath: {path: /mnt/sfs_turbo/}
            - name: ascend-install
              hostPath: {path: /etc/ascend_install.inf}
            - name: dcmi
              hostPath: {path: /usr/local/dcmi}
            hostNetwork: true
            dnsPolicy: ClusterFirstWithHostNet
            containers:
            - name: ${container_name}   # Container name
              image: ${image_name}      # Image address
              command: ["/bin/bash", "-c"]
              args:
              - ${command}
              resources:
                requests:
                  huawei.com/ascend-1980: "16"   # Number of NPUs requested; must match the limits value. The system uses this number to decide whether the Pod occupies the node exclusively
                limits:
                  huawei.com/ascend-1980: "16"   # Modify as needed: number of NPUs required per node. Keep the key unchanged
              ports:
              - containerPort: 29500
                name: trainport
                protocol: TCP
              volumeMounts:
              - name: ascend-driver            # Driver mount; do not change
                mountPath: /usr/local/Ascend/driver
              - name: ascend-add-ons           # Driver mount; do not change
                mountPath: /usr/local/Ascend/add-ons
              - name: npu-smi
                mountPath: /usr/local/sbin/npu-smi
              - name: localtime
                mountPath: /etc/localtime
              - name: sfs
                mountPath: /mnt/sfs_turbo/
              - name: ascend-install
                mountPath: /etc/ascend_install.inf
              - name: dcmi
                mountPath: /usr/local/dcmi
              env:
              - name: HYPERJOB_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['volcano.sh/hyperjob-name']
              - name: HYPERJOB_REPLICATEDJOB_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['volcano.sh/hyperjob-replicatedjob-name']
              - name: TASK_SPEC
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['volcano.sh/task-spec']
              - name: HYPERJOB_REPLICATEDJOB_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['volcano.sh/hyperjob-replicatedjob-index']
              - name: TASK_INDEX
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['volcano.sh/task-index']
              - name: VC_MAIN_HOSTS
                value: "1"
              - name: ip
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
              - name: RANK_TABLE_FILE
                value: "/user/config/"
              - name: GLOO_SOCKET_IFNAME
                value: "${ifname}"
              - name: TP_SOCKET_IFNAME
                value: "${ifname}"
              - name: HCCL_SOCKET_IFNAME
                value: "${ifname}"
            restartPolicy: OnFailure
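The template comments note that the two replicas values are tied to the launch command and RANK settings. The sketch below illustrates that arithmetic for the template's values (2 vcjobs, 1 Pod per vcjob, 16 NPUs per Pod). The function names are hypothetical, and the node-rank formula is an assumption made here for illustration (vcjobs numbered first, then Pods within each vcjob); the actual rank handling is determined by the training command generated in a later step.

```python
def total_pods(vcjob_replicas: int, pods_per_vcjob: int) -> int:
    """Pods created by the HyperJob: one set of task Pods per vcjob."""
    return vcjob_replicas * pods_per_vcjob


def total_npus(vcjob_replicas: int, pods_per_vcjob: int, npus_per_pod: int) -> int:
    """NPUs across the whole job, i.e. the expected world size."""
    return total_pods(vcjob_replicas, pods_per_vcjob) * npus_per_pod


def node_rank(replicatedjob_index: int, task_index: int, pods_per_vcjob: int) -> int:
    """ASSUMED ordering, not taken from the guide: Pods of vcjob 0 come first,
    then Pods of vcjob 1, and so on. Inside a container the two indices would
    come from HYPERJOB_REPLICATEDJOB_INDEX and TASK_INDEX."""
    return replicatedjob_index * pods_per_vcjob + task_index


# Values from the template: replicas: 2 (vcjobs), replicas: 1 (Pods), "16" NPUs.
print(total_pods(2, 1))      # number of Pods (nodes) in the job
print(total_npus(2, 1, 16))  # total NPUs in the job
print(node_rank(1, 0, 1))    # assumed rank of the Pod in the second vcjob
```

If either replicas value or the NPU count changes, the node count and world size passed to the launch command would need to change with it.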
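Parameter description:
- ${container_name}: the container name. You can choose any name here, for example ascend-train.
- ${image_name}: the address of the image in SWR, as uploaded in "Step 5: Modify and Upload the Image to SWR".
- ${command}: the command that runs automatically inside the container after the Pod is created from config.yaml. It must be modified when you start a training job; for details, see "Step 1: Generate and Modify the Training Command".
- ${ifname}: run the ifconfig command on the host to find the actual NIC name, then replace the corresponding NIC name in the YAML with it.
- /mnt/sfs_turbo: the default working directory on the host where SFS Turbo is mounted. It holds the code, data, and other files required for training. /mnt/sfs_turbo can also be mapped into the container as the mount point for the host directory. The host and the container use different file systems, so the two paths do not have to match, but keeping them identical makes access easier.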
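The placeholders above use the ${name} form, which is the same syntax that Python's standard string.Template understands, so one way to fill them in is sketched below. This is only an illustration on a short excerpt of the template: the render_config helper is hypothetical, and the image address, command, and NIC name are made-up example values rather than values from this guide.

```python
from string import Template

# Short excerpt of config.yaml; the full template uses the same placeholders.
CONFIG_EXCERPT = """\
containers:
- name: ${container_name}
  image: ${image_name}
  command: ["/bin/bash", "-c"]
  args:
  - ${command}
  env:
  - name: HCCL_SOCKET_IFNAME
    value: "${ifname}"
"""


def render_config(template_text: str, values: dict) -> str:
    """Replace every ${name} placeholder; raises KeyError if a value is missing."""
    return Template(template_text).substitute(values)


# Hypothetical example values; replace them with your own.
example_values = {
    "container_name": "ascend-train",
    "image_name": "swr.example.com/my-org/my-image:tag",  # made-up address
    "command": "echo placeholder-training-command",       # made-up command
    "ifname": "eth0",                                      # made-up NIC name
}

rendered = render_config(CONFIG_EXCERPT, example_values)
print(rendered)
```

Using substitute rather than safe_substitute means a forgotten placeholder fails loudly instead of leaving ${...} in the file. A multi-line command would need YAML-aware quoting, which this sketch does not handle.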