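Updated: 2025-10-22 GMT+08:00
Preparing the Training config.yaml File
The config.yaml file is used to schedule the Kubernetes (k8s) cluster when a training job is started. Create your own file based on the following template.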
apiVersion: batch.volcano.sh/v1alpha1
kind: HyperJob
metadata:
  name: training-test2025
  labels:
    ring-controller.cce: ascend-1980
spec:
  plugins:
    configmap1980:
      - --rank-table-version=v2
  replicatedJobs:
    - replicas: 2                 # Number of vcjobs; linked to the launch command and RANK used later
      name: vcjob-test            # Name of the vcjob
      template:
        minAvailable: 1           # Minimum number of available Pods in the vcjob; for AI jobs, equal to the number of replicas in each vcjob
        tasks:
          - replicas: 1           # Number of Pod replicas; linked to the launch command and RANK used later
            name: worker-containerd   # Name of the Pod
            policies:
              - event: PodEvicted
                action: RestartJob
            template:
              spec:
                volumes:
                  - name: ascend-driver
                    hostPath: {path: /usr/local/Ascend/driver}
                  - name: ascend-add-ons
                    hostPath: {path: /usr/local/Ascend/add-ons}
                  - name: npu-smi
                    hostPath: {path: /usr/local/sbin/npu-smi}
                  - name: localtime
                    hostPath: {path: /etc/localtime}
                  - name: sfs
                    hostPath: {path: /mnt/sfs_turbo/}
                  - name: ascend-install
                    hostPath: {path: /etc/ascend_install.inf}
                  - name: dcmi
                    hostPath: {path: /usr/local/dcmi}
                hostNetwork: true
                dnsPolicy: ClusterFirstWithHostNet
                containers:
                  - name: ${container_name}   # Container name
                    image: ${image_name}      # Image address
                    command: ["/bin/bash", "-c"]
                    args:
                      - ${command}
                    resources:
                      requests:
                        huawei.com/ascend-1980: "16"   # Number of NPUs requested; must match the limits value. The system uses this number to decide whether the Pod occupies the node exclusively
                      limits:
                        huawei.com/ascend-1980: "16"   # Modify as needed: number of NPUs required per node; keep the key unchanged
                    ports:
                      - containerPort: 29500
                        name: trainport
                        protocol: TCP
                    volumeMounts:
                      - name: ascend-driver        # Driver mount; do not change
                        mountPath: /usr/local/Ascend/driver
                      - name: ascend-add-ons       # Driver mount; do not change
                        mountPath: /usr/local/Ascend/add-ons
                      - name: npu-smi
                        mountPath: /usr/local/sbin/npu-smi
                      - name: localtime
                        mountPath: /etc/localtime
                      - name: sfs
                        mountPath: /mnt/sfs_turbo/
                      - name: ascend-install
                        mountPath: /etc/ascend_install.inf
                      - name: dcmi
                        mountPath: /usr/local/dcmi
                    env:
                      - name: HYPERJOB_NAME
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.annotations['volcano.sh/hyperjob-name']
                      - name: HYPERJOB_REPLICATEDJOB_NAME
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.annotations['volcano.sh/hyperjob-replicatedjob-name']
                      - name: TASK_SPEC
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.annotations['volcano.sh/task-spec']
                      - name: HYPERJOB_REPLICATEDJOB_INDEX
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.annotations['volcano.sh/hyperjob-replicatedjob-index']
                      - name: TASK_INDEX
                        valueFrom:
                          fieldRef:
                            fieldPath: metadata.annotations['volcano.sh/task-index']
                      - name: VC_MAIN_HOSTS
                        value: "1"
                      - name: ip
                        valueFrom:
                          fieldRef:
                            fieldPath: status.hostIP
                      - name: RANK_TABLE_FILE
                        value: "/user/config/"
                      - name: GLOO_SOCKET_IFNAME
                        value: "${ifname}"
                      - name: TP_SOCKET_IFNAME
                        value: "${ifname}"
                      - name: HCCL_SOCKET_IFNAME
                        value: "${ifname}"
                restartPolicy: OnFailure
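The comments on the two replicas fields in the template state that they are linked to the launch command and to RANK. As a rough illustration of that link, the Python sketch below derives a node count, a world size, and a global node rank from the template values. The formula node_rank = job_index * tasks_per_job + task_index, the function name node_rank, and the one-process-per-NPU world size are assumptions made for illustration; this page does not define how the launch command consumes HYPERJOB_REPLICATEDJOB_INDEX and TASK_INDEX.

```python
import os

# Values copied from the template.
VCJOB_REPLICAS = 2   # replicatedJobs[].replicas
TASK_REPLICAS = 1    # tasks[].replicas (Pods per vcjob)
NPUS_PER_POD = 16    # huawei.com/ascend-1980 limit


def node_rank(job_index: int, task_index: int,
              tasks_per_job: int = TASK_REPLICAS) -> int:
    """Hypothetical global node rank: order by vcjob first, then by Pod inside the vcjob."""
    return job_index * tasks_per_job + task_index


# One Pod per node is assumed here, since each Pod requests all 16 NPUs.
total_nodes = VCJOB_REPLICAS * TASK_REPLICAS
# Assumes one training process per NPU.
world_size = total_nodes * NPUS_PER_POD

if __name__ == "__main__":
    # Inside a Pod, the env section of the template injects these variables.
    # The "0" defaults only let the sketch run outside a cluster.
    job_index = int(os.environ.get("HYPERJOB_REPLICATEDJOB_INDEX", "0"))
    task_index = int(os.environ.get("TASK_INDEX", "0"))
    print(f"nodes={total_nodes} world_size={world_size} "
          f"node_rank={node_rank(job_index, task_index)}")
```

If either replicas value changes, any node count or rank passed to the training command would likely need to change with it, which is presumably why the template flags both fields.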
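Parameter description:
- ${container_name}: the container name. You can choose any name, for example ascend-train.
- ${image_name}: the address of the image in SWR, that is, the image uploaded in "Step 5: Modify the image and upload it to SWR".
- ${command}: the command that runs automatically in the container after the Pod is created from config.yaml. It must be modified when you start a training job. For details, see "Step 1: Generate and modify the training command".
- ${ifname}: the name of the actual network interface on the host. Run the ifconfig command on the host to find it, then replace the corresponding interface name in the YAML.
- /mnt/sfs_turbo: the default working directory on the host where SFS Turbo is mounted. It stores the code, data, and other files needed for training. /mnt/sfs_turbo can also be mapped into the container as the directory through which the container mounts the host directory. The host and the container use different file systems, so the two paths can be identical for ease of access.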
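The ${...} placeholders described above can be filled in by hand, or with a small script. The following is a minimal Python sketch using the standard-library string.Template class, which happens to use the same ${name} syntax. The excerpt is shortened, and every value in the values dictionary (the image address, the command, and the interface name eth0) is a hypothetical stand-in, not a value from this guide.

```python
from string import Template

# Shortened excerpt; the full config.yaml uses the same ${...} placeholders.
EXCERPT = """\
containers:
  - name: ${container_name}
    image: ${image_name}
    args:
      - ${command}
    env:
      - name: HCCL_SOCKET_IFNAME
        value: "${ifname}"
"""

# Hypothetical values, for illustration only.
values = {
    "container_name": "ascend-train",
    "image_name": "swr.example.com/my-org/my-image:latest",
    "command": "echo placeholder-training-command",
    "ifname": "eth0",
}

# substitute() raises KeyError if any placeholder has no value,
# so an unfilled ${...} cannot slip through silently.
rendered = Template(EXCERPT).substitute(values)
print(rendered)
```

A real training command may contain characters that are special in YAML, in which case the substituted value would need to be quoted or written as a block scalar.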
