Configuring Storage of CANN Application Logs in OBS
Updated: 2025-10-22 GMT+08:00
Create a PVC for OBS
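A minimal sketch of a statically provisioned PV/PVC pair for an OBS bucket, assuming the cluster uses the CCE everest CSI add-on. `<bucket-name>`, `<region>`, and `<obs-access-secret>` are placeholders you must replace, and the exact `volumeAttributes` depend on your cluster version — check your cluster's storage documentation. The claim name `pvc-obs-glj` matches the `claimName` used in the training YAML in the next section.

```yaml
# Sketch only: assumes the CCE everest CSI add-on for OBS.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-obs-logs
  annotations:
    pv.kubernetes.io/provisioned-by: everest-csi-provisioner
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 1Gi                       # nominal; OBS capacity is not enforced
  csi:
    driver: obs.csi.everest.io
    volumeHandle: <bucket-name>        # OBS bucket that will hold the logs
    fsType: obsfs                      # obsfs for a parallel file system, s3fs for an object bucket
    volumeAttributes:
      storage.kubernetes.io/csiProvisionerIdentity: everest-csi-provisioner
      everest.io/region: <region>
    nodePublishSecretRef:
      name: <obs-access-secret>        # secret holding the OBS access credentials
      namespace: default
  persistentVolumeReclaimPolicy: Retain
  storageClassName: csi-obs
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-obs-glj                    # referenced by the training YAML below
  namespace: default
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-obs
  volumeName: pv-obs-logs
```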
Associate the PVC with the Training Job
To write logs into the target bucket, modify the training job's config.yaml as follows so that it references the PVC created in the previous step.
- A complete YAML example (for reference)
apiVersion: batch.volcano.sh/v1alpha1
kind: HyperJob
metadata:
  name: hyperjob-test
spec:
  replicatedJobs:
  - replicas: 2                    # number of vcjobs; linked to WORD_SIZE and RANK in the startup command
    name: vcjob-test               # vcjob name
    template:
      spec:
        minAvailable: 2            # minimum available Pods per vcjob; for AI jobs, equal to the replica count within each vcjob
        tasks:
        - replicas: 2              # number of Pod replicas; linked to WORD_SIZE and RANK in the startup command
          name: worker             # Pod name
          policies:
          - event: PodEvicted
            action: RestartJob
          template:
            spec:
              volumes:
              - name: ascend-driver
                hostPath: {path: /usr/local/Ascend/driver}
              - name: ascend-add-ons
                hostPath: {path: /usr/local/Ascend/add-ons}
              - name: npu-smi
                hostPath: {path: /usr/local/sbin/npu-smi}
              - name: localtime
                hostPath: {path: /etc/localtime}
              - name: sfs
                hostPath: {path: /mnt/sfs_turbo/}
              - name: ascend-install
                hostPath: {path: /etc/ascend_install.inf}
              - name: dcmi
                hostPath: {path: /usr/local/dcmi}
              - name: pvc-obs-glj          # PVC mount information
                persistentVolumeClaim:
                  claimName: pvc-obs-glj
              containers:
              - name: ${container_name}    # container name
                image: ${image_name}       # image address
                command: ["/bin/bash", "-c"]
                args:
                - ${command}
                resources:
                  requests:
                    huawei.com/ascend-1980: "16"   # number of NPUs requested; must match the limits value. The system uses this count to decide whether the Pod occupies a node exclusively
                  limits:
                    huawei.com/ascend-1980: "16"   # modify here: number of cards required per node; keep the key unchanged
                ports:
                - containerPort: 29500
                  name: trainport
                  protocol: TCP
                volumeMounts:
                - name: ascend-driver            # driver mount; do not change
                  mountPath: /usr/local/Ascend/driver
                - name: ascend-add-ons           # driver mount; do not change
                  mountPath: /usr/local/Ascend/add-ons
                - name: npu-smi
                  mountPath: /usr/local/sbin/npu-smi
                - name: localtime
                  mountPath: /etc/localtime
                - name: sfs
                  mountPath: /mnt/sfs_turbo/
                - name: ascend-install
                  mountPath: /etc/ascend_install.inf
                - name: dcmi
                  mountPath: /usr/local/dcmi
                - name: pvc-obs-glj
                  mountPath: /data/logs
                env:
                - name: NODE_NAME
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: spec.nodeName
                - name: JOB_NAME
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: metadata.labels['app']
                - name: ASCEND_PROCESS_LOG_PATH
                  value: /data/logs/$(JOB_NAME)/$(NODE_NAME)/ascend_plog/
                - name: ASCEND_WORK_PATH
                  value: /data/logs/$(JOB_NAME)/$(NODE_NAME)/ascend_work_path/log/
                - name: HYPERJOB_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['volcano.sh/hyperjob-name']
                - name: HYPERJOB_REPLICATEDJOB_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['volcano.sh/hyperjob-replicatedjob-name']
                - name: TASK_SPEC
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['volcano.sh/task-spec']
                - name: HYPERJOB_REPLICATEDJOB_INDEX
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['volcano.sh/hyperjob-replicatedjob-index']
                - name: TASK_INDEX
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['volcano.sh/task-index']
              restartPolicy: OnFailure
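Kubernetes expands `$(VAR)` references in `env` values using variables defined earlier in the same list, which is how the per-job, per-node log directories above are composed. A quick shell sketch of the resulting paths (the job and node names here are hypothetical examples; inside the Pod they come from the downward API):

```shell
# Hypothetical values; in the Pod these are injected via fieldRef.
JOB_NAME="hyperjob-test"
NODE_NAME="node-192-168-0-10"

# Same composition as ASCEND_PROCESS_LOG_PATH / ASCEND_WORK_PATH above.
echo "/data/logs/${JOB_NAME}/${NODE_NAME}/ascend_plog/"
echo "/data/logs/${JOB_NAME}/${NODE_NAME}/ascend_work_path/log/"
```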
To launch a job whose name carries a unique UUID suffix:
kubectl apply -f <(sed "s/pytorch-dist-git2/pytorch-dist-git2-$(uuidgen | head -c 6)/g" pytorchjob-qwen2.5VL-3B-lora.yaml)
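The command above substitutes a fresh 6-character suffix into every occurrence of the job name before applying the manifest, so repeated submissions do not collide on the resource name. A standalone sketch of the same technique (the file path and job name are hypothetical):

```shell
# Write a minimal manifest with a fixed job name (hypothetical example).
cat > /tmp/job.yaml <<'EOF'
metadata:
  name: hyperjob-test
EOF

# Take the first 6 characters of a UUID so each submission gets a unique name.
suffix=$(uuidgen | head -c 6)
sed "s/hyperjob-test/hyperjob-test-${suffix}/g" /tmp/job.yaml
# Pipe the result to `kubectl apply -f -` to submit it.
```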