Updated: 2025-10-22 GMT+08:00

How to Configure CANN Application Log Storage to OBS

Creating a PVC for OBS

  1. Create an OBS bucket.
  2. On the CCE console, create a PV and a PVC associated with the OBS bucket.
    1. Go to the console of the CCE cluster where OBS is to be mounted.

    2. Create a PersistentVolume (PV).

    3. Create a PersistentVolumeClaim (PVC).

  3. After creation, the PVC is visible in CCE and can be referenced directly in containers:
    kubectl get pvc pvc-obs-glj -o yaml
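For step 2, a statically provisioned PV/PVC pair created on the console corresponds roughly to manifests like the following. This is a hypothetical sketch, assuming CCE's everest CSI driver for OBS (`obs.csi.everest.io`, storage class `csi-obs`); the bucket name, region, and capacity are placeholders, and the access-credential secret configuration is omitted:

```yaml
# Hypothetical sketch of a static OBS PV/PVC pair for CCE; adapt to your environment.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-obs-glj
spec:
  accessModes: [ReadWriteMany]
  capacity:
    storage: 1Gi                      # nominal; OBS capacity is not enforced
  storageClassName: csi-obs
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: obs.csi.everest.io
    volumeHandle: <obs-bucket-name>   # the bucket created in step 1
    fsType: obsfs
    volumeAttributes:
      everest.io/obs-volume-type: STANDARD
      everest.io/region: <region>
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-obs-glj
spec:
  accessModes: [ReadWriteMany]
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-obs
  volumeName: pv-obs-glj              # bind to the PV above
```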

Associating a PVC with a Training Job

To write logs into the corresponding bucket, you typically need to modify the training job's config.yaml as follows so that it references the PVC created in the previous step.

  • Obtain the job ID and node name so that the logs produced by each task can be written to a task-specific path in the bucket.

  • Define a redirect path that references the job ID and node name obtained above.

  • Define the dump rules.

  • A complete YAML example (for reference):
    apiVersion: batch.volcano.sh/v1alpha1
    kind: HyperJob 
    metadata: 
      name: hyperjob-test
    spec:
      replicatedJobs:
       - replicas: 2 # number of vcjobs; linked to WORLD_SIZE and RANK in the startup command
         name: vcjob-test # name of the vcjob
         template:
           minAvailable: 2 # minimum available count within the vcjob; for AI tasks this equals the number of replicas per vcjob
           tasks:
            - replicas: 2 # number of Pod replicas; linked to WORLD_SIZE and RANK in the startup command
              name: worker  # Pod name
              policies:
              - event: PodEvicted
                action: RestartJob
              template:
                spec:
                  volumes:
                    - name: ascend-driver
                      hostPath: {path: /usr/local/Ascend/driver}
                    - name: ascend-add-ons
                      hostPath: {path: /usr/local/Ascend/add-ons}
                    - name: npu-smi
                      hostPath: {path: /usr/local/sbin/npu-smi}
                    - name: localtime
                      hostPath: {path: /etc/localtime}
                    - name: sfs
                      hostPath: {path: /mnt/sfs_turbo/}
                    - name: ascend-install
                      hostPath: {path: /etc/ascend_install.inf}
                    - name: dcmi
                      hostPath: {path: /usr/local/dcmi}
                    - name: pvc-obs-glj  # PVC mount
                      persistentVolumeClaim:
                        claimName: pvc-obs-glj
                  containers:
                  - name: ${container_name} # container name
                    image: ${image_name} # image URL
                    command: ["/bin/bash", "-c"]
                    args:
                      - ${command}
                    resources:
                      requests: 
                        huawei.com/ascend-1980: "16"  # number of NPUs requested; must match the limits value. The system decides whether the Pod exclusively occupies a node based on this count
                      limits: 
                        huawei.com/ascend-1980: "16" # modify as needed: cards required per node; keep the key unchanged
                    ports:
                    - containerPort: 29500
                      name: trainport
                      protocol: TCP
                    volumeMounts:
                    - name: ascend-driver # driver mount; do not change
                      mountPath: /usr/local/Ascend/driver
                    - name: ascend-add-ons # driver mount; do not change
                      mountPath: /usr/local/Ascend/add-ons
                    - name: npu-smi
                      mountPath: /usr/local/sbin/npu-smi
                    - name: localtime
                      mountPath: /etc/localtime
                    - name: sfs
                      mountPath: /mnt/sfs_turbo/
                    - name: ascend-install
                      mountPath: /etc/ascend_install.inf
                    - name: dcmi
                      mountPath: /usr/local/dcmi
                    - name: pvc-obs-glj
                      mountPath: /data/logs
                    env:
                    - name: NODE_NAME
                      valueFrom:
                        fieldRef:
                          apiVersion: v1
                          fieldPath: spec.nodeName
                    - name: JOB_NAME
                      valueFrom:
                        fieldRef:
                          apiVersion: v1
                          fieldPath: metadata.labels['app']
                    - name: ASCEND_PROCESS_LOG_PATH
                      value: /data/logs/$(JOB_NAME)/$(NODE_NAME)/ascend_plog/
                    - name: ASCEND_WORK_PATH
                      value: /data/logs/$(JOB_NAME)/$(NODE_NAME)/ascend_work_path/log/
                    - name: HYPERJOB_NAME
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.annotations['volcano.sh/hyperjob-name']
                    - name: HYPERJOB_REPLICATEDJOB_NAME
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.annotations['volcano.sh/hyperjob-replicatedjob-name']
                    - name: TASK_SPEC
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.annotations['volcano.sh/task-spec']
                    - name: HYPERJOB_REPLICATEDJOB_INDEX
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.annotations['volcano.sh/hyperjob-replicatedjob-index']
                    - name: TASK_INDEX
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.annotations['volcano.sh/task-index']
                  restartPolicy: OnFailure
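In the manifest above, the Downward API injects NODE_NAME and JOB_NAME, and the two Ascend log variables are built from them by variable expansion. A minimal local sketch of that expansion, using placeholder values in place of the Downward API:

```shell
# Placeholder values; inside the Pod these come from the Downward API
JOB_NAME=vcjob-test
NODE_NAME=node-worker-0

# Same expansion the container applies to the two Ascend log variables
ASCEND_PROCESS_LOG_PATH="/data/logs/${JOB_NAME}/${NODE_NAME}/ascend_plog/"
ASCEND_WORK_PATH="/data/logs/${JOB_NAME}/${NODE_NAME}/ascend_work_path/log/"

echo "$ASCEND_PROCESS_LOG_PATH"   # → /data/logs/vcjob-test/node-worker-0/ascend_plog/
echo "$ASCEND_WORK_PATH"          # → /data/logs/vcjob-test/node-worker-0/ascend_work_path/log/
```

Because /data/logs is the OBS mount (pvc-obs-glj), each job/node pair writes its plog and work logs to a distinct prefix in the bucket.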

How to launch a job with a unique UUID suffix:

kubectl apply -f <(sed "s/pytorch-dist-git2/pytorch-dist-git2-$(uuidgen | head -c 6)/g" pytorchjob-qwen2.5VL-3B-lora.yaml)
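The command above rewrites every occurrence of the job name in the manifest with a 6-character random suffix before piping it to kubectl apply, so repeated submissions do not collide. The renaming step alone can be sketched as follows (the job name is taken from the command above; the suffix here is drawn from /dev/urandom instead of uuidgen, which may not be installed everywhere):

```shell
# 6 hex characters, analogous to: uuidgen | head -c 6
SUFFIX=$(od -An -N3 -tx1 /dev/urandom | tr -d ' \n')

# Rewrite the job name inside the manifest text, as the sed in the command does
printf 'name: pytorch-dist-git2\n' | sed "s/pytorch-dist-git2/pytorch-dist-git2-${SUFFIX}/g"
```

The process substitution `<(...)` then feeds the rewritten manifest to `kubectl apply -f` without modifying the YAML file on disk.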
