Updated on 2024-12-09 GMT+08:00

Running a Distributed Training Job with Snt9B in a Lite Cluster Resource Pool

Scenario

This case describes how to run a distributed training job on Snt9B. The Volcano scheduler is installed in Cluster resource pools by default, and training jobs are delivered to the Lite pool cluster as Volcano jobs by default. The test case uses the BERT NLP model.

Figure 1 Task diagram

Procedure

  1. Pull the image. The test image for this case is bert_pretrain_mindspore:v1, which already contains the test data and code.

    docker pull swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1
    docker tag swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1 bert_pretrain_mindspore:v1
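
    Optionally, you can verify that the tagged image is available locally (a quick check, not required by the procedure):

    docker images | grep bert_pretrain_mindspore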

  2. Create a config.yaml file on the host.

    The config.yaml file is used to configure the pod. In this example, the sleep command is used to start the pod so that you can enter the pod for debugging. You can also change command to your job's startup command (for example, python train.py), which is then executed after the container starts; a sketch of this change follows the YAML below.

    The content of config.yaml is as follows:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: configmap1980-yourvcjobname     # Keep the "configmap1980-" prefix unchanged, followed by the vcjob name.
      namespace: default                      # Select a namespace; it must be the same as that of the vcjob below.
      labels:
        ring-controller.cce: ascend-1980   # Do not change.
    data:                    # Do not change the data content. After initialization, it is automatically modified by the Volcano plugin.
      jobstart_hccl.json: |
        {
            "status":"initializing"
        }
    ---
    apiVersion: batch.volcano.sh/v1alpha1   # The value cannot be changed. The volcano API must be used.
    kind: Job                               # Only the job type is supported at present.
    metadata:
      name: yourvcjobname                  # Job name, which must be the same as that in the ConfigMap.
      namespace: default                      # Must be the same as that of the ConfigMap.
      labels:
        ring-controller.cce: ascend-1980   # Do not change.
        fault-scheduling: "force"
    spec:
      minAvailable: 1                       # The value of minAvailable is 1 in a single-node scenario and N in an N-node distributed scenario.
      schedulerName: volcano                # Do not change. Use the Volcano scheduler to schedule jobs.
      policies:
        - event: PodEvicted
          action: RestartJob
      plugins:
        configmap1980:
        - --rank-table-version=v2  # Do not change. A v2 rank table file is generated.
        env: []
        svc:
        - --publish-not-ready-addresses=true
      maxRetry: 3
      queue: default
      tasks:
      - name: "yourvcjobname-1"
        replicas: 1                              # The value of replicas is 1 in a single-node scenario and N in an N-node scenario. The number of NPUs in the requests field is 8 in an N-node scenario.
        template:
          metadata:
            labels:
              app: mindspore
              ring-controller.cce: ascend-1980  # Do not change. The value must be the same as the label in the ConfigMap.
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  - labelSelector:
                      matchExpressions:
                        - key: volcano.sh/job-name
                          operator: In
                          values:
                            - yourvcjobname
                    topologyKey: kubernetes.io/hostname
            containers:
            - image: bert_pretrain_mindspore:v1               # Image address of the training framework, which can be modified.
              imagePullPolicy: IfNotPresent
              name: mindspore
              env:
              - name: name                               # The value must be the same as that of Jobname.
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.name
              - name: ip                                       # IP address of the physical node, which is used to identify the node where the pod is running
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
              - name: framework
                value: "MindSpore"
              command:
              - "sleep"
              - "1000000000000000000"
              resources:
                requests:
                  huawei.com/ascend-1980: "1"                 # Number of requested NPUs; keep the key unchanged. The maximum value is 16. You can add lines below to configure resources such as memory and CPU.
                limits:
                  huawei.com/ascend-1980: "1"                 # NPU limit; keep the key unchanged. The value must be the same as that in requests.
              volumeMounts:
              - name: ascend-driver               # Driver mount; do not change.
                mountPath: /usr/local/Ascend/driver
              - name: ascend-add-ons           # Driver mount; do not change.
                mountPath: /usr/local/Ascend/add-ons
              - name: localtime
                mountPath: /etc/localtime
              - name: hccn                             # Driver hccn configuration; do not change.
                mountPath: /etc/hccn.conf
              - name: npu-smi                             # npu-smi tool
                mountPath: /usr/local/sbin/npu-smi
            nodeSelector:
              accelerator/huawei-npu: ascend-1980
            volumes:
            - name: ascend-driver
              hostPath:
                path: /usr/local/Ascend/driver
            - name: ascend-add-ons
              hostPath:
                path: /usr/local/Ascend/add-ons
            - name: localtime
              hostPath:
                path: /etc/localtime                      # Configure the Docker time.
            - name: hccn
              hostPath:
                path: /etc/hccn.conf
            - name: npu-smi
              hostPath:
                path: /usr/local/sbin/npu-smi
            restartPolicy: OnFailure
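
    If you want the job to start training automatically instead of sleeping, the command section of the container could be rewritten as below (a minimal sketch; "python train.py" is only the placeholder command mentioned above and must be replaced with your actual startup command):

    command:
    - "/bin/bash"
    - "-c"
    - "python train.py"    # Placeholder from this guide; replace with your training startup command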

  3. Create the pod based on the config.yaml file.

    kubectl apply -f config.yaml
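
    Recent Volcano releases register the vcjob short name for the Volcano job CRD, so you can also inspect the job object itself (an optional check, assuming that short name is available in your cluster):

    kubectl get vcjob -n default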

  4. Run the following command to check whether the pod has started. If the status 1/1 Running is displayed, the pod has started successfully.

    kubectl get pod -A
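
    If other pods are running in the cluster, you can narrow the output to this job by name (assuming the default namespace and the job name used in config.yaml):

    kubectl get pod -n default | grep yourvcjobname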

  5. Enter the container. Replace {pod_name} with your pod name (displayed by get pod) and {namespace} with your namespace (default by default).

    kubectl exec -it {pod_name} -n {namespace} -- bash

  6. Run the following command to view the NPU information.

    npu-smi info

    Kubernetes allocates resources to the pod based on the number of NPUs configured in the config.yaml file. As shown in the following figure, only one NPU is displayed in the container because one NPU was configured, which indicates that the configuration has taken effect.

    Figure 2 Viewing NPU information

  7. Change the number of NPUs for the pod. This case uses distributed training, so change the number of required NPUs to 8.

    Delete the created pod.
    kubectl delete -f config.yaml
    Change the NPU counts under "limits" and "requests" in the config.yaml file to 8 (a sed one-liner alternative is sketched after the figure).
    vi config.yaml
    Figure 3 Changing the number of NPUs
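
    Instead of editing with vi, the change can be applied with a single sed command (an illustrative one-liner; it rewrites both the requests and limits values because they share the same key):

    sed -i 's|huawei.com/ascend-1980: "1"|huawei.com/ascend-1980: "8"|g' config.yaml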

    Create the pod again.

    kubectl apply -f config.yaml
    Enter the container and view the NPU information. Replace {pod_name} with your pod name and {namespace} with your namespace (default by default).
    kubectl exec -it {pod_name} -n {namespace} -- bash
    npu-smi info

    As shown in the following figure, eight NPUs are displayed, so the pod has been configured successfully.

    Figure 4 Viewing NPU information

  8. Run the following command to view the inter-NPU communication configuration file.

    cat /user/config/jobstart_hccl.json

    Multi-NPU training depends on a rank table file (rank_table_file) as the configuration file for inter-NPU communication. This file is generated automatically; after the pod starts, it is located at /user/config/jobstart_hccl.json. Because generating the file takes some time, the training process must wait until the status field in /user/config/jobstart_hccl.json changes to completed before the inter-NPU communication information is available, as shown in the following figure. A polling sketch follows the figure.

    Figure 5 Inter-NPU communication configuration file
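
    The wait can be scripted with a small polling loop (a sketch using grep, matching the quoted JSON status field shown above):

    # Block until the rank table file reports "completed"
    until grep -q '"status" *: *"completed"' /user/config/jobstart_hccl.json; do
      sleep 5
    done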

  9. Start the training job.

    cd /home/ma-user/modelarts/user-job-dir/code/bert/
    export MS_ENABLE_GE=1
    export MS_GE_TRAIN=1
    python scripts/ascend_distributed_launcher/get_distribute_pretrain_cmd.py --run_script_dir ./scripts/run_distributed_pretrain_ascend.sh --hyper_parameter_config_dir ./scripts/ascend_distributed_launcher/hyper_parameter_config.ini --data_dir /home/ma-user/modelarts/user-job-dir/data/cn-news-128-1f-mind/ --hccl_config /user/config/jobstart_hccl.json --cmd_file ./distributed_cmd.sh
    bash scripts/run_distributed_pretrain_ascend.sh /home/ma-user/modelarts/user-job-dir/data/cn-news-128-1f-mind/ /user/config/jobstart_hccl.json
    Figure 6 Starting the training job

    Loading the training job takes some time. After waiting several minutes, you can run the following command to view the NPU information. As shown in the following figure, all eight NPUs are in use, which indicates that the training job is running.

    npu-smi info
    Figure 7 Viewing NPU information

    To stop the training job, run the following commands to kill the processes. After that, the process query shows that no Python processes are running.

    pkill -9 python
    ps -ef
    Figure 8 Stopping the training processes

    Configure the CPU and memory sizes through limits/requests. A single Snt9B node provides 8 Snt9B NPUs, 192 vCPUs, and 1536 GB of memory. Plan resources properly so that jobs are not blocked by CPU or memory limits that are too small; an illustrative example follows.
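
    For example, a pod that uses all eight NPUs of a node might request most of the node's CPU and memory while leaving headroom for system components (illustrative values only; size them to your workload):

    resources:
      requests:
        huawei.com/ascend-1980: "8"
        cpu: "160"           # Illustrative; the node has 192 vCPUs in total
        memory: 1200Gi       # Illustrative; the node has 1536 GB in total
      limits:
        huawei.com/ascend-1980: "8"
        cpu: "160"
        memory: 1200Gi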