Updated on 2024-12-09 GMT+08:00

Running a Distributed Training Job with Snt9B in a Lite Cluster Resource Pool

Scenario

This case describes how to run a distributed training job on Snt9B. The Volcano scheduler is installed in Cluster resource pools by default, and training jobs are delivered to the Lite pool cluster as Volcano jobs by default. The test case uses the BERT NLP model.

Figure 1 Task diagram

Procedure

  1. Pull the image. The test image for this case is bert_pretrain_mindspore:v1, which already contains the test data and code.

    docker pull swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1
    docker tag swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1 bert_pretrain_mindspore:v1
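
    Optionally, you can verify that the tagged image is available locally (a quick check, not required by the procedure):

    docker images | grep bert_pretrain_mindspore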

  2. Create a config.yaml file on the host.

    The config.yaml file is used to configure the pod. In this example, the sleep command is used to start the pod so that you can enter the pod for debugging. You can also change command to your job's startup command (for example, python train.py), which is then executed after the container starts; a sketch of this change follows the YAML below.

    The content of config.yaml is as follows:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: configmap1980-yourvcjobname     # Keep the "configmap1980-" prefix unchanged, followed by the vcjob name.
      namespace: default                      # Select a namespace; it must be the same as that of the vcjob below.
      labels:
        ring-controller.cce: ascend-1980   # Do not change.
    data:                    # Do not change the data content. After initialization, it is automatically modified by the Volcano plugin.
      jobstart_hccl.json: |
        {
            "status":"initializing"
        }
    ---
    apiVersion: batch.volcano.sh/v1alpha1   # The value cannot be changed. The volcano API must be used.
    kind: Job                               # Only the job type is supported at present.
    metadata:
      name: yourvcjobname                  # Job name, which must be the same as that in the ConfigMap.
      namespace: default                      # Must be the same as that of the ConfigMap.
      labels:
        ring-controller.cce: ascend-1980   # Do not change.
        fault-scheduling: "force"
    spec:
      minAvailable: 1                       # The value of minAvailable is 1 in a single-node scenario and N in an N-node distributed scenario.
      schedulerName: volcano                # Do not change. Use the Volcano scheduler to schedule jobs.
      policies:
        - event: PodEvicted
          action: RestartJob
      plugins:
        configmap1980:
        - --rank-table-version=v2  # Do not change. A v2 rank table file is generated.
        env: []
        svc:
        - --publish-not-ready-addresses=true
      maxRetry: 3
      queue: default
      tasks:
      - name: "yourvcjobname-1"
        replicas: 1                              # The value of replicas is 1 in a single-node scenario and N in an N-node scenario. The number of NPUs in the requests field is 8 in an N-node scenario.
        template:
          metadata:
            labels:
              app: mindspore
              ring-controller.cce: ascend-1980  # Do not change. The value must be the same as the label in the ConfigMap.
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  - labelSelector:
                      matchExpressions:
                        - key: volcano.sh/job-name
                          operator: In
                          values:
                            - yourvcjobname
                    topologyKey: kubernetes.io/hostname
            containers:
            - image: bert_pretrain_mindspore:v1               # Image address of the training framework, which can be modified.
              imagePullPolicy: IfNotPresent
              name: mindspore
              env:
              - name: name                               # The value must be the same as that of Jobname.
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.name
              - name: ip                                       # IP address of the physical node, which is used to identify the node where the pod is running
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
              - name: framework
                value: "MindSpore"
              command:
              - "sleep"
              - "1000000000000000000"
              resources:
                requests:
                  huawei.com/ascend-1980: "1"                 # Number of requested NPUs; keep the key unchanged. The maximum value is 16. You can add lines below to configure resources such as memory and CPU.
                limits:
                  huawei.com/ascend-1980: "1"                 # NPU limit; keep the key unchanged. The value must be the same as that in requests.
              volumeMounts:
              - name: ascend-driver               # Driver mount; do not change.
                mountPath: /usr/local/Ascend/driver
              - name: ascend-add-ons           # Driver mount; do not change.
                mountPath: /usr/local/Ascend/add-ons
              - name: localtime
                mountPath: /etc/localtime
              - name: hccn                             # Driver hccn configuration; do not change.
                mountPath: /etc/hccn.conf
              - name: npu-smi                             # npu-smi tool
                mountPath: /usr/local/sbin/npu-smi
            nodeSelector:
              accelerator/huawei-npu: ascend-1980
            volumes:
            - name: ascend-driver
              hostPath:
                path: /usr/local/Ascend/driver
            - name: ascend-add-ons
              hostPath:
                path: /usr/local/Ascend/add-ons
            - name: localtime
              hostPath:
                path: /etc/localtime                      # Configure the Docker time.
            - name: hccn
              hostPath:
                path: /etc/hccn.conf
            - name: npu-smi
              hostPath:
                path: /usr/local/sbin/npu-smi
            restartPolicy: OnFailure
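
    If you want the job to start training automatically instead of sleeping, the command section of the container could be rewritten as below (a minimal sketch; "python train.py" is only the placeholder command mentioned above and must be replaced with your actual startup command):

    command:
    - "/bin/bash"
    - "-c"
    - "python train.py"    # Placeholder from this guide; replace with your training startup command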

  3. Create the pod based on the config.yaml file.

    kubectl apply -f config.yaml
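
    Recent Volcano releases register the vcjob short name for the Volcano job CRD, so you can also inspect the job object itself (an optional check, assuming that short name is available in your cluster):

    kubectl get vcjob -n default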

  4. Run the following command to check whether the pod has started. If the status 1/1 Running is displayed, the pod has started successfully.

    kubectl get pod -A
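
    If other pods are running in the cluster, you can narrow the output to this job by name (assuming the default namespace and the job name used in config.yaml):

    kubectl get pod -n default | grep yourvcjobname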

  5. Enter the container. Replace {pod_name} with your pod name (displayed by get pod) and {namespace} with your namespace (default by default).

    kubectl exec -it {pod_name} -n {namespace} -- bash

  6. Run the following command to view the NPU information.

    npu-smi info

    Kubernetes allocates resources to the pod based on the number of NPUs configured in the config.yaml file. As shown in the following figure, only one NPU is displayed in the container because one NPU was configured, which indicates that the configuration has taken effect.

    Figure 2 Viewing NPU information

  7. Change the number of NPUs for the pod. This case uses distributed training, so change the number of required NPUs to 8.

    Delete the created pod.
    kubectl delete -f config.yaml
    Change the NPU counts under "limits" and "requests" in the config.yaml file to 8 (a sed one-liner alternative is sketched after the figure).
    vi config.yaml
    Figure 3 Changing the number of NPUs
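
    Instead of editing with vi, the change can be applied with a single sed command (an illustrative one-liner; it rewrites both the requests and limits values because they share the same key):

    sed -i 's|huawei.com/ascend-1980: "1"|huawei.com/ascend-1980: "8"|g' config.yaml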

    Create the pod again.

    kubectl apply -f config.yaml
    Enter the container and view the NPU information. Replace {pod_name} with your pod name and {namespace} with your namespace (default by default).
    kubectl exec -it {pod_name} -n {namespace} -- bash
    npu-smi info

    As shown in the following figure, eight NPUs are displayed, so the pod has been configured successfully.

    Figure 4 Viewing NPU information

  8. Run the following command to view the inter-NPU communication configuration file.

    cat /user/config/jobstart_hccl.json

    Multi-NPU training depends on a rank table file (rank_table_file) as the configuration file for inter-NPU communication. This file is generated automatically; after the pod starts, it is located at /user/config/jobstart_hccl.json. Because generating the file takes some time, the training process must wait until the status field in /user/config/jobstart_hccl.json changes to completed before the inter-NPU communication information is available, as shown in the following figure. A polling sketch follows the figure.

    Figure 5 Inter-NPU communication configuration file
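
    The wait can be scripted with a small polling loop (a sketch using grep, matching the quoted JSON status field shown above):

    # Block until the rank table file reports "completed"
    until grep -q '"status" *: *"completed"' /user/config/jobstart_hccl.json; do
      sleep 5
    done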

  9. Start the training job.

    cd /home/ma-user/modelarts/user-job-dir/code/bert/
    export MS_ENABLE_GE=1
    export MS_GE_TRAIN=1
    python scripts/ascend_distributed_launcher/get_distribute_pretrain_cmd.py --run_script_dir ./scripts/run_distributed_pretrain_ascend.sh --hyper_parameter_config_dir ./scripts/ascend_distributed_launcher/hyper_parameter_config.ini --data_dir /home/ma-user/modelarts/user-job-dir/data/cn-news-128-1f-mind/ --hccl_config /user/config/jobstart_hccl.json --cmd_file ./distributed_cmd.sh
    bash scripts/run_distributed_pretrain_ascend.sh /home/ma-user/modelarts/user-job-dir/data/cn-news-128-1f-mind/ /user/config/jobstart_hccl.json
    Figure 6 Starting the training job

    Loading the training job takes some time. After waiting several minutes, you can run the following command to view the NPU information. As shown in the following figure, all eight NPUs are in use, which indicates that the training job is running.

    npu-smi info
    Figure 7 Viewing NPU information

    To stop the training job, run the following commands to kill the processes. After that, the process query shows that no Python processes are running.

    pkill -9 python
    ps -ef
    Figure 8 Stopping the training processes

    Configure the CPU and memory sizes through limits/requests. A single Snt9B node provides 8 Snt9B NPUs, 192 vCPUs, and 1536 GB of memory. Plan resources properly so that jobs are not blocked by CPU or memory limits that are too small; an illustrative example follows.
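
    For example, a pod that uses all eight NPUs of a node might request most of the node's CPU and memory while leaving headroom for system components (illustrative values only; size them to your workload):

    resources:
      requests:
        huawei.com/ascend-1980: "8"
        cpu: "160"           # Illustrative; the node has 192 vCPUs in total
        memory: 1200Gi       # Illustrative; the node has 1536 GB in total
      limits:
        huawei.com/ascend-1980: "8"
        cpu: "160"
        memory: 1200Gi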