Using Snt9B for Inference in a Lite Cluster Resource Pool

Description

This case outlines the process of using the Deployment mechanism to deploy a real-time inference service in the Snt9B environment. Create a pod to host the service, log in to the pod container to deploy the real-time service, and create a terminal as the client to access the service to test its functions.

Figure 1 Task diagram
Click to enlarge

Procedure

Pulls the image. The test image is bert_pretrain_mindspore:v1, which contains the test data and code.

docker pull swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1
docker tag swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1 bert_pretrain_mindspore:v1

Create the config.yaml file on the host.

Configure pods using this file. For debugging, start a pod using the sleep command. Alternatively, replace the command with the boot command for your job (for example, python inference.py). The job will run once the container starts.

The file content is as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: yourapp
  labels:
      app: infers
spec:
  replicas: 1
  selector:
    matchLabels:
      app: infers
  template:
    metadata: 
      labels:
         app: infers
    spec:
      schedulerName: volcano
      nodeSelector:
        accelerator/huawei-npu: ascend-1980
      containers:
      - image: bert_pretrain_mindspore:v1                  # Inference image name
        imagePullPolicy: IfNotPresent
        name: mindspore
        command:
        - "sleep"
        - "1000000000000000000"
        resources:
          requests:
            huawei.com/ascend-1980: "1"             # Number of required PUs. The keys remain the same. Number of required NPUs. The maximum value is 16. You can add lines below to configure resources such as memory and CPU.
          limits:
            huawei.com/ascend-1980: "1"             # Limit the number of PUs. The keys remain the same. The value must be consistent with that in requests.
        volumeMounts:
          - name: ascend-driver               # Mount driver. Retain the settings.
          mountPath: /usr/local/Ascend/driver
          - name: ascend-add-ons           # Mount driver. Retain the settings.
          mountPath: /usr/local/Ascend/add-ons
          - name: hccn                             # HCCN configuration of the driver. Retain the settings.
          mountPath: /etc/hccn.conf
        - name: npu-smi                             #npu-smi
          mountPath: /usr/local/sbin/npu-smi
        - name: localtime                       #The container time must be the same as the host time.
          mountPath: /etc/localtime
      volumes:
      - name: ascend-driver
        hostPath:
          path: /usr/local/Ascend/driver
      - name: ascend-add-ons
        hostPath:
          path: /usr/local/Ascend/add-ons
      - name: hccn
        hostPath:
          path: /etc/hccn.conf
      - name: npu-smi
        hostPath:
          path: /usr/local/sbin/npu-smi
      - name: localtime
        hostPath:
          path: /etc/localtime

Create a pod based on the config.yaml file.
```
kubectl apply -f config.yaml
```
Run the following command to check the pod startup status. If 1/1 running is displayed, the startup is successful.
```
kubectl get pod -A
```
Go to the container, replace {pod_name} with your pod name (displayed by the get pod command), and replace {namespace} with your namespace (default).
```
kubectl exec -it {pod_name} bash -n {namespace}
```

Activate the conda mode.

su - ma-user //Switch the user identity.
conda activate MindSpore //Activate the MindSpore environment.

Create test code test.py.

from flask import Flask, request
import json 
app = Flask(__name__)

@app.route('/greet', methods=['POST'])
def say_hello_func():
    print("----------- in hello func ----------")
    data = json.loads(request.get_data(as_text=True))
    print(data)
    username = data['name']
    rsp_msg = 'Hello, {}!'.format(username)
    return json.dumps({"response":rsp_msg}, indent=4)

@app.route('/goodbye', methods=['GET'])
def say_goodbye_func():
    print("----------- in goodbye func ----------")
    return '\nGoodbye!\n'


@app.route('/', methods=['POST'])
def default_func():
    print("----------- in default func ----------")
    data = json.loads(request.get_data(as_text=True))
    return '\n called default func !\n {} \n'.format(str(data))

# host must be "0.0.0.0", port must be 8080
if __name__ == '__main__':
    app.run(host="0.0.0.0", port=8080)

Execute the code. After the code is executed, a real-time service is deployed. The container is the server.

python test.py

Figure 2 Deploying a real-time service
Click to enlarge

Open a terminal in XShell and access the container (client) by referring to steps 5 to 7. Run the following commands to test the functions of the three APIs of the custom image. If the following information is displayed, the service is successfully invoked.
```
curl -X POST -H "Content-Type: application/json" --data '{"name":"Tom"}'  127.0.0.1:8080/
curl -X POST -H "Content-Type: application/json" --data '{"name":"Tom"}' 127.0.0.1:8080/greet
curl -X GET 127.0.0.1:8080/goodbye
```
Figure 3 Accessing a real-time service

Set limit and request to proper values to restrict the number of CPUs and memory size. A single Snt9B node is equipped with eight Snt9B PUs and 192 vCPUs 1,536 GB. Properly plan the CPU and memory allocations to avoid task failures due to insufficient CPU and memory limits.