
Configuring the Lite Cluster Environment

Configure the Lite Cluster environment by following this section. These instructions apply to setting up the accelerator card environment.

Prerequisites

  • You have purchased and enabled cluster resources. For details, see Enabling Lite Cluster Resources.
  • To configure and use a cluster, you need to have a solid understanding of Kubernetes Basics, as well as basic knowledge of networks, storage, and images.

Configuration Process

Figure 1 Flowchart

Table 1 Configuration process

  1. Configuring the Lite Cluster Network: After purchasing a resource pool, create an elastic IP (EIP) and configure the network. Once the network is set up, you can access cluster resources through the EIP.
  2. Configuring kubectl: With kubectl configured, you can manage your Kubernetes clusters from the command line by running kubectl commands.
  3. Configuring Lite Cluster Storage: When no external storage is mounted, the storage space available to a container is limited by dockerBaseSize. Mounting external storage removes this limitation. Storage can be mounted to a container in several ways; choose the method that suits your scenario and service needs. (An illustrative mount command follows this table.)
  4. (Optional) Configuring the Driver: To use GPU or Ascend resources on nodes in a dedicated resource pool, configure the appropriate driver. If no custom driver is configured and the default driver does not meet your service requirements, upgrade the default driver to the required version.
  5. (Optional) Configuring Image Pre-provisioning: Lite Cluster resource pools support image pre-provisioning, which pulls images to the nodes in the pool in advance, accelerating image pulling during inference and large-scale distributed training.
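
As an illustration of step 3, if external storage (for example, an SFS file system) is already mounted on the node at a path such as /mnt/sfs_turbo, it can be passed into a container when the container is started. This is only a minimal sketch with a hypothetical mount point and a placeholder image, not one of the official configuration steps:

  docker run -tid -v /mnt/sfs_turbo:/data <image>:<tag> bash    //Mount the node directory /mnt/sfs_turbo to /data in the container.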

Quick Configuration of Lite Cluster Resources

This section shows how to quickly configure Lite Cluster resources, log in to a node to view the accelerator cards, and then run a training job. Before you start, purchase the required resources. For details, see Enabling Lite Cluster Resources.

  1. Log in to a node.

    (Recommended) Method 1: Binding an EIP

    Bind an EIP to the node and log in to it using an SSH tool such as Xshell or MobaXterm.

    1. Log in to the CCE console.
    2. On the CCE cluster details page, click Nodes. In the Nodes tab, click the name of the target node to go to the ECS page.
      Figure 2 Node management

    3. Bind an EIP.
      Select an existing EIP or create a new one.
      Figure 3 EIP

      If no EIP is available, click Buy EIP and complete the purchase.
      Figure 4 Binding an EIP
      Figure 5 Buying an EIP

      After the purchase is complete, refresh the list on the ECS page.

      Select the new EIP and click OK.
      Figure 6 Binding an EIP

    4. Log in to the node using MobaXterm or Xshell. To log in with MobaXterm, enter the EIP as the remote host. A plain ssh command also works; see the example after Figure 7.
      Figure 7 Logging in to a node
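
    If you prefer a command-line SSH client to MobaXterm or Xshell, a login from a terminal might look like the following. The login user and EIP are placeholders; use the credentials configured when the node was created.

      ssh root@<EIP>    //Replace <EIP> with the EIP bound to the node.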

    Method 2: Using Huawei Cloud Remote Login

    1. Log in to the CCE console.
    2. On the CCE cluster details page, click Nodes. In the Nodes tab, click the name of the target node to go to the ECS page.
      Figure 8 Node management

    3. Click Remote Login. In the displayed dialog box, click Log In.
      Figure 9 Remote login

    4. After setting parameters such as the password in CloudShell, click Connect to log in to the node. For details about CloudShell, see Logging In to a Linux ECS Using CloudShell.

  2. Configure the kubectl tool.

    Log in to the ModelArts console. From the navigation pane, choose AI Dedicated Resource Pools > Elastic Clusters.

    Click the new dedicated resource pool to open its details page, and then click the CCE cluster to open the cluster details page.

    On the CCE cluster details page, locate Connection Information in the cluster information.
    Figure 10 Connection Information

    Use kubectl.
    • To use kubectl through the intranet, install it on a node within the same VPC as the cluster. Click Configure next to kubectl to use the kubectl tool.
      Figure 11 Using kubectl through the intranet
    • To use kubectl through an EIP, install it on any node that can access the EIP.
      To bind an EIP, click Bind next to EIP.
      Figure 12 Binding an EIP

      Select an EIP and click OK. If no EIP is available, click Create EIP to create one.

      After the binding is complete, click Configure next to kubectl and use kubectl as prompted.
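
    Once kubectl is configured, a quick way to confirm that it can reach the cluster is to query basic cluster information. This is only a minimal check; the output depends on your resource pool.

      kubectl cluster-info    //Show the cluster endpoint that kubectl is connected to.
      kubectl get nodes -o wide    //List the cluster nodes. The Lite Cluster nodes should be in the Ready state.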

  3. Start a task using docker run.

    Docker is automatically installed on Snt9B clusters managed by CCE. The following steps are for testing and verification only: you can start the container directly, without creating a Deployment or Volcano job. The training test case uses the BERT NLP model.

    1. Pull the image. The test image is bert_pretrain_mindspore:v1, which contains the test data and code.
      docker pull swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1
      docker tag swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1 bert_pretrain_mindspore:v1
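      docker images | grep bert_pretrain_mindspore    //(Optional check, not part of the original steps) Confirm that the image is now available locally.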
    2. Start the container.
      docker run -tid --privileged=true \
      -u 0 \
      -v /dev/shm:/dev/shm \
      --device=/dev/davinci0 \
      --device=/dev/davinci1 \
      --device=/dev/davinci2 \
      --device=/dev/davinci3 \
      --device=/dev/davinci4 \
      --device=/dev/davinci5 \
      --device=/dev/davinci6 \
      --device=/dev/davinci7 \
      --device=/dev/davinci_manager \
      --device=/dev/devmm_svm \
      --device=/dev/hisi_hdc \
      -v /usr/local/Ascend/driver:/usr/local/Ascend/driver  \
      -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
      -v /etc/hccn.conf:/etc/hccn.conf \
      bert_pretrain_mindspore:v1 \
      bash

      Parameter descriptions:

      • --privileged=true //Privileged container, which can access all devices connected to the host.
      • -u 0 //root user
      • -v /dev/shm:/dev/shm //Prevents the training task from failing due to insufficient shared memory.
      • --device=/dev/davinci0 //NPU card device
      • --device=/dev/davinci1 //NPU card device
      • --device=/dev/davinci2 //NPU card device
      • --device=/dev/davinci3 //NPU card device
      • --device=/dev/davinci4 //NPU card device
      • --device=/dev/davinci5 //NPU card device
      • --device=/dev/davinci6 //NPU card device
      • --device=/dev/davinci7 //NPU card device
      • --device=/dev/davinci_manager //Da Vinci-related management device
      • --device=/dev/devmm_svm //Management device
      • --device=/dev/hisi_hdc //Management device
      • -v /usr/local/Ascend/driver:/usr/local/Ascend/driver //NPU card driver mounting
      • -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi //npu-smi tool mounting
      • -v /etc/hccn.conf:/etc/hccn.conf //hccn.conf configuration mounting
    3. Access the container and view the card information.
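      docker ps    //(Optional, not part of the original steps) List running containers to obtain the container ID used in the next command.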
      docker exec -it xxxxxxx bash    //Access the container. Replace xxxxxxx with the container ID.
      npu-smi info    //View card information.
      Figure 13 Viewing card information
    4. Start the training task:
      cd /home/ma-user/modelarts/user-job-dir/code/bert/
      export MS_ENABLE_GE=1
      export MS_GE_TRAIN=1
      bash scripts/run_standalone_pretrain_ascend.sh 0 1 /home/ma-user/modelarts/user-job-dir/data/cn-news-128-1f-mind/
      Figure 14 Training process

      Check the card usage. Card 0 is in use, as expected.

      npu-smi info    //View card information.
      Figure 15 Viewing card information

      The training task takes about two hours to complete and then stops automatically. To stop it manually, run the following commands. To keep the task running after the SSH session disconnects, see the background-run sketch after Figure 16.

      pkill -9 python    //Stop the training process.
      ps -ef    //Verify that the training process has stopped.
      Figure 16 Stopping the training process
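
      If you want the training task to keep running after the SSH session disconnects, one option (not part of the original steps) is to start the script in the background with nohup instead of running it in the foreground. The log file name below is arbitrary, and the MS_ENABLE_GE and MS_GE_TRAIN variables must still be exported in the same shell.

      nohup bash scripts/run_standalone_pretrain_ascend.sh 0 1 /home/ma-user/modelarts/user-job-dir/data/cn-news-128-1f-mind/ > pretrain.log 2>&1 &    //Run the training script in the background and write its output to pretrain.log.
      tail -f pretrain.log    //Follow the training log.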