Configuring the Lite Cluster Environment

Updated on 2024-12-31 GMT+08:00

This section describes how to configure the Lite Cluster environment and applies to setting up an environment with accelerator cards.

Prerequisites

  • You have purchased and enabled cluster resources. For details, see Enabling Lite Cluster Resources.
  • To configure and use a cluster, you need to have a solid understanding of Kubernetes Basics, as well as basic knowledge of networks, storage, and images.

Configuration Process

Figure 1 Flowchart
Table 1 Configuration process

Step 1: Configuring the Lite Cluster Network

After purchasing a resource pool, create an elastic IP (EIP) and configure the network. Once the network is set up, you can access cluster resources through the EIP.

Step 2: Configuring kubectl

With kubectl configured, you can manage your Kubernetes clusters from the command line by running kubectl commands.

Step 3: Configuring Lite Cluster Storage

When no external storage is mounted, the available storage space is determined by dockerBaseSize and is therefore limited. To overcome this limitation, mount external storage. Storage can be mounted to a container in several ways; choose the one that meets your service needs (see the sketch after this table).

Step 4: (Optional) Configuring the Driver

Configure the corresponding driver so that the GPU/Ascend resources on nodes in a dedicated resource pool can be used properly. If no custom driver is configured and the default driver does not meet service requirements, upgrade the default driver to the required version.

Step 5: (Optional) Configuring Image Pre-provisioning

Lite Cluster resource pools support image pre-provisioning, which pulls images to the nodes in the pool in advance, accelerating image pulling during inference and large-scale distributed training.
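
As a simple illustration of step 3, when you start a test container with docker run you can mount a host directory into the container so that data written there does not count against dockerBaseSize. This is only a sketch: the host path and image name below are placeholders; use the path where your external storage (for example, SFS Turbo or an EVS disk) is actually mounted on the node.

docker run -tid -v /mnt/sfs_turbo:/data your_image:tag bash    # /mnt/sfs_turbo and your_image:tag are placeholders.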

Quick Configuration of Lite Cluster Resources

This section shows how to quickly configure Lite Cluster resources, log in to a node to view the accelerator cards, and then run a training job. Before you start, purchase the required resources. For details, see Enabling Lite Cluster Resources.

  1. Log in to a node.

    (Recommended) Method 1: Binding an EIP

    Bind an EIP to the node, then use an SSH client such as Xshell or MobaXterm to log in to the node.

    1. Log in to the CCE console.
    2. On the CCE cluster details page, click Nodes. In the Nodes tab, click the name of the target node to go to the ECS page.
      Figure 2 Node management

    3. Bind an EIP.
      Select an existing EIP, or create one if none is available.
      Figure 3 EIP

      If no EIP is available, click Buy EIP to create one.
      Figure 4 Binding an EIP
      Figure 5 Buying an EIP

      After completing the purchase, return to the ECS page and refresh the EIP list.

      Select the new EIP and click OK.
      Figure 6 Binding an EIP

    4. Log in to the node using MobaXterm or Xshell. To log in using MobaXterm, enter the EIP as the remote host.
      Figure 7 Logging in to a node
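
    If you prefer a plain command-line SSH client instead of MobaXterm or Xshell, you can also log in through the EIP directly (assuming password or key-based login as root is enabled on the node; the address below is a placeholder):

      ssh root@192.0.2.10    # Replace 192.0.2.10 with the EIP bound to the node.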

    Method 2: Using Huawei Cloud Remote Login

    1. Log in to the CCE console.
    2. On the CCE cluster details page, click Nodes. In the Nodes tab, click the name of the target node to go to the ECS page.
      Figure 8 Node management

    3. Click Remote Login. In the displayed dialog box, click Log In.
      Figure 9 Remote login

    4. After setting parameters such as the password in CloudShell, click Connect to log in to the node. For details about CloudShell, see Logging In to a Linux ECS Using CloudShell.

  2. Configure the kubectl tool.

    Log in to the ModelArts console. From the navigation pane, choose AI Dedicated Resource Pools > Elastic Clusters.

    Click the new dedicated resource pool to access its details page. Then click the associated CCE cluster to open the cluster details page.

    On the CCE cluster details page, locate Connection Information in the cluster information.
    Figure 10 Connection Information

    Use kubectl.
    • To use kubectl through the intranet, install it on a node within the same VPC as the cluster. Click Configure next to kubectl to use the kubectl tool.
      Figure 11 Using kubectl through the intranet
    • To use kubectl through an EIP, install it on any node that can access the EIP.
      To bind an EIP, click Bind next to EIP.
      Figure 12 Binding an EIP

      Select an EIP and click OK. If no EIP is available, click Create EIP to create one.

      After the binding is complete, click Configure next to kubectl and use kubectl as prompted.
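
    After configuring kubectl, you can verify the connection with standard kubectl commands, for example:

      kubectl get nodes    # The cluster nodes should be listed and Ready.
      kubectl get pods -A    # Lists pods in all namespaces.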

  3. Start a task using docker run.

    For Snt9B clusters managed by CCE, the container runtime is installed automatically. The following uses Docker as an example and is intended only for testing and verification: you can start a container for testing directly, without creating a Deployment or Volcano job. The training test case uses the BERT NLP model.

    1. Pull the image. The test image is bert_pretrain_mindspore:v1, which contains the test data and code.
      docker pull swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1
      docker tag swr.cn-southwest-2.myhuaweicloud.com/os-public-repo/bert_pretrain_mindspore:v1 bert_pretrain_mindspore:v1
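      Optionally, confirm that the image is available locally before starting the container:
      docker images | grep bert_pretrain_mindspore    # The tagged image should be listed.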
    2. Start the container.
      docker run -tid --privileged=true \
      -u 0 \
      -v /dev/shm:/dev/shm \
      --device=/dev/davinci0 \
      --device=/dev/davinci1 \
      --device=/dev/davinci2 \
      --device=/dev/davinci3 \
      --device=/dev/davinci4 \
      --device=/dev/davinci5 \
      --device=/dev/davinci6 \
      --device=/dev/davinci7 \
      --device=/dev/davinci_manager \
      --device=/dev/devmm_svm \
      --device=/dev/hisi_hdc \
      -v /usr/local/Ascend/driver:/usr/local/Ascend/driver  \
      -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
      -v /etc/hccn.conf:/etc/hccn.conf \
      bert_pretrain_mindspore:v1 \
      bash

      Parameters:

      • --privileged=true //Privileged container, which can access all devices connected to the host.
      • -u 0 //root user
      • -v /dev/shm:/dev/shm //Prevents the training task from failing due to insufficient shared memory.
      • --device=/dev/davinci0 //NPU card device
      • --device=/dev/davinci1 //NPU card device
      • --device=/dev/davinci2 //NPU card device
      • --device=/dev/davinci3 //NPU card device
      • --device=/dev/davinci4 //NPU card device
      • --device=/dev/davinci5 //NPU card device
      • --device=/dev/davinci6 //NPU card device
      • --device=/dev/davinci7 //NPU card device
      • --device=/dev/davinci_manager //Da Vinci-related management device
      • --device=/dev/devmm_svm //Management device
      • --device=/dev/hisi_hdc //Management device
      • -v /usr/local/Ascend/driver:/usr/local/Ascend/driver //NPU card driver mounting
      • -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi //npu-smi tool mounting
      • -v /etc/hccn.conf:/etc/hccn.conf //hccn.conf configuration mounting
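
      If the node has fewer cards, or you only need one for testing, map only the NPU devices that are present. A minimal single-card variant of the command above (a sketch assuming only card 0 is needed; the management devices and driver mounts must still be kept):

      docker run -tid --privileged=true \
      -u 0 \
      -v /dev/shm:/dev/shm \
      --device=/dev/davinci0 \
      --device=/dev/davinci_manager \
      --device=/dev/devmm_svm \
      --device=/dev/hisi_hdc \
      -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
      -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
      -v /etc/hccn.conf:/etc/hccn.conf \
      bert_pretrain_mindspore:v1 \
      bash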
    3. Access the container and view the card information.
      docker exec -it xxxxxxx bash    //Access the container. Replace xxxxxxx with the container ID.
      npu-smi info    //View card information.
      Figure 13 Viewing NPU information
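      If you do not know the container ID to use with docker exec, list the running containers first:
      docker ps --filter ancestor=bert_pretrain_mindspore:v1    # The container ID is in the first column.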
    4. Start the training task:
      cd /home/ma-user/modelarts/user-job-dir/code/bert/
      export MS_ENABLE_GE=1
      export MS_GE_TRAIN=1
      bash scripts/run_standalone_pretrain_ascend.sh 0 1 /home/ma-user/modelarts/user-job-dir/data/cn-news-128-1f-mind/
      Figure 14 Training process

      Check the card usage. Card 0 is in use, as expected.

      npu-smi info    //View card information.
      Figure 15 Viewing NPU information

      The training task takes about two hours to complete and then stops automatically. To stop the training task earlier, run the commands below:

      pkill -9 python    //Kill the training process.
      ps -ef    //Verify that the process has stopped.
      Figure 16 Stopping the training process
