
Lite Cluster & Server Introduction

ModelArts Lite is a cloud-native AI computing power cluster that combines hardware and software optimization. It provides an open, compatible, cost-effective, stable, and scalable platform for AI high-performance computing and other scenarios. It has been widely used in areas such as large-scale model training and inference, autonomous driving, AIGC, and content moderation.

ModelArts Lite has two forms:

  • ModelArts Lite Server offers various models of xPU bare metal servers. You access them through EIPs and install the required drivers and software on the provided OS image. SFS or OBS can be used for data storage and retrieval, meeting algorithm engineers' daily training needs. See Elastic BMS Lite Server.
  • ModelArts Lite Cluster is tailored for users focused on Kubernetes resources. It offers a managed Kubernetes cluster with mainstream AI development plug-ins and proprietary acceleration plug-ins. This setup provides AI-native resources and task capabilities in a cloud-native manner, allowing you to directly manage nodes and Kubernetes clusters within the resource pool. See Elastic Kubernetes Cluster.
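For the Cluster form, you work with the resource pool through standard Kubernetes tooling. As a minimal sketch (assuming you have installed the open-source kubernetes Python package and downloaded the kubeconfig file for your cluster), the following lists the nodes in the resource pool and their readiness, equivalent to running kubectl get nodes:

```python
from kubernetes import client, config

# Load credentials from the kubeconfig file exported for the cluster
# (defaults to ~/.kube/config; adjust if you saved it elsewhere).
config.load_kube_config()

v1 = client.CoreV1Api()

# List the nodes in the resource pool and report their Ready condition.
for node in v1.list_node().items:
    conditions = {c.type: c.status for c in node.status.conditions}
    status = "Ready" if conditions.get("Ready") == "True" else "NotReady"
    print(f"{node.metadata.name}: {status}")
```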

ModelArts Lite Cluster supports the following features:

  • Support for servers with different subscription periods in the same Ascend computing resource pool

    In the same Ascend computing resource pool, you can subscribe to resources with different types and billing cycles. This removes the following restrictions:

    • Short-term nodes could not be scaled out in a long-term resource pool.
    • Pay-per-use nodes (including AutoScaler scenarios) could not be added to a yearly/monthly resource pool.
  • Support for SFS product permission partitioning

    Enabling SFS permission partitioning provides fine-grained access control over the SFS folders mounted during training, preventing data from being accidentally deleted by users without the proper permissions.
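
    The permission partitioning itself is configured on the ModelArts side. As a rough illustration of the same idea at the Kubernetes level, the sketch below (all names, paths, and the NFS server address are hypothetical) mounts a single SFS subdirectory read-only into a training pod, so the job can read its own dataset but cannot delete anything else on the share:

    ```python
    from kubernetes import client

    # Hypothetical names and paths for illustration only; they are not
    # values defined by ModelArts.
    container = client.V1Container(
        name="trainer",
        image="training-image:latest",
        volume_mounts=[client.V1VolumeMount(
            name="sfs-volume",
            mount_path="/data",
            sub_path="team-a/dataset",  # expose only this folder of the share
            read_only=True,             # block writes and deletions
        )],
    )

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="training-job"),
        spec=client.V1PodSpec(
            containers=[container],
            volumes=[client.V1Volume(
                name="sfs-volume",
                nfs=client.V1NFSVolumeSource(server="sfs.example.com",
                                             path="/share"),
            )],
        ),
    )
    ```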

  • Support for selecting driver versions in the resource pool

    Selecting the driver version for a resource pool ensures that all nodes in the pool run the same driver version and that newly added nodes are automatically upgraded to it. This replaces the previous manual handling process and reduces O&M costs.
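
    As a quick way to verify that all nodes picked up the selected version, a sketch like the following (again assuming the kubernetes Python package; the label key accelerator/driver-version is hypothetical, not a ModelArts-defined name) prints each node's driver version:

    ```python
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # "accelerator/driver-version" is a hypothetical label key used for
    # illustration; inspect the real labels with
    # "kubectl get nodes --show-labels".
    for node in v1.list_node().items:
        labels = node.metadata.labels or {}
        version = labels.get("accelerator/driver-version", "<unset>")
        print(f"{node.metadata.name}: driver {version}")
    ```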

  • Support for enabling admission control by default on newly added cluster nodes to run real GPU/NPU detection tasks

    When the cluster is scaled out, admission control is enabled by default on the newly added nodes. It can also be disabled to improve the success rate of launching real GPU/NPU detection tasks.
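
    Admission control of this kind is commonly implemented by tainting a node until it passes verification; whether ModelArts uses a taint, and what key it would carry, are assumptions here. Assuming the kubernetes Python package, a sketch for inspecting which nodes still carry taints (and are therefore not yet accepting ordinary workloads) could look like this:

    ```python
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Print any taints on each node; a node still under admission control
    # would typically carry a taint that keeps regular workloads off it
    # until the GPU/NPU detection task has passed.
    for node in v1.list_node().items:
        for taint in (node.spec.taints or []):
            print(node.metadata.name, taint.key, taint.effect)
    ```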