Enabling Lite Cluster Resources
Process
The following figure shows the process of enabling cluster resources.
Step | Description |
---|---|
Step 1 Enabling Resource Specifications | Contact your account manager to request resource specifications in advance. They will enable the specifications within one to three working days. If you do not have an account manager, submit a service ticket. |
Step 2 Enabling Basic Permissions | Assign the necessary permissions to the target IAM user to use resource pools. |
Step 3 Creating an Agency in ModelArts | Create an agency in ModelArts to authorize access to other cloud services. If you already have an agency, update its permissions. |
Step 4 Applying for a Higher Resource Quota | Running clusters requires more resources than Huawei Cloud's default quotas provide, including more ECS instances, memory, CPU cores, and EVS disk space. Request a higher quota to meet these needs. Contact your customer manager for information on the quota solution. Increase the quota before purchasing and provisioning the resource, ensuring it exceeds the resource's requirements. |
Step 5 Buying a CCE Cluster | When buying a Lite Cluster resource pool, choose a CCE cluster. If none is available, create one on the CCE console beforehand. |
Step 6 Buying Lite Cluster Resources | Purchase Lite Cluster resources on the ModelArts console. |
Step 1 Enabling Resource Specifications
Contact your account manager to request restricted specifications (such as modelarts.bm.npu.arm.8snt9b3.d) in advance. They will enable the specifications within one to three working days. If there is no account manager, submit a service ticket.
Step 2 Enabling Basic Permissions
Log in to the administrator account and grant the target IAM account basic permissions to use resource pools.
- Log in to the IAM console.
- In the navigation pane, choose User Groups and click Create User Group in the upper right corner.
- Enter a group name and click OK.
- Click Manage User in the Operation column and add the users to whom you want to assign permissions to the user group.
- Click the name of the user group to go to the group details page.
- In the Permissions tab, click Authorize.
Figure 2 Assigning permissions
- Search for ModelArts FullAccess in the search box and select it.
Figure 3 ModelArts FullAccess
Repeat this step to select the following permissions:
- ModelArts FullAccess
- CTS Administrator
- CCE Administrator
- BMS FullAccess
- IMS FullAccess
- DEW KeypairReadOnlyAccess
- VPC FullAccess
- ECS FullAccess
- SFS Turbo FullAccess
- OBS Administrator
- AOM FullAccess
- TMS FullAccess
- BSS Administrator
- Click Next and set Scope to All resources.
- Click OK.
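The permission list above can be checked mechanically once you know which policies a user group already has. The following is a minimal sketch in plain Python, assuming you have copied the granted policy names from the IAM console by hand; it does not call any IAM API, and the function name missing_policies is hypothetical.

```python
# Policy names copied from the list in this step.
REQUIRED_POLICIES = frozenset({
    "ModelArts FullAccess",
    "CTS Administrator",
    "CCE Administrator",
    "BMS FullAccess",
    "IMS FullAccess",
    "DEW KeypairReadOnlyAccess",
    "VPC FullAccess",
    "ECS FullAccess",
    "SFS Turbo FullAccess",
    "OBS Administrator",
    "AOM FullAccess",
    "TMS FullAccess",
    "BSS Administrator",
})


def missing_policies(granted):
    """Return the required policies not present in `granted`, sorted by name."""
    return sorted(REQUIRED_POLICIES - set(granted))


# Hypothetical input: policies already attached to the user group.
granted = ["ModelArts FullAccess", "VPC FullAccess"]
for policy in missing_policies(granted):
    print("still to add:", policy)
```

An empty result means every policy named in this step is already attached to the group.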
Step 3 Creating an Agency in ModelArts
- Creating an agency
Create an agency in ModelArts to authorize access to other cloud services.
To do so, log in to the ModelArts console. In the navigation pane on the left, choose Permission Management. On the displayed page, click Add Authorization.
- Updating an agency
Update the permissions for your existing ModelArts agency.
- Log in to the ModelArts console. In the navigation pane on the left, choose Resource Management > AI Dedicated Resource Pools > Elastic Clusters. On the displayed page, check whether a message is reported, indicating that the authorization is insufficient.
Figure 4 Insufficient permission on elastic clusters
- Click Authorize access to update the agency if needed. Select Append to Existing Entitlement and click OK. The system shows the permission update is successful.
Figure 5 Adding authorization
Step 4 Applying for a Higher Resource Quota
Running AI workloads in resource pools requires more resources than Huawei Cloud's default quotas provide, including more ECS instances, memory, CPU cores, and EVS disk space. To access these extra resources, request a higher quota. Confirm the quota solution with your customer manager, then apply for a higher resource quota by following these steps.
- Log in to the Huawei Cloud console.
- Hover over Resources in the top navigation bar and choose My Quotas.
Figure 6 My Quotas
- On the Quotas page, click Increase Quota in the upper right corner and submit a service ticket.
Request the required number of ECS instances, CPU cores, RAM capacity (memory size), and EVS disk capacity. Contact your customer manager for quota details.
Figure 7 ECS resource type
Figure 8 EVS resource type
Increase the quota before purchasing and provisioning the resource, ensuring it exceeds the resource's requirements.
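Before submitting the service ticket, it can help to compare the current quotas with what the planned resource pool needs. The sketch below assumes you have read both sets of numbers from the My Quotas page and from your sizing plan. The resource keys and all figures are hypothetical, and "exceeds" is interpreted as strictly greater than the requirement.

```python
def quota_shortfalls(quota, required):
    """Return {resource: (quota, required)} for every resource whose quota
    does not strictly exceed the requirement."""
    return {
        name: (quota.get(name, 0), need)
        for name, need in required.items()
        if quota.get(name, 0) <= need
    }


# Hypothetical numbers, for illustration only.
quota = {"ecs_instances": 10, "cpu_cores": 400, "ram_gib": 1024, "evs_gib": 5000}
required = {"ecs_instances": 8, "cpu_cores": 1536, "ram_gib": 1024, "evs_gib": 4000}

for name, (have, need) in quota_shortfalls(quota, required).items():
    print(f"{name}: quota {have} does not exceed requirement {need}")
```

Any resource reported here is one to include in the quota increase request.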
Step 5 Buying a CCE Cluster
When buying a Lite Cluster resource pool, choose a CCE cluster. If none is available, follow the instructions in Buying a CCE Standard/Turbo Cluster to acquire one. For details about the required cluster version, see Software Versions Required by Different Models.
Create a Lite Cluster resource pool only when the CCE cluster is running.
- CCE clusters of versions 1.23, 1.25, and 1.28 are supported.
- If no CCE cluster is available, create one. Create CCE clusters of version 1.28 using either the console or APIs. Create CCE clusters of versions 1.23 and 1.25 using APIs only. For details about how to create CCE clusters of different versions, see Kubernetes Version Policy.
- Upgrade your CCE cluster to version 1.28 if it is running an earlier version, such as 1.23 or lower. For details, see Process and Method of Upgrading a Cluster.
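The version rules above, together with the container engine rule given for the Container Engine parameter in Step 6, can be expressed as a small lookup. This is an illustrative sketch only: the function names are hypothetical, and the version sets reflect this page at the time of writing rather than anything queried from CCE.

```python
# CCE versions this page lists as supported for Lite Cluster resource pools.
SUPPORTED_VERSIONS = {(1, 23), (1, 25), (1, 28)}
# Versions this page says can be created from the console (others: APIs only).
CONSOLE_VERSIONS = {(1, 28)}


def parse_version(text):
    """Parse 'v1.28', '1.28' or '1.28.3' into a (major, minor) tuple."""
    parts = text.strip().lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def creation_methods(version):
    """Return how a cluster of this version can be created, or [] if unsupported."""
    ver = parse_version(version)
    if ver not in SUPPORTED_VERSIONS:
        return []
    return ["console", "api"] if ver in CONSOLE_VERSIONS else ["api"]


def container_engines(version):
    """Return the container engines available for a CCE cluster version."""
    ver = parse_version(version)
    if ver < (1, 23):
        return ["docker"]
    if ver >= (1, 27):
        return ["containerd"]
    return ["containerd", "docker"]


for v in ("1.23", "1.25", "1.28"):
    print(v, creation_methods(v), container_engines(v))
```

Versions are compared as integer tuples rather than strings so that, for example, 1.9 sorts before 1.23.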
Step 6 Buying Lite Cluster Resources
- Log in to the ModelArts console. From the navigation pane, choose AI Dedicated Resource Pools > Elastic Clusters.
- On the Elastic Clusters page, click Buy Dedicated AI Cluster.
Table 2 Parameters
Parameter
Sub-Parameter
Description
Name
N/A
Enter a name.
Only lowercase letters, digits, and hyphens (-) are allowed. The value must start with a lowercase letter and cannot end with a hyphen (-).
Description
N/A
Enter a brief description of the dedicated resource pool.
Product Version
N/A
Select ModelArts Lite.
Billing Mode
N/A
Select Pay-per-use or Yearly/Monthly.
- Yearly/Monthly
Yearly/Monthly is a prepaid billing mode in which your subscription is billed based on the required duration. This mode is more cost-effective when the usage duration is predictable.
- Pay-per-use
Pay-per-use is a postpaid billing mode in which your resources are billed based on usage duration. You can create or delete your resources at any time.
CCE cluster
N/A
Choose an existing CCE cluster from the drop-down list. Click Create Cluster on the right to create a cluster if none is available. For details about the required cluster version, see Software Versions Required by Different Models.
Create a Lite Cluster resource pool only when the CCE cluster is running.
NOTE:
- CCE clusters of versions 1.23, 1.25, and 1.28 are supported.
- If no CCE cluster is available, create one. Create CCE clusters of version 1.28 using either the console or APIs. Create CCE clusters of versions 1.23 and 1.25 using APIs only. For details about how to create CCE clusters of different versions, see Kubernetes Version Policy.
- Upgrade your CCE cluster to version 1.28 if it is running an earlier version, such as 1.23 or lower. For details, see Process and Method of Upgrading a Cluster.
User-defined node name
N/A
Choose whether to enable this function to add a node name prefix.
- After a prefix is added, a node name consists of a prefix and a random number.
- The value can contain 1 to 64 characters.
- The prefix starts with a lowercase letter and only contains lowercase letters and digits. It is separated from the node name by a hyphen (-), for example, node-com.
Specification Management
N/A
You can add multiple specifications. Restrictions:
- If you select the same specification more than once, click Advanced Configuration to specify a node pool name for each entry. Only one node pool name can be left unspecified.
- The CPU architectures of all specifications must be identical, being either x86 or Arm.
- If you select multiple GPU or NPU specifications, distributed training slows down because the parameter network planes of different specifications are not connected. For distributed training, choose only one GPU or NPU specification.
- You can add up to 10 specifications to a resource pool.
Specifications
Select the required specifications. Due to system overhead, the available resources are less than those stated in the specifications. After a dedicated resource pool is created, view the available resources on the Nodes tab of the details page.
AZ
Select Automatic or Manual. An AZ is a physical region where resources use independent power supplies and networks. AZs are physically isolated but interconnected over an intranet.
- Automatic: AZs are automatically allocated.
- Manual: Specify AZs for resource pool nodes. To ensure system disaster recovery, deploy all nodes in the same AZ. You can set the number of nodes in an AZ.
Nodes
Select the number of nodes in a dedicated resource pool. More nodes mean higher computing performance.
If AZ is set to Manual, you do not need to configure Nodes.
NOTE: It is good practice to create no more than 30 nodes at a time. Otherwise, the creation may fail due to traffic limiting.
You can purchase nodes by rack for certain specifications. The total number of nodes is the number of racks multiplied by the number of nodes per rack. Purchasing a full rack allows you to isolate tasks physically, preventing communication conflicts and maintaining linear computing performance as task scale increases. All nodes in a rack must be created or deleted together.
Figure 9 Purchasing a rack of instances
Advanced Configuration
Configure the following parameters if you enable advanced configuration:
- Container Engine Space Size: The default and minimum values are both 50 GiB. The maximum value depends on the specifications and is shown in the console prompt.
- Container Engine: Container engines, one of the most important components of Kubernetes, manage the lifecycle of images and containers. kubelet interacts with a container engine through the Container Runtime Interface (CRI) to manage images and containers.
When creating a resource pool, you can choose a container engine. Alternatively, you can change the container engine on the scaling page after the resource pool is created. containerd has a shorter call chain, fewer components, and lower resource requirements, making it more stable. For details about the differences between containerd and Docker, see Container Engines.
The CCE cluster version determines the available container engines. If it is earlier than 1.23, only Docker is supported. If it is 1.27 or later, only containerd is supported. For all other versions, both containerd and Docker are options.
- Node Pool Name: You can customize the name of the new node pool. If you do not specify a name, the default name Specification-default is used. If you select the same specification more than once, only one node pool name can be left unspecified.
- Virtual Private Cloud: Specifies the VPC network where the CCE cluster is located and cannot be changed.
- Node subnet: Choose a subnet within the same VPC. New nodes will be created using this subnet.
- Associated Security Group: Specifies the security group used by nodes created in the node pool. A maximum of four security groups can be selected. Traffic needs to pass through certain ports in the node security group to ensure node communications. If no security group is associated, the cluster's default rules are applied.
- Resource Tag: Add resource tags to classify resources.
- Kubernetes Label: Add key/value pairs that are attached to Kubernetes objects, such as Pods. A maximum of 20 labels can be added. Labels can be used to distinguish nodes. With workload affinity settings, container pods can be scheduled to a specified node.
- Taint: This parameter is left blank by default. Configure anti-affinity by adding taints to nodes, with a maximum of 20 taints per node.
- Post-installation Command: Enter a script command. The script cannot contain Chinese characters and must be passed in Base64-encoded form. The encoded script cannot exceed 2,048 characters. The script runs after the Kubernetes software is installed and does not affect the installation.
NOTE:
- The name of an existing node pool in a resource pool cannot be changed.
- Do not run the reboot command in the post-installation script to restart the system immediately. To restart the system, run the shutdown -r 1 command to restart with a delay of one minute.
Custom Driver
N/A
This function is disabled by default. Some GPU and Ascend resource pools allow custom driver installation. The driver is automatically installed in the cluster by default. Enable this function only if you need to specify the driver version. Determine the required driver version and choose the matching driver when buying Lite Cluster resources.
GPU/Ascend Driver
N/A
This parameter is displayed if Custom Driver is enabled. You can select a GPU or Ascend driver. The value depends on the driver you choose.
For details about the required gpu-driver version, see Software Versions Required by Different Models.
Required Duration
N/A
Select the time length for which you want to use the resource pool. This parameter is mandatory only when the Yearly/Monthly billing mode is selected.
Login Mode
N/A
Choose a cluster login mode: Password or Key pair.
- Password: The default username is root, and you can set a password.
- Key pair: Select an existing key pair or click Create Key Pair to create one.
Advanced Configuration
N/A
You can select Configure Now to configure tag information.
ModelArts can work with Tag Management Service (TMS). When creating resource-consuming tasks in ModelArts, for example, training jobs, configure tags for these tasks so that ModelArts can use tags to manage resources by group.
For details about how to use tags, see Using TMS Tags to Manage Resources by Group.
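Several fields in Table 2 carry format constraints that can be checked before you fill in the form: the resource pool name, the user-defined node name prefix, and the post-installation command. The sketch below is one way to express those rules in Python, under stated assumptions. The function names are hypothetical, the 1 to 64 character limit is assumed to apply to the prefix itself, "Chinese characters" is approximated by the CJK Unified Ideographs range, and the script is assumed to be UTF-8 text. The console remains the authority on what is accepted.

```python
import base64
import re

# Name: lowercase letters, digits, hyphens; starts with a lowercase letter;
# does not end with a hyphen. Table 2 states no length limit, so none is enforced.
POOL_NAME_RE = re.compile(r"[a-z](?:[a-z0-9-]*[a-z0-9])?")
# Prefix: starts with a lowercase letter, lowercase letters and digits only,
# 1 to 64 characters (assumed to apply to the prefix).
NODE_PREFIX_RE = re.compile(r"[a-z][a-z0-9]{0,63}")
# Approximation of "Chinese characters": CJK Unified Ideographs.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")
MAX_ENCODED_SCRIPT_CHARS = 2048


def is_valid_pool_name(name):
    """Check the Name rule from Table 2."""
    return POOL_NAME_RE.fullmatch(name) is not None


def is_valid_node_prefix(prefix):
    """Check the user-defined node name prefix rule from Table 2."""
    return NODE_PREFIX_RE.fullmatch(prefix) is not None


def encode_post_install_script(script):
    """Base64-encode a post-installation script, enforcing the Table 2 limits."""
    if CJK_RE.search(script):
        raise ValueError("script must not contain Chinese characters")
    encoded = base64.b64encode(script.encode("utf-8")).decode("ascii")
    if len(encoded) > MAX_ENCODED_SCRIPT_CHARS:
        raise ValueError(
            f"encoded script is {len(encoded)} characters; "
            f"the limit is {MAX_ENCODED_SCRIPT_CHARS}"
        )
    return encoded


print(is_valid_pool_name("lite-pool-01"))
print(is_valid_node_prefix("node1"))
print(encode_post_install_script("#!/bin/bash\necho installed\n"))
```

Because Base64 expands every 3 bytes into 4 characters, the 2,048-character limit corresponds to a raw script of at most 1,536 bytes.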
Click Next and confirm the settings. Then, click Submit to create the dedicated resource pool.
- After a resource pool is created, its status changes to Running. Tasks can be delivered to this resource pool only when the number of available nodes is greater than 0.
Figure 10 Viewing a resource pool
- Hover over Creating to view the details about the creation process. Click View Details to go to the operation record page.
Figure 11 Creating
- You can view the task records of the resource pool by clicking Records in the upper left corner of the resource pool list.
Figure 12 Operation records
Figure 13 Viewing the resource pool status
After a resource pool is created, its status changes to Running. Click the cluster resource name to go to the resource details page. Check whether the purchased specifications are correct.
Figure 14 Viewing resource details
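When checking the purchased node count against what you requested, the arithmetic from the Nodes parameter may be useful: nodes bought by rack total the number of racks multiplied by the nodes per rack, and the page recommends creating no more than 30 nodes at a time. The sketch below illustrates both calculations. The function names are hypothetical, and splitting a large request into successive batches is an assumption about how you might follow the 30-node recommendation, not a documented procedure.

```python
def total_nodes(racks, nodes_per_rack):
    """Total node count when purchasing by rack."""
    return racks * nodes_per_rack


def creation_batches(node_count, max_per_batch=30):
    """Split node_count into batch sizes of at most max_per_batch."""
    full, rest = divmod(node_count, max_per_batch)
    return [max_per_batch] * full + ([rest] if rest else [])


# Hypothetical plan: 3 racks of 8 nodes, and a separate 70-node pool.
print(total_nodes(3, 8))
print(creation_batches(70))
```

Note that nodes in a rack must be created or deleted together, so batch boundaries for rack purchases would need to fall on whole racks.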