Creating a Standard Dedicated Resource Pool
This section describes how to create a standard dedicated resource pool.
Prerequisites
- A VPC is available.
- A subnet is available.
Step 1: Create a Network
ModelArts networks are backed by VPCs and used for interconnecting nodes in a ModelArts resource pool. You can configure only the name and CIDR block of a network. Multiple CIDR blocks are available so that you can select one that does not overlap with the CIDR block of the VPC to be accessed. A VPC provides logically isolated, configurable, and manageable virtual networks for cloud servers, cloud containers, and cloud databases, improving cloud service security and simplifying network deployment.
- Log in to the ModelArts management console. In the navigation pane on the left, choose AI Dedicated Resource Pools > Elastic Clusters.
- Click the Network tab and click Create.
- In the Create Network dialog box, set parameters.
- Network Name: customizable name
- CIDR Block: You can select Preset or Custom. Recommended CIDR blocks for a custom network: 10.0.0.0/8-24, 172.16.0.0/12-24, and 192.168.0.0/16-24. The subnet mask ranges from 8 to 28.
Figure 1 Creating a network
- Each user can create a maximum of 15 networks.
- Ensure that no IP address segment in the CIDR block overlaps with that of the VPC to be accessed. The CIDR block cannot be changed after the network is created. Possible conflicting CIDR blocks are as follows:
- Your VPC CIDR block
- Container CIDR block (fixed at 172.16.0.0/16)
- Service CIDR block (fixed at 10.247.0.0/16)
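The overlap rules above can be checked locally before you create the network. The following is a minimal sketch using Python's ipaddress module; the candidate and VPC CIDR values are examples, and only the container and service blocks come from this document:

```python
import ipaddress

# Reserved CIDR blocks that a ModelArts network must not overlap with
RESERVED = [
    ipaddress.ip_network("172.16.0.0/16"),  # container CIDR block
    ipaddress.ip_network("10.247.0.0/16"),  # service CIDR block
]

def conflicts(candidate_cidr, vpc_cidr):
    """Return the CIDR blocks that overlap with the candidate network."""
    candidate = ipaddress.ip_network(candidate_cidr)
    blocks = RESERVED + [ipaddress.ip_network(vpc_cidr)]
    return [str(b) for b in blocks if candidate.overlaps(b)]

# 192.168.0.0/16 avoids the reserved blocks and this example VPC
print(conflicts("192.168.0.0/16", "10.0.0.0/16"))   # []
# 10.247.1.0/24 falls inside the service CIDR block
print(conflicts("10.247.1.0/24", "10.0.0.0/16"))    # ['10.247.0.0/16']
```

An empty result means the candidate CIDR block is safe to use with that VPC.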
- Confirm the settings and click OK.
(Optional) Step 2: Interconnect with a VPC
VPC interconnection allows you to use resources across VPCs, improving resource utilization.
- On the Network page, click Interconnect VPC in the Operation column of the target network.
Figure 2 Interconnecting the VPC
- In the displayed dialog box, click the button on the right of Interconnect VPC, and select an available VPC and subnet from the drop-down lists.
The CIDR block of the peer network to be interconnected cannot overlap with that of the current network.
Figure 3 Parameters for interconnecting a VPC with a network
- If no VPC is available, click Create VPC on the right to create one.
- If no subnet is available, click Create Subnet on the right to create a subnet.
- A VPC can interconnect with at most 10 subnets. To add a subnet, click the plus sign (+).
- To enable a dedicated resource pool to access the public network through a VPC, create an SNAT rule in the VPC, because the public network address is unknown. After the VPC is interconnected, public addresses are not forwarded to the SNAT of your VPC by default. To add a default route, submit a service ticket and contact technical support. Alternatively, if 0.0.0.0/0 is used as the default route when you interconnect with the VPC, ModelArts adds the default route for you and no service ticket is required.
Step 3: Create a Standard Dedicated Resource Pool
- Log in to the ModelArts console. In the navigation pane on the left, choose AI Dedicated Resource Pools > Elastic Clusters.
- In the Standard Resource Pool tab, click Buy AI Dedicated Cluster. On the displayed page, configure the parameters as follows.
Table 1 AI dedicated cluster parameters
Parameter
Sub-Parameter
Description
Billing Mode
-
Select Yearly/Monthly or Pay-per-use.
- Yearly/Monthly is a prepaid billing mode in which your subscription is billed based on the required duration. This mode is more cost-effective when the usage duration is predictable.
- Pay-per-use is a postpaid billing mode. You are charged based on how long you use the resources, which you can purchase or delete at any time.
Cluster Specifications
Cluster Name
Enter a name.
Only lowercase letters, digits, and hyphens (-) are allowed. The value must start with a lowercase letter and cannot end with a hyphen (-).
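The naming rule above can be expressed as a regular expression. The following sketch is an assumption derived from the stated rule, not an official API; the console performs its own validation:

```python
import re

# Lowercase letters, digits, and hyphens; must start with a lowercase
# letter and must not end with a hyphen (per the stated naming rule)
NAME_PATTERN = re.compile(r"^[a-z]([a-z0-9-]*[a-z0-9])?$")

def is_valid_cluster_name(name):
    """Check a candidate cluster name against the documented rule."""
    return NAME_PATTERN.fullmatch(name) is not None

print(is_valid_cluster_name("pool-01"))   # True
print(is_valid_cluster_name("Pool-01"))   # False: uppercase letter
print(is_valid_cluster_name("pool-"))     # False: ends with a hyphen
```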
Product Version
Select ModelArts Standard (Standard) in the ModelArts Standard scenario.
ModelArts Lite Elastic Cluster (native API) is used in the ModelArts Lite Cluster scenario. This parameter is displayed only in CN Southwest-Guiyang1.
Resource Pool Type
You can select Physical or Logical. If no logical specifications are available, Logical is not displayed.
Physical resource pools do not support elastic resources. They feature physical isolation, dedicated networks, and guaranteed network connectivity.
Logical resource pools support elastic resources. They feature faster creation and scaling.
Job Type
Choose DevEnviron, Training Job, or Inference Service as needed.
Advanced Configuration
- For cluster specifications, retain the default settings or customize the specifications. When customizing the specifications, you can set the cluster scale and enable HA for controller nodes.
- Configure the cluster scale based on the service scenario. The scale refers to the maximum number of instances that can be managed by a resource pool.
- Once HA is enabled for controller nodes, the system creates three control plane nodes for your cluster to ensure reliability. If there are 1,000 or 2,000 nodes in the cluster, HA must be enabled. If HA is disabled, only one control plane node will be created for your cluster. After a resource pool is created, the HA status of controller nodes cannot be changed.
- Master node distribution: You can select random allocation or specify an AZ. Distribute controller nodes in different AZs for disaster recovery.
- Random allocation: The system randomly allocates controller nodes to AZs to improve disaster recovery capabilities. If the number of available AZs is less than the number of nodes to be created, the nodes will be created in the AZs with sufficient resources to preferentially ensure cluster creation. In this case, AZ-level DR may not be ensured.
- You can also specify an AZ for the controller nodes.
Network
ModelArts network
Specifies the network where the resource pool runs. Resource pool nodes can communicate with other cloud service resource instances on the same network. This parameter needs to be set only for physical resource pools.
Select a network from the drop-down list box. If no network is available, click Create on the right to create one. For details about how to create a network, see Step 1: Create a Network.
IPv6 Network
Whether to enable IPv6 networks. If enabled, you must enable IPv6 for the network bound to the resource pool. Once enabled, this function cannot be disabled. For details about how to enable an IPv6 network, see Step 1: Create a Network.
Default Setting
CPU Architecture
The CPU architecture refers to the instruction set and design specifications of the CPU. x86 and Arm64 are supported. Set this parameter as required.
Instance Specifications Type
Choose CPU, GPU, or Ascend processors as needed.
Specifications
Select the required specifications from the drop-down list. Because the system reserves some resources, the available resources are fewer than specified. After a dedicated resource pool is created, view the available resources on the Nodes tab of the details page.
Contact your account manager to request resource specifications (such as Ascend) in advance. They will enable the specifications within one to three working days. If there is no account manager, submit a service ticket.
AZ
You can select Automatically allocated or Specifies AZ. An AZ is a physical region where resources use independent power supplies and networks. AZs are physically isolated but interconnected over an intranet.
- Automatically allocated: AZs are automatically allocated.
- Specifies AZ: Specify AZs for resource pool nodes and set the number of instances in each AZ. To improve disaster recovery, deploy nodes across different AZs.
Instances
Select the number of instances in a dedicated resource pool. More instances mean higher computing performance.
If AZ is set to Specifies AZ, you do not need to configure Instances, because you set the number of instances per AZ.
NOTE:It is a good practice to create no more than 30 instances at a time. Otherwise, the creation may fail due to traffic limiting.
For certain specifications, you can purchase instances by rack. The number of instances you purchase equals the number of racks multiplied by the number of nodes per rack (for example, 6). Purchasing a full rack physically isolates your tasks, preventing communication conflicts and maintaining linear computing performance as the task scale increases. All instances in a rack must be created or deleted together.
Advanced Node Configuration
After Advanced Node Settings is enabled, you can set the operating system of the instance.
Storage
Some flavors support the Storage Configuration switch, which is disabled by default.
System Disk
After enabling Storage Configuration, you can view the default system disk type, size, and quantity of each instance.
Some specifications do not contain system disks. You can set the type and size of system disks when creating a dedicated resource pool.
Container Disk
After Storage Configuration is enabled, you can view the type, size, and quantity of container disks of each instance. The container disk type can only be local disk or EVS disk and cannot be changed.
Some specifications do not contain container disks. You can set the type and size of container disks when creating a dedicated resource pool. Only EVS disks, including common I/O, high I/O, and ultra-high I/O disks, are supported.
Add Container Data Disk
For some specifications, you can mount additional container disks to each instance in the dedicated resource pool. To do so, click the plus sign (+) before Add Container Disk. The attached disks are EVS disks, which will be charged independently.
You can set the type, size, and number of disks to be mounted. The actual values are displayed on the console.
Container Disk Advanced Configuration - Disk Space
Container space: The data disk space is divided into two parts by default. One part stores the Docker/containerd working directories, container image data, and image metadata. The other is reserved for kubelet and emptyDir volumes. Use the Specify Disk Space parameter to set the ratio of the two partitions. The available container engine space affects image pulls and container startup and running.
If the container disk is a local disk, Specify Disk Space is not supported.
Container Disk Advanced Configuration - Container Engine Space Size
This parameter specifies the size of the space allocated to the pod container. Only integers are supported. The default and minimum values are 50 GiB. The maximum value depends on the specifications, and can be found in the console prompt. Customizing the container engine space does not increase costs.
By specifying this parameter, you can limit the disk size used by a single pod job.
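The sizing rules above can be illustrated with a small calculation. This sketch assumes that whatever remains of the container disk after the engine space is the reserved partition; the function name and exact split behavior are illustrative, not an official API:

```python
def split_container_disk(disk_gib, engine_space_gib):
    """Split a container data disk into engine space and the reserved remainder.

    Per the documented rule, the engine space must be an integer of at
    least 50 GiB (the default and minimum) and cannot exceed the disk size.
    """
    if not isinstance(engine_space_gib, int):
        raise ValueError("only integers are supported")
    if engine_space_gib < 50:
        raise ValueError("minimum container engine space is 50 GiB")
    if engine_space_gib > disk_gib:
        raise ValueError("engine space exceeds the container disk size")
    return {"engine_gib": engine_space_gib,
            "reserved_gib": disk_gib - engine_space_gib}

print(split_container_disk(200, 150))  # {'engine_gib': 150, 'reserved_gib': 50}
```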
Container Disk Advanced Settings - Write Mode
Some flavors allow you to set the disk write mode, which can be Linear or Stripe.
- Linear: A linear logical volume integrates one or more physical volumes. Data is written to the next physical volume when the previous one is used up.
- Striped: A striped logical volume stripes data into blocks of the same size and stores them in multiple physical volumes in sequence. This allows data to be concurrently read and written. A storage pool consisting of striped volumes cannot be scaled out.
Creating a Flavor
Add multiple specifications as needed. Restrictions:
- Each flavor must be unique.
- The CPU architectures of multiple specifications must be the same, which can be either x86 or Arm.
- When you select multiple GPU or NPU specifications, distributed training speed is impacted because the parameter network planes of different specifications are not connected. For distributed training, you are advised to choose only one GPU or NPU specification.
- You can add up to 10 specifications to a resource pool.
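The restrictions above can be checked programmatically. The following is a minimal sketch in which flavors are represented as (name, architecture) tuples; this data shape is illustrative, not an official API:

```python
def validate_flavors(flavors):
    """Validate a list of (name, arch) flavor tuples against the pool rules."""
    names = [name for name, _ in flavors]
    if len(names) != len(set(names)):
        raise ValueError("each flavor must be unique")
    if len({arch for _, arch in flavors}) > 1:
        raise ValueError("all flavors must share one CPU architecture")
    if len(flavors) > 10:
        raise ValueError("a resource pool supports at most 10 flavors")
    return True

print(validate_flavors([("gpu-a", "x86"), ("gpu-b", "x86")]))  # True
```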
Resource scheduling and allocation
Custom Driver
Disabled by default. Some GPU and Ascend resource pools allow custom driver installation. The driver is automatically installed in the cluster by default. Enable this function if you need to specify the driver version.
GPU/Ascend Driver
This parameter is displayed if Custom Driver is enabled. You can select a GPU or Ascend driver. The value depends on the driver you choose.
Enabling HA redundancy
-
- Enable HA redundancy: Whether to enable HA redundancy for the resource pool. By default, HA redundancy is enabled for supernodes.
- Redundant node distribution policy: indicates the distribution policy of redundant nodes. Supernodes support only step-based even distribution. The same number of redundant nodes are reserved in each supernode.
- Number of redundant instances: the number of HA redundant instances set for this flavor. The redundancy coefficient is the number of redundant nodes reserved in each supernode when the redundant node distribution policy is step-based even distribution.
NOTE:Currently, only the Snt9C flavor supports this function.
Advanced Options
(Optional) Cluster Description
Enter the cluster description for easy query.
Tags
Click Add Tag to configure tags for the standard resource pool so that resources can be managed by tag. The tag information can be predefined in Tag Management Service (TMS) or customized. You can also set tag information in the Tags tab of the details page after the standard resource pool is created.
NOTE:Predefined TMS tags are available to all service resources that support tags. Customized tags are available only to the service resources of the user who has created the tags.
CIDR Block
You can select Default or Custom.
- Default: The system randomly allocates an available CIDR block to you, which cannot be modified after the resource pool is created. For commercial use, customize your CIDR block.
- Custom: You need to customize Kubernetes container and Kubernetes service CIDR blocks.
- K8S Container Network: CIDR block used by containers in the cluster; it determines the maximum number of containers in the cluster. The value cannot be changed after the resource pool is created.
- Kubernetes Service CIDR Block: CIDR block for services used by containers in the same cluster to access each other. The value determines the maximum number of Services you can create. The value cannot be changed after the resource pool is created.
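The relationship between CIDR block size and the number of containers or Services the cluster can hold can be estimated with Python's ipaddress module. This is a rough upper bound; actual limits also depend on cluster overhead, and the example CIDR values are illustrative:

```python
import ipaddress

def address_capacity(cidr):
    """Total addresses in a CIDR block: an upper bound on containers/Services."""
    return ipaddress.ip_network(cidr).num_addresses

print(address_capacity("172.16.0.0/16"))  # 65536
print(address_capacity("10.247.0.0/18"))  # 16384
```

A smaller prefix length (for example, /16 instead of /18) gives the cluster room for more containers or Services, which matters because neither block can be changed after creation.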
Required Duration
-
Select the time length for which you want to use the resource pool. This parameter is mandatory only when the Yearly/Monthly billing mode is selected.
Auto-renewal
Specifies whether to enable auto-renewal. This parameter is mandatory only when the Yearly/Monthly billing mode is selected.
- Monthly subscriptions renew each month.
- Yearly subscriptions renew each year.
- Click Buy Now.
- After a resource pool is created, its status changes to Running. Tasks can be delivered to the resource pool only when the number of available nodes is greater than 0.
- Hover over Creating to view the details of the creation process. Click View Details to go to the operation records page.
- You can view the task records of the resource pool by clicking Records in the upper left corner of the resource pool list.
FAQs
What if I choose a flavor for a dedicated resource pool, but get an error message saying no resource is available?
The flavors of dedicated resources change based on real-time availability. Sometimes, you might choose a flavor on the purchase page, but it is sold out before you pay and create the resource pool. This causes the resource pool creation to fail.
You can try a different flavor on the creation page and create the resource pool again.
Why can't I use all the CPU resources on a node in a resource pool?
Resource pool nodes run an operating system and management plug-ins, which occupy some CPU resources. For example, if a node has 8 vCPUs but some of them are used by system components, the available resources will be fewer than 8 vCPUs.
You can check the available CPU resources by clicking the Nodes tab on the resource pool details page, before you start a task.