
Scenario

Training data volumes and compute requirements vary across AI models. Selecting a proper storage and training solution improves training efficiency and resource cost-effectiveness. ModelArts Standard supports multiple training scenarios to meet different requirements, including single-node single-card, single-node multi-card, and multi-node multi-card.

ModelArts provides public resource pools and dedicated resource pools. Resources in a dedicated resource pool are not shared with other users, which ensures high efficiency. Enterprises with multiple users are advised to use a dedicated resource pool for AI model training.

This section provides end-to-end guidance to help you select a proper training solution and perform model training on ModelArts Standard.

The recommended solutions for different data volumes and compute requirements are as follows:

  • Single-node single-card: For a small data volume (about 1 GB of training data) and low compute (one Vnt1 card), store both data and code in an OBS parallel file system (see the data copy sketch after this list).
  • Single-node multi-card: For a medium data volume (about 50 GB of training data) and medium compute (one node with eight Vnt1 cards), store both data and code in SFS.
  • Multi-node multi-card: For a large data volume (about 1 TB of training data) and high compute (four nodes with eight Vnt1 cards each), store data in SFS, store code in a common OBS bucket, and use distributed training.
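
In the single-node single-card case, data stored in an OBS parallel file system is typically copied to the node's local disk before training starts. The following is a minimal sketch using the MoXing library preinstalled in ModelArts training environments; the bucket and paths are hypothetical placeholders, not values from this document:

```python
# Minimal sketch: stage training data from OBS to local disk before training.
# Assumes a ModelArts environment where the MoXing library is preinstalled;
# the OBS bucket name and paths below are hypothetical placeholders.
import moxing as mox

OBS_DATA = "obs://my-bucket/imagenet-subset/"   # hypothetical OBS path
LOCAL_DATA = "/cache/imagenet-subset/"          # local cache on the training node

# copy_parallel recursively copies a directory between OBS and local storage.
mox.file.copy_parallel(OBS_DATA, LOCAL_DATA)
```

Copying to local disk once at startup avoids repeated remote reads during each epoch.
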
Table 1 Services required in different scenarios and purchase recommendations

| Scenario | OBS | SFS | SWR | DEW | ModelArts | VPC | ECS | EVS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Single-node single-card | Pay-per-use (parallel file system) | × | Free | Free | Monthly | Free | × | Pay-per-use |
| Single-node multi-card | × | Monthly (HPC 500 GB) | Free | Free | Monthly | Free | Monthly (Ubuntu 18.04, at least 2 vCPUs and 8 GB memory, 100 GB local storage, dynamic BGP with EIP, and 10 Mbit/s bandwidth) | × |
| Multi-node multi-card | Pay-per-use (common OBS bucket) | Monthly (HPC 500 GB) | Free | Free | Monthly | Free | Monthly (Ubuntu 18.04, at least 2 vCPUs and 8 GB memory, 100 GB local storage, dynamic BGP with EIP, and 10 Mbit/s bandwidth) | × |

Table 2 Training performance of different open-source datasets

| Algorithm and Data | Resource Flavor | Number of Epochs | Estimated Running Duration (hh:mm:ss) |
| --- | --- | --- | --- |
| Algorithm: PyTorch official example for ImageNet; Data: ImageNet classification data subset | One node with one Vnt1 card | 10 | 00:05:03 |
| Algorithm: YOLOX; Data: COCO 2017 dataset | One node with one Vnt1 card | 10 | 03:33:13 |
| | One node with eight Vnt1 cards | 10 | 01:11:48 |
| | Four nodes with eight Vnt1 cards | 10 | 00:36:17 |
| Algorithm: Swin-Transformer; Data: ImageNet21K | One node with one Vnt1 card | 10 | 197:25:03 |
| | One node with eight Vnt1 cards | 10 | 26:10:25 |
| | Four nodes with eight Vnt1 cards | 10 | 07:08:44 |
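
The multi-card and multi-node rows in Table 2 rely on distributed data-parallel training. The following is a minimal sketch of that setup, assuming a PyTorch DDP job launched with torchrun (for example, torchrun --nnodes=4 --nproc_per_node=8 train.py, which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables); the wrapper function is a hypothetical illustration, not part of the algorithms listed above:

```python
# Minimal sketch of distributed data-parallel setup, assuming a torchrun launch.
# setup_model is a hypothetical helper; adapt it to your training script.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_model(model: torch.nn.Module) -> torch.nn.Module:
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if world_size == 1:
        # Single-node single-card: no process group needed.
        return model.cuda()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # NCCL backend for GPU collectives; rendezvous details come from torchrun.
    dist.init_process_group(backend="nccl")
    return DDP(model.cuda(), device_ids=[local_rank])
```

With this pattern, the same script covers one card, eight cards on one node, and four nodes with eight cards each; only the launch command changes.
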

Table 3 Average response time for different training operations

| Operation | Description | Estimated Duration |
| --- | --- | --- |
| Downloading an image | Time to download an image (25 GB) for the first time | 8 minutes |
| Scheduling resources | Duration from the time a training job starts to be created to the time it becomes running (resources are sufficient and the image is cached) | 20 seconds |
| Accessing the training job list page | Time to access the training job list page with 50 records on it | 6 seconds |
| Loading logs | Time to load 1 MB of logs on the training details page | 2.5 seconds |
| Accessing the training details page | Time to access the training details page when there are no logs | 2.5 seconds |
| Accessing the JupyterLab page | Time to access the JupyterLab page and load its content | 0.5 seconds |
| Accessing the notebook list page | Time to access the notebook list page with 50 instances on it | 4.5 seconds |

The preceding data is for reference only. The actual image download time depends on the node specifications, the disk type (high I/O or common I/O), and whether an SSD is used.