Scenario
AI models differ in the data volume and compute they require for training. Selecting a suitable storage and training solution improves both training efficiency and resource cost-effectiveness. ModelArts Standard supports multiple training scenarios to meet different requirements, including single-node single-card, single-node multi-card, and multi-node multi-card.
ModelArts offers public resource pools and dedicated resource pools. Resources in a dedicated resource pool are not shared with other users, which ensures stable, efficient training. For enterprises with multiple users, a dedicated resource pool is recommended for AI model training.
This section provides end-to-end guidance on choosing a suitable training solution and performing model training on ModelArts Standard.
For different data volumes and compute requirements, the recommended solutions are as follows:
- Single-node single-card: For a small data volume (about 1 GB of training data) and low compute (one Vnt1 card), use an OBS parallel file system to store both data and code.
- Single-node multi-card: For a medium data volume (about 50 GB of training data) and medium compute (eight Vnt1 cards), use SFS to store both data and code.
- Multi-node multi-card: For a large data volume (about 1 TB of training data) and high compute (four nodes with eight Vnt1 cards each), use SFS to store data and a common OBS bucket to store code, and run distributed training.
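The mapping above can be captured in a small helper that picks a storage layout from the job shape. This is an illustrative sketch only; the function name and return strings are assumptions, not part of any ModelArts SDK.

```python
def pick_training_setup(nodes: int, cards_per_node: int) -> str:
    """Map a job shape to the storage layout recommended above.

    Illustrative helper only -- not a ModelArts API.
    """
    if nodes > 1:
        # Multi-node multi-card: SFS for data, common OBS bucket for code,
        # with distributed training.
        return "SFS (data) + common OBS bucket (code), distributed training"
    if cards_per_node > 1:
        # Single-node multi-card: SFS for both data and code.
        return "SFS (data and code)"
    # Single-node single-card: OBS parallel file system for data and code.
    return "OBS parallel file system (data and code)"
```

For example, `pick_training_setup(4, 8)` returns the multi-node layout, while `pick_training_setup(1, 1)` returns the OBS parallel file system layout.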
The billing modes of the services used in each scenario are as follows:

| Scenario | OBS | SFS | SWR | DEW | ModelArts | VPC | ECS | EVS |
|---|---|---|---|---|---|---|---|---|
| Single-node single-card | Pay-per-use (parallel file system) | × | Free | Free | Monthly | Free | × | Pay-per-use |
| Single-node multi-card | × | Monthly (HPC 500 GB) | Free | Free | Monthly | Free | Monthly (Ubuntu 18.04, at least 2 vCPUs and 8 GB memory, 100 GB local storage, dynamic BGP with EIP, and 10 Mbit/s bandwidth) | × |
| Multi-node multi-card | Pay-per-use (common OBS bucket) | Monthly (HPC 500 GB) | Free | Free | Monthly | Free | Monthly (Ubuntu 18.04, at least 2 vCPUs and 8 GB memory, 100 GB local storage, dynamic BGP with EIP, and 10 Mbit/s bandwidth) | × |
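In the multi-node multi-card scenario, distributed launchers typically derive each training process's global rank and the total world size from the node count and cards per node. The arithmetic can be sketched as follows (function names are illustrative; the 4-node, 8-card shape matches the flavor above):

```python
def world_size(nodes: int, cards_per_node: int) -> int:
    # Total number of training processes (one per card).
    return nodes * cards_per_node

def global_rank(node_rank: int, local_rank: int, cards_per_node: int) -> int:
    # Common launcher convention: node_rank * cards_per_node + local_rank.
    return node_rank * cards_per_node + local_rank

# Four nodes with eight Vnt1 cards each -> 32 processes, ranks 0..31;
# the last card on the last node gets the highest rank.
print(world_size(4, 8))        # 32
print(global_rank(3, 7, 8))    # 31
```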
The following table lists reference training durations for different algorithms and resource flavors:

| Algorithm and Data | Resource Flavor | Number of Epochs | Estimated Running Duration (hh:mm:ss) |
|---|---|---|---|
| Algorithm: PyTorch official example for ImageNet; Data: ImageNet classification data subset | One node with one Vnt1 card | 10 | 0:05:03 |
| Algorithm: YOLOX; Data: COCO 2017 dataset | One node with one Vnt1 card | 10 | 03:33:13 |
| | One node with eight Vnt1 cards | 10 | 01:11:48 |
| | Four nodes with eight Vnt1 cards each | 10 | 0:36:17 |
| Algorithm: Swin-Transformer; Data: ImageNet21K | One node with one Vnt1 card | 10 | 197:25:03 |
| | One node with eight Vnt1 cards | 10 | 26:10:25 |
| | Four nodes with eight Vnt1 cards each | 10 | 07:08:44 |
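The durations above can be converted to seconds to estimate scaling speedup. For example, YOLOX on COCO 2017 drops from 03:33:13 on one card to 01:11:48 on eight cards, roughly a 3x speedup; sub-linear scaling like this is typical of distributed training overheads. A small sketch of the arithmetic:

```python
def to_seconds(hms: str) -> int:
    """Parse an h:mm:ss or hh:mm:ss duration into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def speedup(baseline: str, scaled: str) -> float:
    return to_seconds(baseline) / to_seconds(scaled)

# YOLOX on COCO 2017 (values from the table above)
print(round(speedup("03:33:13", "01:11:48"), 2))  # one card vs. eight cards -> 2.97
print(round(speedup("03:33:13", "0:36:17"), 2))   # one card vs. 4 nodes x 8 cards -> 5.88
```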
The following table lists reference durations for common operations:

| Operation | Description | Estimated Duration |
|---|---|---|
| Downloading an image | Time to download a 25 GB image for the first time | 8 minutes |
| Scheduling resources | Duration from when a training job starts to be created to when it becomes running (resources are sufficient and the image is cached) | 20 seconds |
| Accessing the training job list page | Time to load the training job list page with 50 records | 6 seconds |
| Loading logs | Time to load 1 MB of logs on the training details page | 2.5 seconds |
| Accessing the training details page | Time to load the training details page when there are no logs | 2.5 seconds |
| Accessing the JupyterLab page | Time to access the JupyterLab page and load its content | 0.5 seconds |
| Accessing the notebook list page | Time to load the notebook list page with 50 instances | 4.5 seconds |

The preceding data is for reference only. The actual image download time depends on the node specifications, the disk type (high I/O or common I/O), and whether an SSD is used.