
Scenario

Training data volumes and compute requirements vary across AI models. Selecting a proper storage and training solution improves training efficiency and resource cost-effectiveness. ModelArts Standard supports multiple training scenarios to meet different requirements, including single-node single-card, single-node multi-card, and multi-node multi-card.

ModelArts provides public resource pools and dedicated resource pools. Resources in a dedicated resource pool are not shared with other users, which ensures high efficiency. Enterprises with multiple users are advised to use a dedicated resource pool for AI model training.

This section provides end-to-end guidance to help you select a proper training solution and perform model training on ModelArts Standard.

The recommended solutions for different data volumes and compute requirements are as follows:

  • Single-node single-card: For a small data volume (about 1 GB of training data) and low compute (one Vnt1 card), store both data and code in an OBS parallel file system (see the data copy sketch after this list).
  • Single-node multi-card: For a medium data volume (about 50 GB of training data) and medium compute (one node with eight Vnt1 cards), store both data and code in SFS.
  • Multi-node multi-card: For a large data volume (about 1 TB of training data) and high compute (four nodes with eight Vnt1 cards each), store data in SFS, store code in a common OBS bucket, and use distributed training.
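
In the single-node single-card case, data stored in an OBS parallel file system is typically copied to the node's local disk before training starts. The following is a minimal sketch using the MoXing library preinstalled in ModelArts training environments; the bucket and paths are hypothetical placeholders, not values from this document:

```python
# Minimal sketch: stage training data from OBS to local disk before training.
# Assumes a ModelArts environment where the MoXing library is preinstalled;
# the OBS bucket name and paths below are hypothetical placeholders.
import moxing as mox

OBS_DATA = "obs://my-bucket/imagenet-subset/"   # hypothetical OBS path
LOCAL_DATA = "/cache/imagenet-subset/"          # local cache on the training node

# copy_parallel recursively copies a directory between OBS and local storage.
mox.file.copy_parallel(OBS_DATA, LOCAL_DATA)
```

Copying to local disk once at startup avoids repeated remote reads during each epoch.
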
Table 1 Services required in different scenarios and purchase recommendations

| Scenario | OBS | SFS | SWR | DEW | ModelArts | VPC | ECS | EVS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Single-node single-card | Pay-per-use (parallel file system) | × | Free | Free | Monthly | Free | × | Pay-per-use |
| Single-node multi-card | × | Monthly (HPC 500 GB) | Free | Free | Monthly | Free | Monthly (Ubuntu 18.04, at least 2 vCPUs and 8 GB memory, 100 GB local storage, dynamic BGP with EIP, and 10 Mbit/s bandwidth) | × |
| Multi-node multi-card | Pay-per-use (common OBS bucket) | Monthly (HPC 500 GB) | Free | Free | Monthly | Free | Monthly (Ubuntu 18.04, at least 2 vCPUs and 8 GB memory, 100 GB local storage, dynamic BGP with EIP, and 10 Mbit/s bandwidth) | × |

Table 2 Training performance of different open-source datasets

| Algorithm and Data | Resource Flavor | Number of Epochs | Estimated Running Duration (hh:mm:ss) |
| --- | --- | --- | --- |
| Algorithm: PyTorch official example for ImageNet; Data: ImageNet classification data subset | One node with one Vnt1 card | 10 | 00:05:03 |
| Algorithm: YOLOX; Data: COCO 2017 dataset | One node with one Vnt1 card | 10 | 03:33:13 |
| | One node with eight Vnt1 cards | 10 | 01:11:48 |
| | Four nodes with eight Vnt1 cards | 10 | 00:36:17 |
| Algorithm: Swin-Transformer; Data: ImageNet21K | One node with one Vnt1 card | 10 | 197:25:03 |
| | One node with eight Vnt1 cards | 10 | 26:10:25 |
| | Four nodes with eight Vnt1 cards | 10 | 07:08:44 |
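
The multi-card and multi-node rows in Table 2 rely on distributed data-parallel training. The following is a minimal sketch of that setup, assuming a PyTorch DDP job launched with torchrun (for example, torchrun --nnodes=4 --nproc_per_node=8 train.py, which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables); the wrapper function is a hypothetical illustration, not part of the algorithms listed above:

```python
# Minimal sketch of distributed data-parallel setup, assuming a torchrun launch.
# setup_model is a hypothetical helper; adapt it to your training script.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_model(model: torch.nn.Module) -> torch.nn.Module:
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if world_size == 1:
        # Single-node single-card: no process group needed.
        return model.cuda()
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # NCCL backend for GPU collectives; rendezvous details come from torchrun.
    dist.init_process_group(backend="nccl")
    return DDP(model.cuda(), device_ids=[local_rank])
```

With this pattern, the same script covers one card, eight cards on one node, and four nodes with eight cards each; only the launch command changes.
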

Table 3 Average response time for different training operations

| Operation | Description | Estimated Duration |
| --- | --- | --- |
| Downloading an image | Time to download an image (25 GB) for the first time | 8 minutes |
| Scheduling resources | Duration from the time a training job starts to be created to the time it becomes running (resources are sufficient and the image is cached) | 20 seconds |
| Accessing the training job list page | Time to access the training job list page with 50 records on it | 6 seconds |
| Loading logs | Time to load 1 MB of logs on the training details page | 2.5 seconds |
| Accessing the training details page | Time to access the training details page when there are no logs | 2.5 seconds |
| Accessing the JupyterLab page | Time to access the JupyterLab page and load its content | 0.5 seconds |
| Accessing the notebook list page | Time to access the notebook list page with 50 instances on it | 4.5 seconds |

The preceding data is for reference only. The actual image download time depends on the node specifications, the disk type (high I/O or common I/O), and whether an SSD is used.