Updated on 2025-06-12 GMT+08:00

Overview

Distributed Training

Distributed training accelerates deep learning by running a training task simultaneously on multiple compute nodes, such as servers or GPUs. The task is split across the nodes, each node processes its share of the data or model, and the nodes exchange intermediate results over a communication backend so the full model is trained consistently. This greatly shortens training time, particularly for complex models and large datasets, and also makes it feasible to handle bigger datasets.

ModelArts enables distributed training by automatically managing node communications and resources for effective parallel processing.

ModelArts provides the following capabilities:

  • A wide range of built-in images to suit different training requirements
  • Custom development environments set up using built-in images
  • Extensive tutorials to help you quickly get started with distributed training
  • Distributed training debugging in development tools such as PyCharm, VS Code, and JupyterLab

ModelArts supports the following two approaches:

  1. Single-node multi-PU data parallelism (DP): Multiple GPUs work together on one server to speed up training using data parallelism. This approach maximizes the use of all available GPU resources on a single server.
  2. Multi-node distributed data parallelism (DDP): Multiple servers work together, each using several GPUs, to scale training capacity. This approach suits large datasets and complex models.

Constraints

  • If the notebook instance flavor is changed, you can only perform single-node debugging. You cannot perform distributed debugging or submit remote training jobs.
  • Only the PyTorch and MindSpore AI frameworks can be used for multi-node distributed debugging. If you want to use MindSpore, each node must be equipped with eight PUs.
  • Replace the OBS paths in the debugging code with your own OBS paths.
  • The debugging code in this document is written in PyTorch. The process is the same for other AI frameworks; only some parameters need to be modified.

Billing

Model training in ModelArts uses compute and storage resources, which are billed. Compute resources are billed for running training jobs. Storage resources are billed for storing data in OBS or SFS. For details, see Model Training Billing Items.

Advantages and Disadvantages of Single-Node Multi-PU Training Using DataParallel

  • Advantage – straightforward coding: Only one line of code needs to be modified.
  • Disadvantage – communication bottleneck: The master GPU updates and distributes the parameters, which incurs high communication overhead.
  • Disadvantage – unbalanced GPU load: The master GPU gathers outputs, computes the loss, and updates weights, so its memory usage and utilization are higher than those of the other GPUs.
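The one-line change mentioned above can be sketched as follows. This is a minimal illustration with a toy model and made-up tensor shapes, not a complete training script:

```python
import torch
import torch.nn as nn

# A toy model; any nn.Module works the same way.
model = nn.Linear(16, 4)

# The one-line change: wrap the model in DataParallel.
# On a multi-GPU server, each input batch is split across the GPUs and the
# results are gathered on the master GPU; on a CPU-only machine the wrapper
# simply falls back to the wrapped module.
model = nn.DataParallel(model)

inputs = torch.randn(8, 16)   # batch of 8 samples
outputs = model(inputs)
print(outputs.shape)          # torch.Size([8, 4])
```

Everything else in the script (loss computation, optimizer, training loop) stays unchanged, which is why DP is the quickest way to use all GPUs on one server.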

Advantages of Multi-Node Multi-PU Training Using DistributedDataParallel

  • Fast communication: Gradients are synchronized with an all-reduce operation instead of being gathered on a master GPU.
  • Balanced load: Every GPU performs the same amount of computation; no single GPU is a bottleneck.
  • Fast running speed
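The DDP setup can be sketched as below. For simplicity this sketch runs a single CPU process (rank 0 of a world of size 1) with the gloo backend; in a real multi-node job, one process is launched per GPU and the rank, world size, and master address come from the launcher (for example, torchrun) rather than being hard-coded:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process sketch: a real job sets these via the launcher/environment.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = nn.Linear(16, 4)   # toy model; illustrative only

# Each process holds a full model replica; gradients are all-reduced
# across ranks during backward, so no master GPU is needed.
ddp_model = DDP(model)

outputs = ddp_model(torch.randn(8, 16))
outputs.sum().backward()   # gradient synchronization happens here
print(outputs.shape)       # torch.Size([8, 4])

dist.destroy_process_group()
```

Because every rank does the same forward and backward work and only exchanges gradients, both the communication cost and the per-GPU load stay balanced as more nodes are added.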

Related Chapters