
Distributed Training Functions

ModelArts provides the following capabilities:

  • A wide range of built-in images to cover common training requirements
  • Custom development environments set up using built-in images
  • Extensive tutorials to help you quickly get started with distributed training
  • Distributed training debugging in development tools such as PyCharm, VS Code, and JupyterLab

Constraints

  • If the instance flavor is changed, only single-node debugging can be performed. Distributed debugging and remote training job submission are not supported.
  • Only the PyTorch and MindSpore AI frameworks can be used for multi-node distributed debugging. If you want to use MindSpore, each node must be equipped with eight cards.
  • Replace the OBS paths in the debugging code with your own OBS paths.
  • The debugging code in this document is written with PyTorch. The process is the same for other AI frameworks; you only need to modify some parameters. A minimal sketch of such a debugging script follows this list.
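
The following is a minimal sketch of a PyTorch distributed debugging script, not the exact code used in this document. The file name, model, tensor shapes, and hyperparameters are illustrative; the rank-related environment variables are assumed to be injected by the launcher (for example, torchrun) or the training environment, and any OBS paths you add must be replaced with your own.

```python
# debug_ddp.py -- minimal DistributedDataParallel debugging sketch (illustrative only)
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT, and LOCAL_RANK are expected
    # to be set by the launcher (e.g. torchrun) or the training environment.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))

    if torch.cuda.is_available():
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
        device_ids = [local_rank]
    else:
        device = torch.device("cpu")
        device_ids = None

    # A tiny model is enough to verify that the process group and gradient
    # synchronization work before submitting a full remote training job.
    model = nn.Linear(10, 1).to(device)
    ddp_model = DDP(model, device_ids=device_ids)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    for step in range(3):
        inputs = torch.randn(8, 10, device=device)   # dummy batch
        targets = torch.randn(8, 1, device=device)
        loss = loss_fn(ddp_model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if dist.get_rank() == 0:
            print(f"step {step}: loss={loss.item():.4f}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Such a script would typically be launched on each node with a command such as `torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<rank> --master_addr=<addr> debug_ddp.py` (values are placeholders), or through the distributed debugging support of development tools such as PyCharm, VS Code, or JupyterLab.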

Related Chapters