Updated on 2024-10-29 GMT+08:00
Overview
ModelArts provides the following capabilities:
- A wide range of built-in images to meet common training requirements
- Custom development environments set up using built-in images
- Extensive tutorials, helping you quickly understand distributed training
- Distributed training debugging in development tools such as PyCharm, VS Code, and JupyterLab
Constraints
- If the instance flavors are changed, you can only perform single-node debugging. You cannot perform distributed debugging or submit remote training jobs.
- Only the PyTorch and MindSpore AI frameworks can be used for multi-node distributed debugging. If you want to use MindSpore, each node must be equipped with eight cards.
- Replace the OBS paths in the debugging code with your own OBS paths.
- The debugging code in this document is written in PyTorch. The process is the same for other AI frameworks; only some parameters need to be modified.
Advantages and Disadvantages of Single-Node Multi-Card Training Using DataParallel
- Straightforward coding: only one line of code needs to be modified (see the sketch after this list).
- Communication bottleneck: the master GPU updates and distributes the model parameters, which incurs high communication costs.
- Unbalanced GPU load: the master GPU gathers the outputs, calculates the loss, and updates the weights, so its memory usage and utilization are higher than those of the other GPUs.
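A minimal sketch of the one-line change mentioned above (the model definition is a placeholder, not code from this document):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))  # placeholder model

if torch.cuda.device_count() > 1:
    # The single line that enables single-node multi-card training:
    # each input batch is split along the batch dimension and scattered to all
    # visible GPUs; the outputs are gathered back on the master GPU, which is
    # why its memory usage and load are higher than those of the other GPUs.
    model = nn.DataParallel(model)

model = model.cuda()
```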
Advantages of Multi-Node Multi-Card Training Using DistributedDataParallel
- Fast communication
- Balanced load
- Fast running speed (a minimal adaptation sketch follows this list)
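A minimal sketch of adapting a training script to DistributedDataParallel. It assumes a torchrun-style launch that sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables; the model and dataset are placeholders, not code from this document:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # One process per GPU; the launcher provides rank and world-size settings.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)    # placeholder model
    model = DDP(model, device_ids=[local_rank])          # gradients are all-reduced across processes

    # Placeholder dataset; DistributedSampler gives each process a distinct shard.
    dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)                         # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()              # backward() triggers gradient all-reduce
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

With this structure, the same script can be started locally with, for example, `torchrun --nproc_per_node=8 train.py`; the chapters listed below describe the code modifications and job creation steps in detail.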
Related Chapters
- Creating a Single-Node Multi-Card Distributed Training Job (DataParallel): describes single-node multi-card training using DataParallel, and corresponding code modifications.
- Creating a Multiple-Node Multi-Card Distributed Training Job (DistributedDataParallel): describes multi-node multi-card training using DistributedDataParallel, and corresponding code modifications.
- Example: Creating a DDP Distributed Training Job (PyTorch + GPU): describes the procedure and code example of distributed debugging adaptation.
- Example: Creating a DDP Distributed Training Job (PyTorch + NPU): provides a complete code sample of distributed parallel training for the classification task of ResNet18 on the CIFAR-10 dataset.
- Debugging a Training Job: describes how to use the SDK to debug a single-node or multi-node training job on the ModelArts development environment.
Parent topic: Distributed Model Training