Overview
Distributed Training
Distributed training accelerates deep learning by splitting a training task across multiple compute nodes, such as servers or GPUs, so that larger models and datasets can be handled. Each node processes a portion of the workload and shares its computed results with the other nodes through a communication backend, so the full model is trained efficiently. This approach significantly shortens training time, particularly for complex models and large datasets.
ModelArts enables distributed training by automatically managing node communications and resources for effective parallel processing.
ModelArts provides the following capabilities:
- Extensive built-in images to meet your requirements
- Custom development environments set up based on the built-in images
- Extensive tutorials to help you quickly understand distributed training
- Distributed training debugging in development tools such as PyCharm, VS Code, and JupyterLab
ModelArts supports the following two approaches:
- Single-node multi-PU data parallelism (DP): Multiple GPUs work together on one server to speed up training using data parallelism. This approach maximizes the use of all available GPU resources on a single server.
- Distributed data parallelism (DDP): Multiple servers work together, with each server using several GPUs to increase training capacity. It works well for handling large datasets or complex models.
Constraints
- If the notebook instance flavor is changed, you can only perform single-node debugging; you cannot perform distributed debugging or submit remote training jobs.
- Only the PyTorch and MindSpore AI frameworks can be used for multi-node distributed debugging. If you want to use MindSpore, each node must be equipped with eight PUs.
- The OBS paths in the debugging code should be replaced with your OBS paths.
- PyTorch is used for the debugging code in this document. The process is the same for other AI frameworks; you only need to modify some parameters.
Billing
Model training in ModelArts uses compute and storage resources, which are billed. Compute resources are billed for running training jobs. Storage resources are billed for storing data in OBS or SFS. For details, see Model Training Billing Items.
Advantages and Disadvantages of Single-Node Multi-PU Training Using DataParallel
- Straightforward coding: Only one line of code needs to be modified.
- Communication bottleneck: The master GPU updates and distributes the model parameters, which incurs high communication costs.
- Unbalanced GPU load: The master GPU gathers the outputs, calculates the loss, and updates the weights, so its memory usage and load are higher than those of the other GPUs.
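For reference, the one-line modification typically looks like the following. This is a minimal PyTorch sketch, not code from the ModelArts tutorials; the toy model, batch shapes, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

# Toy model and data as placeholders; replace with your own model and DataLoader.
model = nn.Linear(128, 10)
criterion = nn.CrossEntropyLoss()

if torch.cuda.device_count() > 1:
    # The single added line: replicate the model across all visible GPUs.
    model = nn.DataParallel(model)
model = model.cuda()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(64, 128).cuda()    # DataParallel splits this batch across the GPUs
labels = torch.randint(0, 10, (64,)).cuda()

optimizer.zero_grad()
outputs = model(inputs)                 # per-GPU outputs are gathered on the master GPU
loss = criterion(outputs, labels)       # loss and the weight update run on the master GPU
loss.backward()
optimizer.step()
```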
Advantages of Multi-Node Multi-PU Training Using DistributedDataParallel
- Fast communication: Gradients are synchronized across processes with all-reduce instead of being gathered on a master GPU.
- Balanced load: Each GPU runs its own process with a full model replica, so no single GPU is overloaded.
- Fast running speed
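For reference, a minimal DistributedDataParallel sketch is shown below. It is not code from the ModelArts tutorials; it assumes the script is launched with one process per GPU (for example, with torchrun, which sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables), and the toy model and dataset are placeholders.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# One process per GPU; the launcher (e.g. torchrun) sets RANK, LOCAL_RANK, and WORLD_SIZE.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Toy model and dataset as placeholders; replace with your own.
model = nn.Linear(128, 10).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
sampler = DistributedSampler(dataset)      # gives each process a distinct shard of the data
loader = DataLoader(dataset, batch_size=64, sampler=sampler)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(2):
    sampler.set_epoch(epoch)               # reshuffle the shards each epoch
    for inputs, labels in loader:
        inputs, labels = inputs.cuda(local_rank), labels.cuda(local_rank)
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()                    # gradients are all-reduced across all processes
        optimizer.step()

dist.destroy_process_group()
```

With torchrun, such a script would typically be started on each node with a command along the lines of torchrun --nnodes=<node count> --nproc_per_node=<GPUs per node> --node_rank=<rank> --master_addr=<IP of node 0> train.py; the exact launch procedure on ModelArts is described in the related chapters below.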
Related Chapters
- Creating a Single-Node Multi-PU Distributed Training Job (DataParallel): describes single-node multi-PU training using DataParallel, and corresponding code modifications.
- Creating a Multiple-Node Multi-PU Distributed Training Job (DistributedDataParallel): describes multi-node multi-PU training using DistributedDataParallel, and corresponding code modifications.
- Example: Creating a DDP Distributed Training Job (PyTorch + GPU): describes the procedure for adapting code for distributed debugging and provides a code example.
- Example: Creating a DDP Distributed Training Job (PyTorch + NPU): provides a complete code sample of distributed parallel training for the classification task of ResNet18 on the CIFAR-10 dataset.
- Debugging a Training Job: describes how to use the SDK to debug a single-node or multi-node training job on the ModelArts development environment.