Multi-Node Multi-Card Training Using DistributedDataParallel
This section describes how to perform multi-node multi-card parallel training based on the PyTorch engine.
Training Process
Compared with DataParallel, DistributedDataParallel starts one process per device instead of a single process with multiple threads, which greatly improves compute resource utilization. Because it is built on torch.distributed, DistributedDataParallel has clear advantages over DataParallel in distributed scenarios. The training process is as follows (a minimal code sketch follows the list):
- Initialize the process group.
- Create the distributed parallel model. Each process holds an identical copy of the model and its parameters.
- Create a distributed sampler so that, in each mini-batch, every process loads a distinct subset of the original dataset.
- Organize parameters into buckets by shape or size; the buckets generally correspond to the layers of the network whose parameters require updates.
- Each process runs its own forward pass and computes gradients in the backward pass.
- Once all gradients in a bucket are ready, an all-reduce communication averages them across processes.
- Each GPU updates the model parameters with the averaged gradients, keeping all replicas synchronized.
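The steps above map onto the PyTorch API roughly as in the sketch below. This is a minimal sketch rather than the sample code referenced later in this document: it assumes one process has already been launched per GPU by a launcher such as torchrun (which sets the RANK, WORLD_SIZE, and LOCAL_RANK environment variables), and the dataset, model class, and hyperparameters are placeholders.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train(dataset, model_cls, epochs=2):
    # 1. Initialize the process group; rank and world size are taken from the
    #    environment variables set by the launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 2. Every process builds the same model; DDP broadcasts rank 0's weights
    #    so all replicas start identical.
    model = DDP(model_cls().cuda(), device_ids=[local_rank])

    # 3. The distributed sampler assigns each process a disjoint shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards differently each epoch
        for inputs, labels in loader:
            inputs, labels = inputs.cuda(), labels.cuda()
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            # 4-6. backward() computes local gradients; DDP all-reduces them
            #      bucket by bucket as each bucket's gradients become ready.
            loss.backward()
            # 7. Every process applies the same averaged gradients.
            optimizer.step()

    dist.destroy_process_group()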
Advantages
- Fast communication: gradients are all-reduced bucket by bucket, overlapping communication with the backward pass
- Balanced load: every process does the same work, with no single master GPU gathering outputs
- Fast running speed: multiple processes avoid the single-process bottleneck of DataParallel
Code Modifications
- Multi-process startup (see the launch sketch after the code snippet below)
- New variables such as the rank ID and world_size, passed to process-group initialization (for example, over the TCP protocol)
- A sampler for data distribution, so that different processes do not load duplicate data (see the sampler sketch below)
- Model distribution: wrap the model with DistributedDataParallel(model)
- The model is saved only by the process on GPU 0 (rank 0) (see the checkpoint sketch below)
import torch

class Net(torch.nn.Module):
    pass

# Single-card version, replaced by the wrapped model below.
model = Net().cuda()

### DistributedDataParallel Begin ###
# Requires torch.distributed.init_process_group() to have been called first.
model = torch.nn.parallel.DistributedDataParallel(Net().cuda())
### DistributedDataParallel End ###
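For the multi-process startup and the new rank/world_size variables, one common pattern is to spawn one process per local GPU and have every process join the group over TCP. The sketch below is an illustration under assumptions, not the document's sample code: the master address and port are placeholders, and a real multi-node job would also combine the node rank with the local rank to compute the global rank.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, world_size):
    # All processes rendezvous at rank 0's address over TCP.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://192.168.0.1:23456",  # placeholder master address and port
        rank=local_rank,                        # global rank; equals the local rank on a single node
        world_size=world_size,
    )
    torch.cuda.set_device(local_rank)
    # ... build the model, wrap it with DistributedDataParallel, and train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # one process per GPU on this node
    mp.spawn(worker, args=(world_size,), nprocs=world_size)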
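The distributed sampler can be attached to an ordinary DataLoader as shown below. This sketch assumes the process group has already been initialized; the toy TensorDataset and the epoch count stand in for whatever the training script already defines.

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset standing in for the real training set.
train_dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

# Each process receives a non-overlapping shard of the dataset.
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=False,          # shuffling is delegated to the sampler
    sampler=train_sampler,
)

for epoch in range(10):
    train_sampler.set_epoch(epoch)  # different shuffle order every epoch
    for inputs, labels in train_loader:
        pass  # forward/backward/step as usual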
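Because every process holds an identical copy of the parameters after each update, the checkpoint only needs to be written once, by rank 0. A minimal sketch, assuming model is the DistributedDataParallel-wrapped module from the snippet above and the output path is a placeholder:

import torch
import torch.distributed as dist

# Only rank 0 writes the checkpoint; model.module unwraps the DDP wrapper so the
# saved state_dict can later be loaded into a plain, non-wrapped model.
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "checkpoint.pth")  # placeholder path

# Keep the other processes from racing ahead (e.g., exiting) before the save finishes.
dist.barrier()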
Related Operations
- For details about distributed debugging adaptation and code example, see Distributed Debugging Adaptation and Code Example.
- This document also provides a complete code sample of distributed parallel training for a ResNet18 classification task on the CIFAR-10 dataset. For details, see Sample Code of Distributed Training.