Multi-Node Multi-Card Training Using DistributedDataParallel
This section describes how to perform multi-node multi-card parallel training based on the PyTorch engine.
Training Process
Compared with DataParallel, DistributedDataParallel starts one process per device instead of a single process with multiple threads, which greatly improves compute resource utilization. Because it is built on torch.distributed, DistributedDataParallel has clear advantages over DataParallel in distributed scenarios. The training process is as follows (a minimal code sketch follows the list):
- Initialize the process group.
- Create the distributed parallel model. Each process holds an identical copy of the model and its parameters.
- Create a distributed sampler so that, in each mini-batch, every process loads a distinct subset of the original dataset.
- Organize parameters into buckets by shape or size; the buckets generally correspond to the layers of the network whose parameters require updates.
- Each process runs its own forward pass and computes gradients in the backward pass.
- Once all gradients in a bucket are ready, an all-reduce communication averages them across processes.
- Each GPU updates the model parameters with the averaged gradients, keeping all replicas synchronized.
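The steps above map onto the PyTorch API roughly as in the sketch below. This is a minimal sketch rather than the sample code referenced later in this document: it assumes one process has already been launched per GPU by a launcher such as torchrun (which sets the RANK, WORLD_SIZE, and LOCAL_RANK environment variables), and the dataset, model class, and hyperparameters are placeholders.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train(dataset, model_cls, epochs=2):
    # 1. Initialize the process group; rank and world size are taken from the
    #    environment variables set by the launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # 2. Every process builds the same model; DDP broadcasts rank 0's weights
    #    so all replicas start identical.
    model = DDP(model_cls().cuda(), device_ids=[local_rank])

    # 3. The distributed sampler assigns each process a disjoint shard of the data.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards differently each epoch
        for inputs, labels in loader:
            inputs, labels = inputs.cuda(), labels.cuda()
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), labels)
            # 4-6. backward() computes local gradients; DDP all-reduces them
            #      bucket by bucket as each bucket's gradients become ready.
            loss.backward()
            # 7. Every process applies the same averaged gradients.
            optimizer.step()

    dist.destroy_process_group()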
Advantages
- Fast communication: gradients are all-reduced bucket by bucket, overlapping communication with the backward pass
- Balanced load: every process does the same work, with no single master GPU gathering outputs
- Fast running speed: multiple processes avoid the single-process bottleneck of DataParallel
Code Modifications
- Multi-process startup (see the launch sketch after the code snippet below)
- New variables such as the rank ID and world_size, passed to process-group initialization (for example, over the TCP protocol)
- A sampler for data distribution, so that different processes do not load duplicate data (see the sampler sketch below)
- Model distribution: wrap the model with DistributedDataParallel(model)
- The model is saved only by the process on GPU 0 (rank 0) (see the checkpoint sketch below)
import torch

class Net(torch.nn.Module):
    pass

# Single-card version, replaced by the wrapped model below.
model = Net().cuda()

### DistributedDataParallel Begin ###
# Requires torch.distributed.init_process_group() to have been called first.
model = torch.nn.parallel.DistributedDataParallel(Net().cuda())
### DistributedDataParallel End ###
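For the multi-process startup and the new rank/world_size variables, one common pattern is to spawn one process per local GPU and have every process join the group over TCP. The sketch below is an illustration under assumptions, not the document's sample code: the master address and port are placeholders, and a real multi-node job would also combine the node rank with the local rank to compute the global rank.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, world_size):
    # All processes rendezvous at rank 0's address over TCP.
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://192.168.0.1:23456",  # placeholder master address and port
        rank=local_rank,                        # global rank; equals the local rank on a single node
        world_size=world_size,
    )
    torch.cuda.set_device(local_rank)
    # ... build the model, wrap it with DistributedDataParallel, and train ...
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # one process per GPU on this node
    mp.spawn(worker, args=(world_size,), nprocs=world_size)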
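The distributed sampler can be attached to an ordinary DataLoader as shown below. This sketch assumes the process group has already been initialized; the toy TensorDataset and the epoch count stand in for whatever the training script already defines.

import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset standing in for the real training set.
train_dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

# Each process receives a non-overlapping shard of the dataset.
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=False,          # shuffling is delegated to the sampler
    sampler=train_sampler,
)

for epoch in range(10):
    train_sampler.set_epoch(epoch)  # different shuffle order every epoch
    for inputs, labels in train_loader:
        pass  # forward/backward/step as usual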
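Because every process holds an identical copy of the parameters after each update, the checkpoint only needs to be written once, by rank 0. A minimal sketch, assuming model is the DistributedDataParallel-wrapped module from the snippet above and the output path is a placeholder:

import torch
import torch.distributed as dist

# Only rank 0 writes the checkpoint; model.module unwraps the DDP wrapper so the
# saved state_dict can later be loaded into a plain, non-wrapped model.
if dist.get_rank() == 0:
    torch.save(model.module.state_dict(), "checkpoint.pth")  # placeholder path

# Keep the other processes from racing ahead (e.g., exiting) before the save finishes.
dist.barrier()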
Related Operations
- For details about distributed debugging adaptation and code example, see Distributed Debugging Adaptation and Code Example.
- This document also provides a complete code sample of distributed parallel training for a ResNet18 classification task on the CIFAR-10 dataset. For details, see Sample Code of Distributed Training.