
Multi-Node Multi-Card Training Using DistributedDataParallel

This section describes how to perform multi-node multi-card parallel training based on the PyTorch engine.

Training Process

Unlike DataParallel, which runs in a single process, DistributedDataParallel starts multiple processes for computing, greatly improving compute resource usage. Built on torch.distributed, DistributedDataParallel has clear advantages over DataParallel in distributed computing scenarios. The process is as follows:

  1. Initialize the process group.
  2. Create a distributed parallel model. Each process holds a replica of the same model and parameters.
  3. Create a distributed sampler for data distribution so that each process loads a unique subset of the original dataset in each mini-batch.
  4. Organize parameters into buckets based on their shapes or sizes. A bucket generally corresponds to a layer of the network whose parameters require updating.
  5. Each process performs its own forward propagation and computes its gradients.
  6. Once all parameter gradients in a bucket are ready, communication is performed across processes to average the gradients.
  7. Each GPU updates its model parameters.

The detailed flowchart is as follows; a code sketch of these steps is provided after the figure.

Figure 1 Multi-node multi-card parallel training
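The steps above can be sketched in code as follows. This is a minimal sketch rather than a complete ModelArts training script: the model, dataset, batch size, and learning rate are placeholder assumptions, and the rank, world size, and master address are assumed to be provided by the launcher (for example, torchrun) through environment variables.

import os
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

def train():
    # Step 1: initialize the process group. RANK, WORLD_SIZE, MASTER_ADDR, and
    # MASTER_PORT are assumed to be set by the launcher.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Step 2: create a distributed parallel model; every process holds the same replica.
    model = torch.nn.Linear(10, 1).cuda()  # placeholder model
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # Step 3: the distributed sampler gives each process a unique subset of the dataset.
    dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))  # placeholder data
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # ensure a different shuffle on each epoch
        for x, y in loader:
            x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            # Step 5: each process runs its own forward pass and computes its loss.
            loss = loss_fn(model(x), y)
            # Steps 4 and 6: during backward(), DDP averages gradients across
            # processes bucket by bucket as each bucket becomes ready.
            loss.backward()
            # Step 7: each GPU applies the same averaged gradients to its replica.
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    train()

A script like this is typically started with one process per GPU on every node, for example with torchrun --nnodes=<number of nodes> --nproc_per_node=<GPUs per node> train.py.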

Advantages

  • Fast communication: gradients are synchronized with collective operations across processes instead of being gathered on a single GPU.
  • Balanced load: no single GPU has to gather the outputs of all the others and compute the loss alone.
  • Fast running speed: one process per GPU avoids the single-process, multi-thread overhead of DataParallel.

Code Modifications

  • Multi-process startup (see the launch sketch after the code example below)
  • New variables such as the rank ID and world_size, used together with the TCP protocol to initialize the process group
  • Sampler for data distribution, to avoid duplicate data across processes
  • Model distribution: DistributedDataParallel(model)
  • Model saved only by the process on GPU 0 (rank 0)
import torch
import torch.distributed as dist

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 10)  # placeholder layer; define your own model here

model = Net().cuda()

### DistributedDataParallel Begin ###
# The process group must be initialized before the model is wrapped.
dist.init_process_group(backend="nccl")
model = torch.nn.parallel.DistributedDataParallel(model)
### DistributedDataParallel End ###
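The remaining modifications (multi-process startup, the rank ID and world_size variables with TCP initialization, and saving the model only on GPU 0) can be sketched as follows. This is a minimal illustration under assumed settings: the master address 192.168.0.1, port 29500, the two-node cluster size, and the checkpoint path model.pth are placeholders to be replaced with your own values.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(local_rank, node_rank, gpus_per_node, world_size):
    # Each process computes its global rank from the node index and its local GPU index.
    rank = node_rank * gpus_per_node + local_rank
    # TCP initialization: every process connects to the master node (placeholder address and port).
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://192.168.0.1:29500",
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(local_rank)
    model = torch.nn.Linear(10, 1).cuda()  # placeholder model
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # ... training loop ...

    # Save the model only on GPU 0 (global rank 0) so that a single checkpoint is written.
    if rank == 0:
        torch.save(model.module.state_dict(), "model.pth")
    dist.destroy_process_group()

if __name__ == "__main__":
    gpus_per_node = torch.cuda.device_count()
    node_rank = 0                      # index of this node; set it per node
    world_size = gpus_per_node * 2     # assuming two nodes in this example
    # Multi-process startup: one worker process per GPU on this node.
    mp.spawn(worker, args=(node_rank, gpus_per_node, world_size), nprocs=gpus_per_node)

The same effect can be achieved without mp.spawn by letting torchrun start the processes, in which case the rank-related values are read from environment variables instead of being computed in the script.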

Related Operations