
Single-Node Multi-Card Training Using DataParallel

This section describes how to perform single-node multi-card parallel training based on the PyTorch engine.

For details about distributed training using the MindSpore engine, see the MindSpore official website.

Training Process

The process of single-node multi-card parallel training is as follows:

  1. The model is replicated to multiple GPUs.
  2. Each batch of data is split evenly across the worker GPUs.
  3. Each GPU performs forward propagation independently and produces its own output.
  4. The master GPU (device ID 0) gathers the outputs from all GPUs and calculates the loss.
  5. The master GPU distributes the loss to the worker GPUs, and each GPU performs backward propagation to compute its gradients.
  6. The master GPU collects the gradients, updates the model parameters, and distributes the updated parameters to the worker GPUs.

The detailed flowchart is as follows; a minimal code sketch after the figure shows how these steps map onto one training iteration.

Figure 1 Single-node multi-card parallel training
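For reference, the following minimal sketch (using a hypothetical toy model and synthetic data) shows how the six steps above map onto one training iteration. DataParallel performs the replication, scatter, and gather implicitly inside the forward call.

import torch

# Hypothetical toy model; any torch.nn.Module behaves the same way.
model = torch.nn.DataParallel(torch.nn.Linear(32, 4).cuda())  # step 1: wrap the model; replicas are
                                                              # created on the visible GPUs at each forward pass
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(64, 32).cuda()           # one full batch, placed on GPU 0
labels = torch.randint(0, 4, (64,)).cuda()

outputs = model(inputs)            # steps 2-4: the batch is scattered, each replica runs forward,
                                   # and the outputs are gathered on GPU 0
loss = criterion(outputs, labels)  # step 4: the loss is calculated on GPU 0
loss.backward()                    # step 5: each replica runs backward; gradients are reduced
                                   # onto the parameters held on GPU 0
optimizer.step()                   # step 6: parameters are updated on GPU 0 and re-broadcast
                                   # to the replicas on the next forward pass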

Advantages and Disadvantages

  • Straightforward coding: Only one line of code needs to be modified.
  • Communication bottleneck: The master GPU updates and distributes the model parameters, which incurs high communication overhead.
  • Unbalanced GPU load: The master GPU gathers the outputs, calculates the loss, and updates the weights, so its memory usage and utilization are higher than those of the other GPUs (see the snippet after this list).
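The load imbalance can be observed directly. The snippet below is an illustrative check (not part of the required modifications) that prints the memory allocated on each visible GPU; after a few DataParallel training iterations, device 0 typically reports the largest allocation because it holds the gathered outputs and the master copy of the parameters.

import torch

# Print the memory currently allocated on each visible GPU.
for i in range(torch.cuda.device_count()):
    allocated_mb = torch.cuda.memory_allocated(i) / 1024**2
    print(f"cuda:{i} allocated: {allocated_mb:.1f} MiB")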

Code Modifications

Model distribution: DataParallel(model)

Only a minor code change is required, as shown in the following simple example:

import torch

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

model = Net().cuda()

### DataParallel Begin ###
# The only required change: wrap the model so each input batch
# is split across all visible GPUs.
model = torch.nn.DataParallel(model)
### DataParallel End ###
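For context, the following is a sketch of how the wrapped model is typically driven in a training loop; the synthetic dataset, batch size, and optimizer settings are illustrative only. Note that the batch passed to the DataParallel model is split evenly across the visible GPUs, so each GPU processes batch_size divided by the number of GPUs samples per step.

import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data sized to match the Net above (10 input features, 2 classes).
dataset = TensorDataset(torch.randn(512, 10), torch.randint(0, 2, (512,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True)  # each batch of 64 is split across the GPUs

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # model is the DataParallel-wrapped model above

for inputs, labels in loader:
    inputs, labels = inputs.cuda(), labels.cuda()
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()

When saving a checkpoint, the original module remains accessible as model.module, so calling model.module.state_dict() avoids the module. prefix in the saved parameter keys.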