Creating a Single-Node Multi-GPU Distributed Training Job (DataParallel)
As deep learning models grow larger, their training times increase. Efficient parallel computing methods are essential for faster training. A major challenge in single-node setups is maximizing the use of multiple GPUs. This section explains how to conduct single-node, multi-GPU data parallel training using PyTorch. Properly splitting data and syncing models across devices allows full utilization of GPU resources, significantly boosting training speed.
For details about distributed training with the MindSpore engine, visit the MindSpore official website. You can select the required version in the upper left corner.
Training Process
The process of single-node multi-GPU parallel training is as follows:
- The model is replicated to each of the GPUs.
- Each batch of data is split evenly across the worker GPUs.
- Each GPU performs forward propagation independently and produces its own output.
- The master GPU (device ID 0) gathers the outputs from all GPUs and computes the loss.
- The master GPU distributes the loss to the worker GPUs, and each GPU performs backward propagation independently to compute its gradients.
- The master GPU gathers the gradients, updates the model parameters, and broadcasts the updated parameters to the worker GPUs.
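The scatter/gather behavior described above can be observed from a single forward call. The following is a minimal sketch, assuming at least one visible CUDA device; the toy Linear layer, feature sizes, and batch size are illustrative and not part of the original document.

import torch

# Minimal sketch: wrap a toy model with DataParallel and run one forward pass.
model = torch.nn.Linear(16, 4).cuda()    # model lives on the master GPU (cuda:0)
dp_model = torch.nn.DataParallel(model)  # replicated to all visible GPUs on each forward

inputs = torch.randn(32, 16).cuda()      # one full batch on the master GPU
outputs = dp_model(inputs)               # batch is split across GPUs; per-GPU outputs
                                         # are gathered back on cuda:0
print(outputs.shape, outputs.device)     # torch.Size([32, 4]) cuda:0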
(Figure: detailed flowchart of the single-node multi-GPU DataParallel training process.)
Code Modifications
Model distribution: DataParallel(model)
Only a minor code change is required. The following is a simple example:
import torch

class Net(torch.nn.Module):
    pass

model = Net().cuda()

### DataParallel Begin ###
model = torch.nn.DataParallel(Net().cuda())
### DataParallel End ###
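For context, a complete training step with the wrapped model might look like the sketch below. The placeholder network body, loss function, optimizer, and the random tensors standing in for a real batch are assumptions for illustration, not part of the original example.

import torch

class Net(torch.nn.Module):
    # Placeholder network; any torch.nn.Module works the same way.
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(16, 4)

    def forward(self, x):
        return self.fc(x)

model = torch.nn.DataParallel(Net().cuda())               # wrap once, then use like a normal module
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # optimizer holds the master copy's parameters
criterion = torch.nn.CrossEntropyLoss()

inputs = torch.randn(32, 16).cuda()                       # stand-in batch on the master GPU
labels = torch.randint(0, 4, (32,)).cuda()

optimizer.zero_grad()
outputs = model(inputs)             # scattered to worker GPUs, outputs gathered on cuda:0
loss = criterion(outputs, labels)   # loss computed on the master GPU
loss.backward()                     # per-GPU gradients are reduced onto the master copy
optimizer.step()                    # parameters updated; replicas are refreshed on the next forward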