
PyTorch

ModelArts provides multiple AI frameworks as training engines. When you use one of these engines for model training, the boot command must be adapted accordingly. This section describes how to adapt the boot command for the PyTorch engine.

PyTorch Startup Principle

Specifications and number of nodes

The specification GPU: 8 × GP-Vnt1 | CPU: 72 cores | Memory: 512 GB is used as an example below to describe how ModelArts allocates resources for single-node and distributed jobs.

For a single-node job (running on only one node), ModelArts starts a training container that exclusively uses the resources on the node.

For a distributed job (running on more than one node), ModelArts starts one worker for each node selected during job creation, and each worker is allocated the compute resources of the selected specification. For example, if two compute nodes are selected, two workers are started, and each worker owns the compute resources of GPU: 8 × GP-Vnt1 | CPU: 72 cores | Memory: 512 GB.

Network communication

  • For a single-node job, no network communication is required.
  • For a distributed job, network communication is required both within each node and between nodes.

Within a node

NVLink and shared memory are used for communication.

Between nodes

If there is more than one compute node, PyTorch distributed training is started. The following figure shows the network communication between workers in PyTorch distributed training. Workers can communicate with each other over the container network or over a 100-Gbit/s InfiniBand or RoCE NIC. (RoCE NICs are described separately for the specifications that provide them.) On the container network, containers reach each other through DNS domain names, which suits small-scale point-to-point communication with moderate network-performance requirements. The InfiniBand and RoCE NICs suit distributed training jobs that use collective communication and require high network performance.

Figure 1 Network communications for distributed training
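
In PyTorch, these two transport layers map to the choice of collective-communication backend: NCCL can take advantage of NVLink within a node and InfiniBand/RoCE between nodes when they are present, while Gloo falls back to TCP over the container network. The following minimal sketch shows only the backend choice; it is not ModelArts-specific, and the localhost address is a placeholder so that it can run as a single process:

  import torch
  import torch.distributed as dist

  # NCCL (for GPU jobs) can use NVLink within a node and InfiniBand/RoCE
  # between nodes; Gloo uses TCP over the container network.
  backend = "nccl" if torch.cuda.is_available() else "gloo"
  dist.init_process_group(backend=backend,
                          init_method="tcp://127.0.0.1:29500",  # placeholder rendezvous
                          rank=0, world_size=1)
  dist.barrier()  # a trivial collective, just to verify the setup
  dist.destroy_process_group()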

Boot Commands

The training service uses the default python interpreter in the job image to start the training script. To find this interpreter, run the which python command in the image. The working directory at startup is /home/ma-user/user-job-dir/<The code directory name>, which is what pwd returns in a shell and what os.getcwd() returns in Python.
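
As a quick check, the training script itself can print both values at startup. A minimal sketch (plain Python, nothing ModelArts-specific):

  import os
  import sys

  # The interpreter path should match `which python` in the job image, and
  # the working directory should be /home/ma-user/user-job-dir/<code dir>.
  print("interpreter:", sys.executable)
  print("working dir:", os.getcwd())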

  • Boot command for single-card single-node
    python <Relative path of the startup file> <Job parameters>
    • Relative path of the startup file: path of the startup file relative to /home/ma-user/user-job-dir/<The code directory name>
    • Job parameters: parameters configured for a training job
    Figure 2 Creating a training job

    Configure the parameters as shown in the figure above. The console then runs the following command in the background:

    python /home/ma-user/modelarts/user-job-dir/gpu-train/train.py --epochs 5
  • Boot command for multi-card single-node
    python <Relative path of the startup file> --init_method "tcp://${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}:${port}" <Job parameters>
    • Relative path of the startup file: path of the startup file relative to /home/ma-user/user-job-dir/<The code directory name>
    • ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}: domain name of the container where worker-0 is located. For details, see Default environment variables.
    • port: default communication port of the container where worker-0 is located
    • Job parameters: parameters configured for a training job
    Figure 3 Creating a training job

    Configure the parameters as shown in the figure above. The console then runs the following command in the background:

    python /home/ma-user/modelarts/user-job-dir/gpu-train/train.py --init_method "tcp://${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}:${port}" --epochs 5
  • Boot command for multi-card multi-node
    python <Relative path of the startup file> --init_method "tcp://${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}:${port}" --rank <rank_id> --world_size <node_num> <Job parameters>
    • Relative path of the startup file: path of the startup file relative to /home/ma-user/user-job-dir/<The code directory name>
    • ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}: domain name of the container where worker-0 is located. For details, see Default environment variables.
    • port: default communication port of the container where worker-0 is located
    • rank_id: serial number of the current worker
    • node_num: number of workers
    • Job parameters: parameters configured for a training job
    Figure 4 Creating a training job

    Configure the parameters as shown in the figure above. The console then runs the following command in the background (a sketch of a training script that consumes these parameters follows this list):

    python /home/ma-user/modelarts/user-job-dir/gpu-train/train.py --init_method "tcp://${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}:${port}" --rank "${rank_id}" --world_size "${node_num}" --epochs 5
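
To show how these parameters reach the training code, the following is a minimal, hypothetical train.py that accepts the arguments used in the boot commands above. It is a sketch under assumptions, not ModelArts reference code: only the argument names (--init_method, --rank, --world_size, --epochs) come from the commands above; the model and data are placeholders.

  import argparse

  import torch
  import torch.distributed as dist

  def parse_args():
      # Argument names mirror the boot commands above; the defaults let
      # the script also run as a plain single-node single-card job.
      parser = argparse.ArgumentParser()
      parser.add_argument("--init_method", default=None,
                          help="tcp://<worker-0 domain name>:<port>")
      parser.add_argument("--rank", type=int, default=0,
                          help="serial number of this worker")
      parser.add_argument("--world_size", type=int, default=1,
                          help="total number of workers")
      parser.add_argument("--epochs", type=int, default=5)
      return parser.parse_args()

  def main():
      args = parse_args()

      # --init_method is passed only for distributed jobs; worker-0's
      # domain name and port serve as the rendezvous point for all workers.
      if args.init_method is not None:
          dist.init_process_group(
              backend="nccl" if torch.cuda.is_available() else "gloo",
              init_method=args.init_method,
              rank=args.rank,
              world_size=args.world_size,
          )

      # Placeholder training loop; a real job would build its own model,
      # wrap it in DistributedDataParallel, and read real data.
      model = torch.nn.Linear(10, 1)
      optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
      for epoch in range(args.epochs):
          optimizer.zero_grad()
          loss = model(torch.randn(4, 10)).sum()
          loss.backward()
          optimizer.step()
          print(f"epoch {epoch}: loss {loss.item():.4f}")

      if dist.is_initialized():
          dist.destroy_process_group()

  if __name__ == "__main__":
      main()

With a script shaped like this, the single-node command runs it without starting torch.distributed, while the multi-node command makes every worker rendezvous at worker-0 before the training loop begins.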