Updated on 2025-08-18 GMT+08:00

Enabling Dynamic Route Acceleration for Training Jobs

In distributed training, cross-node communication can become a bottleneck: data transfer is slow and network bandwidth is poorly utilized. To address this, ModelArts provides dynamic route acceleration, which optimizes the network paths used by training jobs to improve overall performance. This section describes how to enable the feature for both preset frameworks and custom images, and how to view the related logs.

Notes and Constraints

  • Dynamic route acceleration is available only in dedicated resource pools.
  • The resources must be Ascend Snt9b or Snt9b23, and all processing units (PUs) on each node must be used.
  • The training job must use at least three compute nodes.

Scenario 1: Using the Ascend-Powered-Engine Preset Image, MindSpore, and NPUs for Training

When creating a training job with the Ascend-Powered-Engine preset image, configure the parameters in Table 1 to enable dynamic route acceleration. The table describes only the key parameters; set the others as needed.

Table 1 Using a preset image to create a training job

Environment settings
  • Algorithm Type: Select Custom algorithm.
  • Boot Mode: Select Preset image.
  • Engine and Version: Select Ascend-Powered-Engine and a MindSpore engine version.
  • Code Directory: Select the OBS directory where the training code file is stored.
    Dynamic route acceleration improves network communication by adjusting rank IDs. To prevent communication errors, obtain and use rank IDs consistently throughout the code.
  • Boot File: Select the Python boot script of the training job in the code directory.

Training settings
  • Environment Variable: Add the following environment variable:
    ROUTE_PLAN=true
    Do not configure the MA_RUN_METHOD environment variable, and ensure that the boot file of the training job is started using the rank table file.

Resource settings
  • Resource Pool: Select a dedicated resource pool.
  • Specifications: Select instance specifications that meet the following requirements:
      ◦ All processing units (PUs) on each node must be used; otherwise, dynamic route acceleration may be less effective. For example, if a node has eight PUs, all eight must be used.
      ◦ The resources must be Ascend Snt9b or Snt9b23.
  • Compute Nodes: Select at least three compute nodes.
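The rank-consistency note above can be sketched as a small helper. This is a minimal illustration, assuming the platform exposes the global rank through the RANK_ID environment variable (the usual convention when an Ascend job is started with a rank table); the exact variable name may differ in your image:

```python
import os

def get_rank() -> int:
    """Return this process's global rank.

    Reads RANK_ID, which is set at startup when the job is launched
    with a rank table, and falls back to 0 for single-process runs.
    Routing all rank lookups through one helper like this keeps rank
    usage consistent, which dynamic route acceleration relies on.
    """
    return int(os.environ.get("RANK_ID", "0"))

def is_master() -> bool:
    """True only for rank 0, which typically writes checkpoints and logs."""
    return get_rank() == 0
```

For example, guard checkpointing with `if is_master(): ...` rather than comparing against a rank ID cached elsewhere in the code.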

Scenario 2: Using a Custom Image, PyTorch, and NPUs for Training

When creating a training job with a custom image in an Ascend resource pool, configure the parameters in Table 2 to enable dynamic route acceleration. The table describes only the key parameters; set the others as needed.

Table 2 Using a custom image to create a training job

Environment settings
  • Algorithm Type: Select Custom algorithm.
  • Boot Mode: Select Custom image.
  • Image: Select a custom image for training. The image must use the PyTorch framework.
  • Code Directory (Optional): Select the OBS directory where the training code file is stored.
    Dynamic route acceleration improves network communication by adjusting rank IDs. To prevent communication errors, obtain and use rank IDs consistently throughout the code.
  • Boot Command: Enter the Python boot command of the image. Modify the following code in the training boot script; the values vary according to the NPU hardware.
      ◦ Snt9b scenario
        MASTER_ADDR="${MA_VJ_NAME}-${MA_TASK_NAME}-${MA_MASTER_INDEX:-0}.${MA_VJ_NAME}"
        NODE_RANK="${RANK_AFTER_ACC:-$VC_TASK_INDEX}"
      ◦ Snt9b23 scenario
        MASTER_ADDR="${VC_WORKER_HOSTS%%,*}"
        NODE_RANK="${RANK_AFTER_ACC:-$VC_TASK_INDEX}"

Training settings
  • Environment Variable: Add the following environment variable:
    ROUTE_PLAN=true

Resource settings
  • Resource Pool: Select a dedicated resource pool.
  • Specifications: Select instance specifications that meet the following requirements:
      ◦ All processing units (PUs) on each node must be used; otherwise, dynamic route acceleration may be less effective. For example, if a node has eight PUs, all eight must be used.
      ◦ The resources must be Ascend Snt9b or Snt9b23.
  • Compute Nodes: Select at least three compute nodes.
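The MASTER_ADDR and NODE_RANK variables set in the boot command are typically passed on to a distributed launcher. The sketch below is a hypothetical boot script for the Snt9b23 case: it derives the two variables as in Table 2 and prints (rather than executes) a torchrun launch line for inspection. The fallback host list, node count of 3, port 29500, and train.py are assumptions for illustration; in a real job the platform sets VC_WORKER_HOSTS and VC_TASK_INDEX.

```shell
#!/bin/sh
# Example defaults so the script is runnable outside ModelArts; in a
# real job the platform provides VC_WORKER_HOSTS and VC_TASK_INDEX.
VC_WORKER_HOSTS="${VC_WORKER_HOSTS:-node-0,node-1,node-2}"
VC_TASK_INDEX="${VC_TASK_INDEX:-0}"

# Snt9b23 case from Table 2: the master is the first worker host; the
# node rank prefers RANK_AFTER_ACC, which dynamic route acceleration
# rewrites, and falls back to the scheduler's task index.
MASTER_ADDR="${VC_WORKER_HOSTS%%,*}"
NODE_RANK="${RANK_AFTER_ACC:-$VC_TASK_INDEX}"

# Print the launch line instead of executing it, so it can be checked.
echo torchrun \
  --nnodes=3 \
  --nproc_per_node=8 \
  --node_rank="$NODE_RANK" \
  --master_addr="$MASTER_ADDR" \
  --master_port=29500 \
  train.py
```

The --nnodes, --node_rank, and --master_addr flags are standard torchrun options; --nproc_per_node=8 matches the requirement that all eight PUs on a node be used.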

Viewing Training Logs of Dynamic Route Acceleration

When training in an Ascend resource pool, you can check whether dynamic route acceleration is enabled and view the logs of each rank on the Logs tab of the training job details page.

  1. Log in to the ModelArts management console.
  2. In the navigation pane on the left, choose Model Training > Training Jobs.
  3. In the training job list, click the target job to access its details page.
  4. Click the Logs tab.

    The logs show that dynamic route acceleration has been enabled for the training job, and you can search the logs by rank ID.

    Figure 1 Viewing dynamic route acceleration logs
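If you download the job logs for offline inspection, a local filter by rank ID can complement the console search. This is a generic sketch; the "rank <n>" tag pattern is an assumption and should be adjusted to the actual log format of your job:

```python
import re

def lines_for_rank(log_text: str, rank: int) -> list[str]:
    """Return the log lines that mention the given rank ID.

    Assumes each line tags its source like 'rank 3' or 'rank_3';
    adjust the pattern to match the job's real log format.
    """
    pattern = re.compile(rf"\brank[ _:-]?{rank}\b", re.IGNORECASE)
    return [line for line in log_text.splitlines() if pattern.search(line)]
```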