Updated on 2025-08-18 GMT+08:00

Managing Environment Variables of a Training Container

What Is an Environment Variable

This section describes environment variables preset in a training container. The environment variables include:

  • Path environment variables
  • Environment variables of a distributed training job
  • NVIDIA Collective Communications Library (NCCL) environment variables
  • OBS environment variables
  • Environment variables of the pip source
  • Environment variables of the API Gateway address
  • Environment variables of job metadata
  • Precheck environment variables
  • Suspension detection environment variables

Notes and Constraints

When defining custom environment variables, avoid using names that start with MA_ to prevent conflicts with system environment variables.
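
The naming constraint can be checked before a job is submitted. The following is a minimal sketch, not a ModelArts API: the function name find_reserved_names and the sample variables are hypothetical and serve only as an illustration.

```python
RESERVED_PREFIX = "MA_"  # prefix to avoid, per the constraint above


def find_reserved_names(custom_vars):
    """Return the custom variable names that start with the reserved prefix."""
    return sorted(name for name in custom_vars if name.startswith(RESERVED_PREFIX))


# Hypothetical custom variables a user might plan to add to a training job.
custom = {"BATCH_SIZE": "64", "MA_DEBUG": "1", "DATA_ROOT": "/cache/data"}
conflicts = find_reserved_names(custom)
if conflicts:
    print("Rename these variables:", ", ".join(conflicts))
```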

Configuring Environment Variables

When you create a training job, you can add environment variables or modify environment variables preset in the training container.

To ensure data security, do not enter sensitive information, such as plaintext passwords.

Environment Variables Preset in a Training Container

Table 1 through Table 8 list the environment variables preset in a training container.

The environment variable values are examples only.

Table 1 Path environment variables

Variable

Description

Example

PATH

Executable file paths

PATH=/usr/local/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

LD_LIBRARY_PATH

Search paths for dynamically loaded libraries

LD_LIBRARY_PATH=/usr/local/seccomponent/lib:/usr/local/cuda/lib64:/usr/local/cuda/compat:/root/miniconda3/lib:/usr/local/lib:/usr/local/nvidia/lib64

LIBRARY_PATH

Library search paths used at link time

LIBRARY_PATH=/usr/local/cuda/lib64/stubs

MA_HOME

Main directory of a training job

MA_HOME=/home/ma-user

MA_JOB_DIR

Parent directory of the training algorithm folder

MA_JOB_DIR=/home/ma-user/modelarts/user-job-dir

MA_MOUNT_PATH

Path mounted to a ModelArts training container, which is used to temporarily store training algorithms, algorithm input, algorithm output, and logs

MA_MOUNT_PATH=/home/ma-user/modelarts

MA_LOG_DIR

Training log directory

MA_LOG_DIR=/home/ma-user/modelarts/log

MA_SCRIPT_INTERPRETER

Training script interpreter

MA_SCRIPT_INTERPRETER=

WORKSPACE

Training algorithm directory

WORKSPACE=/home/ma-user/modelarts/user-job-dir/code
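
As a hedged illustration, a training script might read the path variables in Table 1 as follows. The helper names job_paths and log_file and the file name train.log are assumptions, not part of ModelArts. The fallback to the current directory exists only so that the script also runs outside a training container.

```python
import os
from pathlib import Path


def job_paths(env=os.environ):
    """Collect the preset path variables, skipping any that are unset or empty."""
    names = ("MA_HOME", "MA_JOB_DIR", "MA_MOUNT_PATH", "MA_LOG_DIR", "WORKSPACE")
    return {name: Path(env[name]) for name in names if env.get(name)}


def log_file(env=os.environ, filename="train.log"):
    """Return a log path under MA_LOG_DIR, or under the current directory if unset."""
    return Path(env.get("MA_LOG_DIR") or ".") / filename


if __name__ == "__main__":
    for name, path in job_paths().items():
        print(f"{name} -> {path}")
    print("Log file:", log_file())
```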

Table 2 Environment variables of a distributed training job

Variable

Description

Example

MA_CURRENT_IP

IP address of a job container.

MA_CURRENT_IP=192.168.23.38

MA_NUM_GPUS

Number of accelerator cards in a job container.

MA_NUM_GPUS=8

MA_TASK_NAME

Name of a job container, for example:

  • worker in MindSpore and PyTorch
  • learner or worker in reinforcement learning engines
  • ps or worker in TensorFlow

MA_TASK_NAME=worker

MA_NUM_HOSTS

Number of instances, which is automatically obtained from the Compute Nodes setting.

MA_NUM_HOSTS=4

VC_TASK_INDEX

Container index, starting from 0. This variable has no effect in single-node training. In multi-node training jobs, you can use it to decide which logic each container runs.

VC_TASK_INDEX=0

VC_WORKER_NUM

Number of instances required for a training job.

VC_WORKER_NUM=4

VC_WORKER_HOSTS

Domain names of all nodes in a multi-node training job, separated by commas (,) and listed in node order. You can obtain the IP address of a node by resolving its domain name.

VC_WORKER_HOSTS=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-0.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-2.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-3.modelarts-job-a0978141-1712-4f9b-8a83-000000000000

${MA_VJ_NAME}-${MA_TASK_NAME}-N.${MA_VJ_NAME}

Communication domain name of a node. For example, the communication domain name of node 0 is ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}.

N indicates the instance index, which ranges from 0 to the number of instances minus 1.

WARNING:

Do not construct communication domain names this way for supernode resource pools.

Instead, read the communication domain names of all nodes directly from VC_WORKER_HOSTS. This works for all resource pools, including supernode ones.

For example, if there are four instances, the communication domain names are as follows:

${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}

${MA_VJ_NAME}-${MA_TASK_NAME}-1.${MA_VJ_NAME}

${MA_VJ_NAME}-${MA_TASK_NAME}-2.${MA_VJ_NAME}

${MA_VJ_NAME}-${MA_TASK_NAME}-3.${MA_VJ_NAME}
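
The variables in Table 2 are typically combined to set up multi-node training. The sketch below shows one possible way to do so. The function names worker_hosts and distributed_layout are hypothetical, and two assumptions are built in: one training process runs per accelerator card, and the first node listed in VC_WORKER_HOSTS serves as the rendezvous host. Adjust both to match your framework.

```python
import os


def worker_hosts(env=os.environ):
    """Split VC_WORKER_HOSTS into an ordered list of node domain names."""
    raw = env.get("VC_WORKER_HOSTS", "")
    return [host.strip() for host in raw.split(",") if host.strip()]


def distributed_layout(env=os.environ):
    """Derive the node rank, node count, and process count for this job."""
    hosts = worker_hosts(env)
    # Prefer MA_NUM_HOSTS; fall back to the host list, then to a single node.
    num_nodes = int(env.get("MA_NUM_HOSTS") or len(hosts) or 1)
    gpus_per_node = int(env.get("MA_NUM_GPUS") or 1)
    return {
        "node_rank": int(env.get("VC_TASK_INDEX") or 0),
        "num_nodes": num_nodes,
        "gpus_per_node": gpus_per_node,
        # Assumption: one training process per accelerator card.
        "world_size": num_nodes * gpus_per_node,
        # Assumption: the first listed node hosts the rendezvous endpoint.
        "master_host": hosts[0] if hosts else "localhost",
    }


if __name__ == "__main__":
    print(distributed_layout())
```

Reading the host list from VC_WORKER_HOSTS rather than building names from the ${MA_VJ_NAME}-${MA_TASK_NAME}-N.${MA_VJ_NAME} pattern keeps the sketch usable on supernode resource pools as well.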

Table 3 NCCL environment variables

Variable

Description

Example

NCCL_VERSION

NCCL version

NCCL_VERSION=2.7.8

NCCL_DEBUG

NCCL log level

NCCL_DEBUG=INFO

NCCL_IB_HCA

InfiniBand NICs to use for communication. A leading caret (^) excludes the listed NICs instead.

NCCL_IB_HCA=^mlx5_bond_0

NCCL_IB_TIMEOUT

InfiniBand transmission timeout interval

NCCL_IB_TIMEOUT=18

NCCL_IB_RETRY_CNT

Maximum number of InfiniBand transmission retries

NCCL_IB_RETRY_CNT=15

NCCL_IB_GID_INDEX

Global ID index used in RoCE mode

NCCL_IB_GID_INDEX=3

NCCL_IB_TC

InfiniBand traffic class

NCCL_IB_TC=128

NCCL_SOCKET_IFNAME

IP interface to use for communication

NCCL_SOCKET_IFNAME=bond0,eth0

NCCL_NET_PLUGIN

Network plug-in used by NCCL

NCCL_NET_PLUGIN=none
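
When troubleshooting communication issues, it can help to record the NCCL settings that the training process actually sees. The following is a small sketch for doing so at the start of a training script. The function name nccl_settings is hypothetical.

```python
import os


def nccl_settings(env=os.environ):
    """Return all NCCL_* variables as a sorted list of (name, value) pairs."""
    return sorted((k, v) for k, v in env.items() if k.startswith("NCCL_"))


if __name__ == "__main__":
    for name, value in nccl_settings():
        print(f"{name}={value}")
```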

Table 4 OBS environment variables

Variable

Description

Example

MA_S3_ENDPOINT

OBS endpoint

N/A

S3_VERIFY_SSL

Whether to verify the SSL certificate when accessing OBS

S3_VERIFY_SSL=0

S3_USE_HTTPS

Whether to use HTTPS to access OBS

S3_USE_HTTPS=1
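
The two flags in Table 4 use 0 and 1 as values. A script that needs them as booleans could convert them as sketched below. The helper names read_flag and obs_settings are hypothetical, and treating 1 as "on" and 0 as "off" is an assumption based on the example values. Unset or unrecognized values are returned as None rather than guessed.

```python
import os


def read_flag(env, name):
    """Map "1" to True and "0" to False; return None if unset or unrecognized."""
    return {"1": True, "0": False}.get(env.get(name, "").strip())


def obs_settings(env=os.environ):
    """Collect the OBS-related variables from Table 4 into one dictionary."""
    return {
        "endpoint": env.get("MA_S3_ENDPOINT"),
        "use_https": read_flag(env, "S3_USE_HTTPS"),
        "verify_ssl": read_flag(env, "S3_VERIFY_SSL"),
    }


if __name__ == "__main__":
    print(obs_settings())
```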

Table 5 Environment variables of the pip source and API Gateway address

Variable

Description

Example

MA_PIP_HOST

Domain name of the pip source

MA_PIP_HOST=repo.example.com

MA_PIP_URL

Address of the pip source

MA_PIP_URL=http://repo.example.com/repository/pypi/simple/

MA_APIGW_ENDPOINT

ModelArts API Gateway address

MA_APIGW_ENDPOINT=https://modelarts.region.xxx.example.com
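
If a training script installs extra packages at startup, it can point pip at the preset source. The sketch below builds such a command using pip's standard --index-url and --trusted-host options. The function name pip_install_command is hypothetical, and whether the preset source must be marked as trusted depends on your environment.

```python
import os
import sys


def pip_install_command(packages, env=os.environ):
    """Build a pip command that uses the preset pip source when it is defined."""
    cmd = [sys.executable, "-m", "pip", "install"]
    url = env.get("MA_PIP_URL")
    host = env.get("MA_PIP_HOST")
    if url:
        cmd += ["--index-url", url]
    if host:
        cmd += ["--trusted-host", host]
    return cmd + list(packages)


if __name__ == "__main__":
    # Print the command only; pass it to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(pip_install_command(["numpy"])))
```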

Table 6 Environment variables of job metadata

Variable

Description

Example

MA_CURRENT_INSTANCE_NAME

Name of the current node for multi-node training

MA_CURRENT_INSTANCE_NAME=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1

Table 7 Precheck environment variables

Variable

Description

Example

MA_SKIP_IMAGE_DETECT

Whether to enable ModelArts precheck. The default value 1 enables precheck, and the value 0 disables it.

It is good practice to enable precheck to detect node and driver faults before they affect services.

1

Table 8 Suspension detection environment variables

Variable

Description

Example

MA_HANG_DETECT_TIME

Suspension detection time. The job is considered suspended if its process I/O does not change for this length of time.

Value range: 10 to 720

Unit: minute

Default value: 30

30
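
The value range and default in Table 8 can be validated before a job is submitted or inside a launcher script. The following is a minimal sketch. The function name hang_detect_minutes is hypothetical, and raising an error on out-of-range values is a design choice of this example, not documented platform behavior.

```python
import os

MIN_MINUTES, MAX_MINUTES, DEFAULT_MINUTES = 10, 720, 30  # values from Table 8


def hang_detect_minutes(env=os.environ):
    """Return the suspension detection time in minutes, validating its range."""
    raw = env.get("MA_HANG_DETECT_TIME")
    if raw is None or raw.strip() == "":
        return DEFAULT_MINUTES
    minutes = int(raw)
    if not MIN_MINUTES <= minutes <= MAX_MINUTES:
        raise ValueError(
            f"MA_HANG_DETECT_TIME must be between {MIN_MINUTES} and "
            f"{MAX_MINUTES} minutes, got {minutes}"
        )
    return minutes
```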

How Do I View Training Environment Variables?

Depending on the service, environment variables are injected either into the container or into individual processes. Variables injected into processes are not visible in Cloud Shell.

Use method 1 to view all environment variables.

  1. View training environment variables using the boot command.

    When creating a training job, select Custom algorithm for Algorithm Type and Custom image for Boot Mode, enter env for Boot Command, and retain default settings for other parameters.

    Figure 1 Boot Command

    After the training job is complete, check the Logs tab on the training job details page. The logs contain information about all environment variables.

    Figure 2 Viewing logs

  2. Check training environment variables using Cloud Shell.

    Run the env command in Cloud Shell to obtain the environment variables.

    This method cannot obtain environment variables that the training platform injects into processes, such as VC_TASK_INDEX, VC_WORKER_NUM, and VC_WORKER_HOSTS in supernode scenarios. This method is not recommended.
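
As an alternative to using env as the boot command, a training script can print its own environment at startup so that the values appear in the training logs. This is a sketch under the assumption that the training script is one of the processes that receives the injected variables. The function name format_environment and the prefix filter are hypothetical. Avoid printing variables that may hold sensitive values.

```python
import os


def format_environment(env=os.environ, prefixes=("MA_", "VC_", "NCCL_")):
    """Return sorted NAME=value lines for variables with the given prefixes."""
    return [f"{k}={v}" for k, v in sorted(env.items()) if k.startswith(prefixes)]


if __name__ == "__main__":
    print("\n".join(format_environment()))
```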