Updated on 2024-07-25 GMT+08:00

Viewing Environment Variables of a Training Container

What Is an Environment Variable

This section describes environment variables preset in a training container. The environment variables include:

  • Path environment variables
  • Environment variables of a distributed training job
  • NVIDIA Collective Communications Library (NCCL) environment variables
  • OBS environment variables
  • Environment variables of the PIP source
  • Environment variables of the API Gateway address
  • Environment variables of job metadata
  • Precheck environment variables

Configuring Environment Variables

When you create a training job, you can add environment variables or modify the ones preset in the training container.

Environment Variables Preset in a Training Container

Table 1 through Table 7 list the environment variables preset in a training container.

The environment variable values are examples.

Table 1 Path environment variables

| Variable | Description | Example |
| --- | --- | --- |
| PATH | Executable file paths | PATH=/usr/local/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin |
| LD_LIBRARY_PATH | Dynamic load library paths | LD_LIBRARY_PATH=/usr/local/seccomponent/lib:/usr/local/cuda/lib64:/usr/local/cuda/compat:/root/miniconda3/lib:/usr/local/lib:/usr/local/nvidia/lib64 |
| LIBRARY_PATH | Static library paths | LIBRARY_PATH=/usr/local/cuda/lib64/stubs |
| MA_HOME | Main directory of a training job | MA_HOME=/home/ma-user |
| MA_JOB_DIR | Parent directory of the training algorithm folder | MA_JOB_DIR=/home/ma-user/modelarts/user-job-dir |
| MA_MOUNT_PATH | Path mounted to a ModelArts training container, which is used to temporarily store training algorithms, algorithm input, algorithm output, and logs | MA_MOUNT_PATH=/home/ma-user/modelarts |
| MA_LOG_DIR | Training log directory | MA_LOG_DIR=/home/ma-user/modelarts/log |
| MA_SCRIPT_INTERPRETER | Training script interpreter | MA_SCRIPT_INTERPRETER= |
| WORKSPACE | Training algorithm directory | WORKSPACE=/home/ma-user/modelarts/user-job-dir/code |
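
A training script can read the path variables in Table 1 instead of hard-coding directories. The following is a minimal sketch, not an official ModelArts helper: the function name training_paths is hypothetical, and the fallback values are simply the example values from Table 1, which may differ in your container.

```python
import os
from pathlib import Path


def training_paths(env=os.environ):
    """Collect some of the Table 1 path variables as Path objects.

    `env` is any mapping of variable names to values; it defaults to the
    process environment so the function can be exercised with a plain dict.
    """
    # Fallback is the example value from Table 1 (an assumption, not a guarantee).
    mount = Path(env.get("MA_MOUNT_PATH", "/home/ma-user/modelarts"))
    return {
        "mount": mount,
        # If a variable is missing, derive a default relative to the mount path.
        "job_dir": Path(env.get("MA_JOB_DIR", str(mount / "user-job-dir"))),
        "log_dir": Path(env.get("MA_LOG_DIR", str(mount / "log"))),
    }


if __name__ == "__main__":
    for name, path in training_paths().items():
        print(f"{name}: {path}")
```

Passing the environment as a parameter keeps the lookup easy to test outside a training container.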

Table 2 Environment variables of a distributed training job

| Variable | Description | Example |
| --- | --- | --- |
| MA_CURRENT_IP | IP address of a job container. | MA_CURRENT_IP=192.168.23.38 |
| MA_NUM_GPUS | Number of accelerator cards in a job container. | MA_NUM_GPUS=8 |
| MA_TASK_NAME | Name of a job container, for example: worker in MindSpore and PyTorch; learner or worker in reinforcement learning engines; ps or worker in TensorFlow. | MA_TASK_NAME=worker |
| MA_NUM_HOSTS | Number of compute nodes, which is obtained automatically from the Compute Nodes parameter. | MA_NUM_HOSTS=4 |
| VC_TASK_INDEX | Container index, starting from 0. This variable is not meaningful for single-node training. In multi-node training jobs, you can use it to decide which logic the container runs. | VC_TASK_INDEX=0 |
| VC_WORKER_NUM | Number of compute nodes required for a training job. | VC_WORKER_NUM=4 |
| VC_WORKER_HOSTS | Domain name of each node for multi-node training. The domain names are listed in node order and separated by commas (,). You can obtain the IP address of a node through domain name resolution. | VC_WORKER_HOSTS=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-0.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-2.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-3.modelarts-job-a0978141-1712-4f9b-8a83-000000000000 |
| ${MA_VJ_NAME}-${MA_TASK_NAME}-N.${MA_VJ_NAME} | Communication domain name of a node. N is the node index, ranging from 0 to the number of compute nodes minus 1. For example, the communication domain name of node 0 is ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}. | With four compute nodes, the communication domain names are ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}, ${MA_VJ_NAME}-${MA_TASK_NAME}-1.${MA_VJ_NAME}, ${MA_VJ_NAME}-${MA_TASK_NAME}-2.${MA_VJ_NAME}, and ${MA_VJ_NAME}-${MA_TASK_NAME}-3.${MA_VJ_NAME} |
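
The variables in Table 2 are typically combined to work out where a container sits in a multi-node job. The sketch below shows one way to do that. It assumes one training process per accelerator card and treats the first entry of VC_WORKER_HOSTS as the rendezvous (master) address; both are conventions chosen for illustration rather than ModelArts requirements, and the function name distributed_layout is hypothetical.

```python
import os


def distributed_layout(env=os.environ):
    """Derive node rank, world size, and a master address from Table 2 variables."""
    # VC_WORKER_HOSTS is a comma-separated list of node domain names.
    hosts = [h for h in env.get("VC_WORKER_HOSTS", "").split(",") if h]
    node_rank = int(env.get("VC_TASK_INDEX", "0"))
    cards_per_node = int(env.get("MA_NUM_GPUS", "1"))
    # Fall back to the host count (at least 1) when MA_NUM_HOSTS is not set.
    num_nodes = int(env.get("MA_NUM_HOSTS", str(max(len(hosts), 1))))
    return {
        # Assumption: node 0 acts as the rendezvous point.
        "master_addr": hosts[0] if hosts else "127.0.0.1",
        "node_rank": node_rank,
        "num_nodes": num_nodes,
        # Assumption: one process per accelerator card.
        "world_size": num_nodes * cards_per_node,
        "first_global_rank": node_rank * cards_per_node,
    }


if __name__ == "__main__":
    print(distributed_layout())
```

On a single node where none of these variables carry useful values, the function falls back to a one-process layout on 127.0.0.1.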

Table 3 NCCL environment variables

| Variable | Description | Example |
| --- | --- | --- |
| NCCL_VERSION | NCCL version | NCCL_VERSION=2.7.8 |
| NCCL_DEBUG | NCCL log level | NCCL_DEBUG=INFO |
| NCCL_IB_HCA | InfiniBand NIC to use for communication | NCCL_IB_HCA=^mlx5_bond_0 |
| NCCL_SOCKET_IFNAME | IP interface to use for communication | NCCL_SOCKET_IFNAME=bond0,eth0 |
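
In the NCCL_IB_HCA example in Table 3, the leading caret (^) marks the listed devices as excluded rather than selected. The following sketch shows how a diagnostic script might interpret such a value. The function name parse_nccl_ib_hca is hypothetical, and the parsing is deliberately simplified: it handles only the caret prefix and comma-separated names, so any other filter forms that NCCL accepts are passed through unchanged as names.

```python
def parse_nccl_ib_hca(value):
    """Split an NCCL_IB_HCA value into a (mode, names) pair.

    mode is "exclude" when the value starts with "^", otherwise "include".
    """
    mode = "include"
    if value.startswith("^"):
        mode = "exclude"
        value = value[1:]
    # Drop empty entries so a trailing comma does not produce a blank name.
    names = [name for name in value.split(",") if name]
    return mode, names


if __name__ == "__main__":
    import os

    print(parse_nccl_ib_hca(os.environ.get("NCCL_IB_HCA", "")))
```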

Table 4 OBS environment variables

| Variable | Description | Example |
| --- | --- | --- |
| S3_ENDPOINT | OBS endpoint | - |
| S3_VERIFY_SSL | Whether to verify the SSL certificate when accessing OBS | S3_VERIFY_SSL=0 |
| S3_USE_HTTPS | Whether to use HTTPS to access OBS | S3_USE_HTTPS=1 |
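
The OBS variables in Table 4 can be turned into connection settings for whatever storage client a script uses. The sketch below is illustrative only. It assumes that the value 1 means enabled and any other value means disabled, that S3_VERIFY_SSL controls certificate verification as its name suggests, and that S3_ENDPOINT may be given without a scheme. The function name obs_connection_settings and the host obs.example.com in the usage line are hypothetical.

```python
import os


def obs_connection_settings(env=os.environ):
    """Interpret the Table 4 variables as generic connection settings."""
    # Assumption: "1" means enabled; both default to enabled when unset.
    use_https = env.get("S3_USE_HTTPS", "1") == "1"
    verify_ssl = env.get("S3_VERIFY_SSL", "1") == "1"
    endpoint = env.get("S3_ENDPOINT", "")
    # Add a scheme only when the endpoint value does not already carry one.
    if endpoint and "://" not in endpoint:
        endpoint = ("https://" if use_https else "http://") + endpoint
    return {
        "endpoint_url": endpoint,
        "use_https": use_https,
        "verify_ssl": verify_ssl,
    }


if __name__ == "__main__":
    # Hypothetical endpoint, used only to show the shape of the result.
    print(obs_connection_settings({"S3_ENDPOINT": "obs.example.com"}))
```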

Table 5 Environment variables of the PIP source and API Gateway address

| Variable | Description | Example |
| --- | --- | --- |
| MA_PIP_HOST | Domain name of the PIP source | MA_PIP_HOST=repo.myhuaweicloud.com |
| MA_PIP_URL | Address of the PIP source | MA_PIP_URL=http://repo.myhuaweicloud.com/repository/pypi/simple/ |
| MA_APIGW_ENDPOINT | ModelArts API Gateway address | MA_APIGW_ENDPOINT=https://modelarts.region.cn-east-3.myhuaweicloud.com |
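
If a training script needs to install extra packages at startup, the PIP source variables in Table 5 can be passed to pip explicitly. The following sketch builds the command line. It relies on the standard pip options --index-url and --trusted-host; the function name pip_install_command is hypothetical, and whether the container already configures pip with these values is not stated in this section.

```python
import os
import sys


def pip_install_command(packages, env=os.environ):
    """Build a `pip install` argument list that uses the preset PIP source."""
    cmd = [sys.executable, "-m", "pip", "install"]
    url = env.get("MA_PIP_URL")
    host = env.get("MA_PIP_HOST")
    if url:
        cmd += ["--index-url", url]
    if host:
        # The example source uses plain HTTP, so the host is marked as trusted.
        cmd += ["--trusted-host", host]
    return cmd + list(packages)


if __name__ == "__main__":
    # Print the command instead of running it; pass the list to
    # subprocess.run(cmd, check=True) to perform the installation.
    print(pip_install_command(["numpy"]))
```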

Table 6 Environment variables of job metadata

| Variable | Description | Example |
| --- | --- | --- |
| MA_CURRENT_INSTANCE_NAME | Name of the current node for multi-node training | MA_CURRENT_INSTANCE_NAME=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1 |
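
The example value of MA_CURRENT_INSTANCE_NAME ends with the task name and node index, which can be handy for tagging logs per node. The sketch below extracts those parts. It assumes the name always follows the pattern job name, task name, index, joined by hyphens, as in the example above; the function name parse_instance_name is hypothetical.

```python
import re

# Assumed pattern: <job name>-<task name>-<index>, e.g. "...-worker-1".
_INSTANCE_RE = re.compile(r"(?P<job>.+)-(?P<task>[A-Za-z]+)-(?P<index>\d+)")


def parse_instance_name(name):
    """Return (job_name, task_name, index), or None if the pattern does not match."""
    match = _INSTANCE_RE.fullmatch(name)
    if match is None:
        return None
    return match.group("job"), match.group("task"), int(match.group("index"))


if __name__ == "__main__":
    import os

    print(parse_instance_name(os.environ.get("MA_CURRENT_INSTANCE_NAME", "")))
```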

Table 7 Precheck environment variables

| Variable | Description | Example |
| --- | --- | --- |
| MA_SKIP_IMAGE_DETECT | Whether to enable ModelArts precheck. The default value is 1, which indicates that precheck is enabled; the value 0 indicates that precheck is disabled. It is a good practice to enable precheck to detect node and driver faults before they affect services. | MA_SKIP_IMAGE_DETECT=1 |