Viewing Environment Variables of a Training Container

What Is an Environment Variable

This section describes environment variables preset in a training container. The environment variables include:

Path environment variables
Environment variables of a distributed training job
NVIDIA Collective Communications Library (NCCL) environment variables
OBS environment variables
Environment variables of the PIP source
Environment variables of the API Gateway address
Environment variables of job metadata
Precheck environment variables

Configuring Environment Variables

When you create a training job, you can add environment variables or override the ones preset in the training container.

Environment Variables Preset in a Training Container

Table 1 through Table 7 list the environment variables preset in a training container.

The values in the Example column are examples only.

**Table 1** Path environment variables
Variable	Description	Example
PATH	Executable file paths	PATH=/usr/local/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
LD_LIBRARY_PATH	Search paths for dynamically loaded libraries	LD_LIBRARY_PATH=/usr/local/seccomponent/lib:/usr/local/cuda/lib64:/usr/local/cuda/compat:/root/miniconda3/lib:/usr/local/lib:/usr/local/nvidia/lib64
LIBRARY_PATH	Static library paths	LIBRARY_PATH=/usr/local/cuda/lib64/stubs
MA_HOME	Main directory of a training job	MA_HOME=/home/ma-user
MA_JOB_DIR	Parent directory of the training algorithm folder	MA_JOB_DIR=/home/ma-user/modelarts/user-job-dir
MA_MOUNT_PATH	Path mounted to a ModelArts training container, which is used to temporarily store training algorithms, algorithm input, algorithm output, and logs	MA_MOUNT_PATH=/home/ma-user/modelarts
MA_LOG_DIR	Training log directory	MA_LOG_DIR=/home/ma-user/modelarts/log
MA_SCRIPT_INTERPRETER	Training script interpreter	MA_SCRIPT_INTERPRETER=
WORKSPACE	Training algorithm directory	WORKSPACE=/home/ma-user/modelarts/user-job-dir/code
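
A training script can read these path variables instead of hard-coding directories. The following is a minimal sketch, not a prescribed approach: the helper name `training_paths` is hypothetical, and the fallback values are simply the example values from Table 1, which may differ in your container.

```python
import os
from pathlib import Path


def training_paths(env=os.environ):
    """Collect the preset path variables.

    Fallbacks mirror the example values in Table 1 (an assumption,
    useful mainly when running the script outside a training container).
    """
    mount = Path(env.get("MA_MOUNT_PATH", "/home/ma-user/modelarts"))
    return {
        "job_dir": Path(env.get("MA_JOB_DIR", mount / "user-job-dir")),
        "code_dir": Path(env.get("WORKSPACE", mount / "user-job-dir" / "code")),
        "log_dir": Path(env.get("MA_LOG_DIR", mount / "log")),
    }


if __name__ == "__main__":
    for name, path in training_paths().items():
        print(f"{name}: {path}")
```

Passing the environment as a parameter keeps the helper easy to test with a plain dictionary.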

**Table 2** Environment variables of a distributed training job
Variable	Description	Example
MA_CURRENT_IP	IP address of a job container.	MA_CURRENT_IP=192.168.23.38
MA_NUM_GPUS	Number of accelerator cards in a job container.	MA_NUM_GPUS=8
MA_TASK_NAME	Name of a job container, for example: worker in MindSpore and PyTorch; learner or worker in reinforcement learning engines; ps or worker in TensorFlow.	MA_TASK_NAME=worker
MA_NUM_HOSTS	Number of compute nodes, which is automatically obtained from the Compute Nodes setting of the training job.	MA_NUM_HOSTS=4
VC_TASK_INDEX	Container index, starting from 0. This variable does not apply to single-node training. In multi-node training jobs, you can use it to decide which logic each container runs.	VC_TASK_INDEX=0
VC_WORKER_NUM	Number of compute nodes required for a training job.	VC_WORKER_NUM=4
VC_WORKER_HOSTS	Domain names of all nodes in a multi-node training job, separated by commas (,) and listed in node order. You can resolve a domain name to obtain the IP address of the node.	VC_WORKER_HOSTS=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-0.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-2.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-3.modelarts-job-a0978141-1712-4f9b-8a83-000000000000
${MA_VJ_NAME}-${MA_TASK_NAME}-N.${MA_VJ_NAME}	Communication domain name of a node, where N is the node index, ranging from 0 to the number of compute nodes minus 1. For example, the communication domain name of node 0 is ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}.	For example, if there are four compute nodes, the communication domain names are as follows: ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME} ${MA_VJ_NAME}-${MA_TASK_NAME}-1.${MA_VJ_NAME} ${MA_VJ_NAME}-${MA_TASK_NAME}-2.${MA_VJ_NAME} ${MA_VJ_NAME}-${MA_TASK_NAME}-3.${MA_VJ_NAME}
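
The variables in Table 2 are typically combined to work out where a container sits within a multi-node job. The sketch below shows one possible way to do this. It assumes one training process per accelerator card and treats node 0 as the rendezvous node; neither assumption comes from the platform, and the helper names `distributed_config` and `node_domain` are hypothetical.

```python
import os


def distributed_config(env=os.environ):
    """Derive node rank and world size from the preset variables.

    Assumptions (not mandated by the platform): one process per
    accelerator card, and the first host in VC_WORKER_HOSTS acts as
    the rendezvous node.
    """
    hosts = [h for h in env.get("VC_WORKER_HOSTS", "").split(",") if h]
    num_nodes = int(env.get("MA_NUM_HOSTS", len(hosts) or 1))
    node_rank = int(env.get("VC_TASK_INDEX", 0))
    cards_per_node = int(env.get("MA_NUM_GPUS", 1))
    return {
        "master_addr": hosts[0] if hosts else "127.0.0.1",
        "node_rank": node_rank,
        "num_nodes": num_nodes,
        "cards_per_node": cards_per_node,
        "world_size": num_nodes * cards_per_node,
        # Global rank of the first process on this node.
        "first_global_rank": node_rank * cards_per_node,
    }


def node_domain(index, env=os.environ):
    """Build the communication domain name of the node with the given index."""
    job, task = env["MA_VJ_NAME"], env["MA_TASK_NAME"]
    return f"{job}-{task}-{index}.{job}"
```

For a job with four nodes of eight cards each, the container with `VC_TASK_INDEX=1` would get a world size of 32 and start its global ranks at 8 under these assumptions.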

**Table 3** NCCL environment variables
Variable	Description	Example
NCCL_VERSION	NCCL version	NCCL_VERSION=2.7.8
NCCL_DEBUG	NCCL log level	NCCL_DEBUG=INFO
NCCL_IB_HCA	InfiniBand NIC to use for communication	NCCL_IB_HCA=^mlx5_bond_0
NCCL_SOCKET_IFNAME	IP interface to use for communication	NCCL_SOCKET_IFNAME=bond0,eth0
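
When debugging communication problems, it can help to log the NCCL settings at startup. The sketch below is illustrative only, with hypothetical helper names. It relies on the NCCL convention that a leading `^` in `NCCL_IB_HCA` or `NCCL_SOCKET_IFNAME` excludes the listed devices rather than selecting them; other prefixes that NCCL supports are not handled here.

```python
import os


def nccl_settings(env=os.environ):
    """Return all NCCL_* variables, sorted by name, for startup logging."""
    return dict(sorted((k, v) for k, v in env.items() if k.startswith("NCCL_")))


def parse_nccl_filter(value):
    """Split a filter such as '^mlx5_bond_0' into (excluded, names).

    excluded is True when the value starts with '^', meaning the listed
    devices are to be avoided rather than used.
    """
    excluded = value.startswith("^")
    names = [n for n in value.lstrip("^").split(",") if n]
    return excluded, names


if __name__ == "__main__":
    for key, value in nccl_settings().items():
        print(f"{key}={value}")
```

With the example values in Table 3, `NCCL_IB_HCA=^mlx5_bond_0` is an exclusion filter, while `NCCL_SOCKET_IFNAME=bond0,eth0` selects two interfaces.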

**Table 4** OBS environment variables
Variable	Description	Example
S3_ENDPOINT	OBS endpoint	-
S3_VERIFY_SSL	Whether to verify the SSL certificate when accessing OBS	S3_VERIFY_SSL=0
S3_USE_HTTPS	Whether to use HTTPS to access OBS	S3_USE_HTTPS=1

**Table 5** Environment variables of the PIP source and API Gateway address
Variable	Description	Example
MA_PIP_HOST	Domain name of the PIP source	MA_PIP_HOST=repo.myhuaweicloud.com
MA_PIP_URL	Address of the PIP source	MA_PIP_URL=http://repo.myhuaweicloud.com/repository/pypi/simple/
MA_APIGW_ENDPOINT	ModelArts API Gateway address	MA_APIGW_ENDPOINT=https://modelarts.region.cn-east-3.myhuaweicloud.com
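
If a training script installs extra packages at startup, it can point pip at the preset source. The following sketch builds such a command. The helper name `pip_install_command` is hypothetical, and passing `MA_PIP_HOST` as a trusted host is an assumption that suits a plain `http://` source like the one in the example.

```python
import os
import sys


def pip_install_command(packages, env=os.environ):
    """Build a pip command that uses the preset PIP source when available."""
    cmd = [sys.executable, "-m", "pip", "install"]
    url = env.get("MA_PIP_URL")
    if url:
        cmd += ["--index-url", url]
        host = env.get("MA_PIP_HOST")
        if host:
            # Assumption: the source is served over HTTP, so pip must
            # be told to trust the host explicitly.
            cmd += ["--trusted-host", host]
    return cmd + list(packages)
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. When neither variable is set, the command falls back to the default pip configuration.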

**Table 6** Environment variables of job metadata
Variable	Description	Example
MA_CURRENT_INSTANCE_NAME	Name of the current node for multi-node training	MA_CURRENT_INSTANCE_NAME=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1

**Table 7** Precheck environment variables
Variable	Description	Example
MA_SKIP_IMAGE_DETECT	Whether to enable ModelArts precheck. The default value 1 enables precheck; the value 0 disables it. It is a good practice to keep precheck enabled so that node and driver faults are detected before they affect services.	MA_SKIP_IMAGE_DETECT=1