Viewing Environment Variables of a Training Container
What Is an Environment Variable
This section describes the environment variables preset in a training container. They fall into the following groups:
- Path environment variables
- Environment variables of a distributed training job
- NVIDIA Collective Communications Library (NCCL) environment variables
- OBS environment variables
- Environment variables of the PIP source
- Environment variables of the API Gateway address
- Environment variables of job metadata
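To check which of these variables are actually set inside a running container, a training script can filter its own environment. The following is a minimal sketch, not an official ModelArts tool: the prefix list and the helper name `preset_variables` are assumptions chosen to match the variable names shown in this section.

```python
import os

# Assumed prefixes and names, taken from the variables listed in this section.
# This is an illustrative filter, not an official ModelArts list.
PRESET_PREFIXES = ("MA_", "VC_", "NCCL_", "S3_")
PRESET_NAMES = ("PATH", "LD_LIBRARY_PATH", "LIBRARY_PATH", "WORKSPACE")


def preset_variables(env=None):
    """Return the variables in env that look like preset training variables."""
    env = os.environ if env is None else env
    return {
        name: value
        for name, value in sorted(env.items())
        if name.startswith(PRESET_PREFIXES) or name in PRESET_NAMES
    }


if __name__ == "__main__":
    # Print the matching variables in NAME=value form, sorted by name.
    for name, value in preset_variables().items():
        print(f"{name}={value}")
```

Passing a dictionary instead of relying on `os.environ` makes the helper easy to try outside a training container.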
Configuring Environment Variables
When you create a training job, you can add environment variables or modify those preset in the training container.
Environment Variables Preset in a Training Container
The following tables list environment variables preset in a training container, including Table 1, Table 2, Table 3, Table 4, Table 5, Table 6, and Table 7.
The environment variable values are examples.

Variable | Description | Example |
---|---|---|
PATH | Executable file paths | PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin |
LD_LIBRARY_PATH | Dynamic load library paths | LD_LIBRARY_PATH=/usr/local/seccomponent/lib:/usr/local/cuda/lib64:/usr/local/cuda/compat:/root/miniconda3/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 |
LIBRARY_PATH | Static library paths | LIBRARY_PATH=/usr/local/cuda/lib64/stubs |
MA_HOME | Main directory of a training job | MA_HOME=/home/ma-user |
MA_JOB_DIR | Parent directory of the training algorithm folder | MA_JOB_DIR=/home/ma-user/modelarts/user-job-dir |
MA_MOUNT_PATH | Path mounted to a ModelArts training container, which is used to temporarily store training algorithms, algorithm input, algorithm output, and logs | MA_MOUNT_PATH=/home/ma-user/modelarts |
MA_LOG_DIR | Training log directory | MA_LOG_DIR=/home/ma-user/modelarts/log |
MA_SCRIPT_INTERPRETER | Training script interpreter | MA_SCRIPT_INTERPRETER= |
WORKSPACE | Training algorithm directory | WORKSPACE=/home/ma-user/modelarts/user-job-dir/code |
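A training script would typically read these directories from the environment rather than hard-coding them. The sketch below shows one way to do that. The helper name `training_paths` is hypothetical, and the fallback values are assumptions copied from the example column above, so they may not match a real container.

```python
import os
from pathlib import PurePosixPath


def training_paths(env=None):
    """Resolve the code, job, and log directories from the environment.

    Fallbacks mirror the example values in the table and are assumptions.
    """
    env = os.environ if env is None else env
    mount = PurePosixPath(env.get("MA_MOUNT_PATH", "/home/ma-user/modelarts"))
    return {
        # Training algorithm directory
        "code": PurePosixPath(env.get("WORKSPACE", mount / "user-job-dir" / "code")),
        # Parent directory of the training algorithm folder
        "job_dir": PurePosixPath(env.get("MA_JOB_DIR", mount / "user-job-dir")),
        # Training log directory
        "log_dir": PurePosixPath(env.get("MA_LOG_DIR", mount / "log")),
    }


if __name__ == "__main__":
    for name, path in training_paths().items():
        print(f"{name}: {path}")
```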

Variable | Description | Example |
---|---|---|
MA_CURRENT_IP | IP address of a job container. | MA_CURRENT_IP=192.168.23.38 |
MA_NUM_GPUS | Number of accelerator cards in a job container. | MA_NUM_GPUS=8 |
MA_TASK_NAME | Name of a job container. | MA_TASK_NAME=worker |
MA_NUM_HOSTS | Number of compute nodes required for a training job. | MA_NUM_HOSTS=4 |
VC_TASK_INDEX | Sequence number of a job container for multi-node training. The value of the first container is 0. | VC_TASK_INDEX=0 |
VC_WORKER_NUM | Number of compute nodes required for a training job. | VC_WORKER_NUM=4 |
VC_WORKER_HOSTS | Domain name of each node for multi-node training. Domain names are listed in sequence and separated by commas (,). You can obtain the IP address through domain name resolution. | VC_WORKER_HOSTS=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-0.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-2.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-3.modelarts-job-a0978141-1712-4f9b-8a83-000000000000 |
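The variables above are enough to work out a node's position in a multi-node job. The following sketch derives a rendezvous address and rank layout from them. It rests on two assumptions that the table does not state: that the first entry in `VC_WORKER_HOSTS` can serve as the master node, and that each node runs one process per accelerator card. The helper name `distributed_layout` is hypothetical.

```python
import os


def distributed_layout(env=None):
    """Derive a rank layout from the distributed training variables.

    Assumes the first host in VC_WORKER_HOSTS acts as the master node and
    that one process is launched per accelerator card.
    """
    env = os.environ if env is None else env
    hosts = [h for h in env.get("VC_WORKER_HOSTS", "").split(",") if h]
    num_nodes = int(env.get("MA_NUM_HOSTS", len(hosts) or 1))
    node_rank = int(env.get("VC_TASK_INDEX", 0))
    gpus_per_node = int(env.get("MA_NUM_GPUS", 1))
    return {
        "master_addr": hosts[0] if hosts else "127.0.0.1",
        "node_rank": node_rank,
        "num_nodes": num_nodes,
        "gpus_per_node": gpus_per_node,
        # Total number of processes across all nodes
        "world_size": num_nodes * gpus_per_node,
        # Global rank of the first process on this node
        "first_global_rank": node_rank * gpus_per_node,
    }


if __name__ == "__main__":
    print(distributed_layout())
```

The resulting values could then be passed to whichever launcher the training framework provides. The mapping to a specific launcher is left out here because it depends on the framework.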

Variable | Description | Example |
---|---|---|
NCCL_VERSION | NCCL version | NCCL_VERSION=2.7.8 |
NCCL_DEBUG | NCCL log level | NCCL_DEBUG=INFO |
NCCL_IB_HCA | InfiniBand NIC to use for communication | NCCL_IB_HCA=^mlx5_bond_0 |
NCCL_SOCKET_IFNAME | IP interface to use for communication | NCCL_SOCKET_IFNAME=bond0,eth0 |
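In NCCL, the values of `NCCL_IB_HCA` and `NCCL_SOCKET_IFNAME` are comma-separated name lists in which a leading `^` excludes the listed names and a leading `=` requests an exact match instead of a prefix match. The example `NCCL_IB_HCA=^mlx5_bond_0` therefore excludes devices whose names start with `mlx5_bond_0`. The sketch below only decodes that syntax for logging or debugging. The helper name `parse_nccl_selector` is hypothetical and is not part of NCCL.

```python
def parse_nccl_selector(value):
    """Decode an NCCL_IB_HCA or NCCL_SOCKET_IFNAME value.

    Returns the selection mode ("include" or "exclude"), whether names must
    match exactly, and the list of names.
    """
    mode = "include"
    exact = False
    if value.startswith("^"):
        # A leading caret turns the list into an exclusion list.
        mode, value = "exclude", value[1:]
    if value.startswith("="):
        # A leading equals sign requests exact rather than prefix matching.
        exact, value = True, value[1:]
    names = [name for name in value.split(",") if name]
    return {"mode": mode, "exact": exact, "names": names}


if __name__ == "__main__":
    print(parse_nccl_selector("^mlx5_bond_0"))
    print(parse_nccl_selector("bond0,eth0"))
```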

Variable | Description | Example |
---|---|---|
S3_ENDPOINT | OBS endpoint | - |
S3_VERIFY_SSL | Whether to verify the SSL certificate when accessing OBS | S3_VERIFY_SSL=0 |
S3_USE_HTTPS | Whether to use HTTPS to access OBS | S3_USE_HTTPS=1 |
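Code that talks to OBS can turn these flags into connection settings. The sketch below is illustrative only and makes several assumptions: that `1` means enabled and any other value means disabled, that `S3_VERIFY_SSL` controls certificate verification, that both flags default to enabled when unset, and that `S3_ENDPOINT` may be given without a scheme. The helper name `obs_connection_settings` is hypothetical.

```python
import os


def obs_connection_settings(env=None):
    """Translate the OBS variables into connection settings.

    Treats "1" as enabled and anything else as disabled (an assumption).
    """
    env = os.environ if env is None else env
    use_https = env.get("S3_USE_HTTPS", "1") == "1"
    verify_ssl = env.get("S3_VERIFY_SSL", "1") == "1"
    endpoint = env.get("S3_ENDPOINT", "")
    scheme = "https" if use_https else "http"
    # Add a scheme only when the endpoint does not already carry one.
    if endpoint and "://" not in endpoint:
        endpoint = f"{scheme}://{endpoint}"
    return {
        "endpoint_url": endpoint,
        "use_https": use_https,
        "verify_ssl": verify_ssl,
    }


if __name__ == "__main__":
    print(obs_connection_settings())
```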

Variable | Description | Example |
---|---|---|
MA_PIP_HOST | Domain name of the PIP source | MA_PIP_HOST=repo.myhuaweicloud.com |
MA_PIP_URL | Address of the PIP source | MA_PIP_URL=http://repo.myhuaweicloud.com/repository/pypi/simple/ |
MA_APIGW_ENDPOINT | ModelArts API Gateway address | MA_APIGW_ENDPOINT=https://modelarts.region.cn-east-3.myhuaweicloud.com |
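If a training script needs to install extra packages at startup, it can point pip at the preset source. The following sketch builds such a command with pip's standard `--index-url` and `--trusted-host` options. Whether a given container already configures pip this way is not stated here, so treat the sketch as an assumption. The helper name `pip_install_command` is hypothetical.

```python
import os
import subprocess
import sys


def pip_install_command(packages, env=None):
    """Build a pip install command that uses the preset PIP source, if set."""
    env = os.environ if env is None else env
    command = [sys.executable, "-m", "pip", "install"]
    url = env.get("MA_PIP_URL")
    host = env.get("MA_PIP_HOST")
    if url:
        command += ["--index-url", url]
    if host:
        # The example source uses plain HTTP, so pip needs the host trusted.
        command += ["--trusted-host", host]
    return command + list(packages)


if __name__ == "__main__":
    # Hypothetical usage: install the packages named on the command line.
    if len(sys.argv) > 1:
        subprocess.run(pip_install_command(sys.argv[1:]), check=True)
```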

Variable | Description | Example |
---|---|---|
MA_CURRENT_INSTANCE_NAME | Name of the current node for multi-node training | MA_CURRENT_INSTANCE_NAME=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1 |

Variable | Description | Example |
---|---|---|
MA_SKIP_IMAGE_DETECT | Whether to enable ModelArts precheck. The default value is 1, which indicates that precheck is enabled; the value 0 indicates that precheck is disabled. It is a good practice to keep precheck enabled so that node and driver faults are detected before they affect services. | MA_SKIP_IMAGE_DETECT=1 |