Viewing Environment Variables of a Training Container
What Is an Environment Variable
This section describes environment variables preset in a training container. The environment variables include:
- Path environment variables
- Environment variables of a distributed training job
- NVIDIA Collective Communications Library (NCCL) environment variables
- OBS environment variables
- Environment variables of the PIP source
- Environment variables of the API Gateway address
- Environment variables of job metadata
Configuring Environment Variables
When you create a training job, you can add environment variables or modify environment variables preset in the training container.
Environment Variables Preset in a Training Container
Table 1 through Table 7 list the environment variables preset in a training container.
The values shown are examples only.

| Variable | Description | Example |
| --- | --- | --- |
| PATH | Executable file paths | PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin |
| LD_LIBRARY_PATH | Dynamic load library paths | LD_LIBRARY_PATH=/usr/local/seccomponent/lib:/usr/local/cuda/lib64:/usr/local/cuda/compat:/root/miniconda3/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 |
| LIBRARY_PATH | Static library paths | LIBRARY_PATH=/usr/local/cuda/lib64/stubs |
| MA_HOME | Main directory of a training job | MA_HOME=/home/ma-user |
| MA_JOB_DIR | Parent directory of the training algorithm folder | MA_JOB_DIR=/home/ma-user/modelarts/user-job-dir |
| MA_MOUNT_PATH | Path mounted to a ModelArts training container, which is used to temporarily store training algorithms, algorithm input, algorithm output, and logs | MA_MOUNT_PATH=/home/ma-user/modelarts |
| MA_LOG_DIR | Training log directory | MA_LOG_DIR=/home/ma-user/modelarts/log |
| MA_SCRIPT_INTERPRETER | Training script interpreter | MA_SCRIPT_INTERPRETER= |
| WORKSPACE | Training algorithm directory | WORKSPACE=/home/ma-user/modelarts/user-job-dir/code |
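
As an illustration of how a training script might consume the path variables above, the following sketch collects them into path objects. The function name `resolve_job_paths` is hypothetical, and the fallback values are simply the example values from the table; a real container may use different paths.

```python
import os
from pathlib import PurePosixPath


def resolve_job_paths(env=os.environ):
    """Collect preset path variables into path objects.

    Fallbacks are the example values from the table above (an assumption);
    they are used only when a variable is not set.
    """
    mount = PurePosixPath(env.get("MA_MOUNT_PATH", "/home/ma-user/modelarts"))
    return {
        "mount": mount,
        # Derive the job and log directories from the mount path when unset.
        "job_dir": PurePosixPath(env.get("MA_JOB_DIR", str(mount / "user-job-dir"))),
        "log_dir": PurePosixPath(env.get("MA_LOG_DIR", str(mount / "log"))),
    }
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try outside a container, for example `resolve_job_paths({"MA_MOUNT_PATH": "/tmp/ma"})`.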

| Variable | Description | Example |
| --- | --- | --- |
| MA_CURRENT_IP | IP address of a job container. | MA_CURRENT_IP=192.168.23.38 |
| MA_NUM_GPUS | Number of accelerator cards in a job container. | MA_NUM_GPUS=8 |
| MA_TASK_NAME | Name of a job container. | MA_TASK_NAME=worker |
| MA_NUM_HOSTS | Number of compute nodes required for a training job. | MA_NUM_HOSTS=4 |
| VC_TASK_INDEX | Sequence number of a job container for multi-node training. The value of the first container is 0. | VC_TASK_INDEX=0 |
| VC_WORKER_NUM | Number of compute nodes required for a training job. | VC_WORKER_NUM=4 |
| VC_WORKER_HOSTS | Domain name of each node for multi-node training. The domain names are listed in sequence and separated by commas (,). You can obtain the IP address of a node through domain name resolution. | VC_WORKER_HOSTS=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-0.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-2.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-3.modelarts-job-a0978141-1712-4f9b-8a83-000000000000 |
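
The distributed-training variables above can be combined to describe the cluster from inside a container. The sketch below is one possible way to do so; `describe_cluster` is a hypothetical helper, and two choices in it are assumptions rather than documented behavior: the first entry of `VC_WORKER_HOSTS` is treated as the rendezvous address, and one process per accelerator card is assumed when computing the world size.

```python
import os


def describe_cluster(env=os.environ):
    """Summarize the multi-node layout from the preset variables."""
    # VC_WORKER_HOSTS is a comma-separated list of node domain names.
    hosts = [h for h in env.get("VC_WORKER_HOSTS", "").split(",") if h]
    num_hosts = int(env.get("MA_NUM_HOSTS", len(hosts) or 1))
    cards_per_host = int(env.get("MA_NUM_GPUS", 1))
    node_rank = int(env.get("VC_TASK_INDEX", 0))  # first container is 0
    return {
        # Assumption: use the first listed node as the rendezvous address.
        "master_addr": hosts[0] if hosts else "127.0.0.1",
        "node_rank": node_rank,
        "num_hosts": num_hosts,
        "cards_per_host": cards_per_host,
        # Assumption: one training process per accelerator card.
        "world_size": num_hosts * cards_per_host,
        "first_global_rank": node_rank * cards_per_host,
    }
```

With the example values from the table (4 nodes, 8 cards each, `VC_TASK_INDEX=0`), this layout gives a world size of 32, and the current node owns global ranks starting at 0.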

| Variable | Description | Example |
| --- | --- | --- |
| NCCL_VERSION | NCCL version | NCCL_VERSION=2.7.8 |
| NCCL_DEBUG | NCCL log level | NCCL_DEBUG=INFO |
| NCCL_IB_HCA | InfiniBand NIC to use for communication | NCCL_IB_HCA=^mlx5_bond_0 |
| NCCL_SOCKET_IFNAME | IP interface to use for communication | NCCL_SOCKET_IFNAME=bond0,eth0 |
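
When debugging communication problems, it can help to log the NCCL settings a container actually received. The following sketch gathers every `NCCL_`-prefixed variable and splits the comma-separated interface lists. Both function names are hypothetical. The handling of a leading `^` as an exclusion marker reflects NCCL's documented syntax as understood here; check the NCCL documentation for the version in use.

```python
import os


def nccl_settings(env=os.environ):
    """Return every NCCL_* variable found in the environment, sorted by name."""
    return {k: v for k, v in sorted(env.items()) if k.startswith("NCCL_")}


def split_interface_list(value):
    """Split a comma-separated NIC list.

    Returns (excluded, names). A leading '^' is treated as marking the
    list as interfaces to exclude rather than to use.
    """
    excluded = value.startswith("^")
    names = [n for n in value.lstrip("^").split(",") if n]
    return excluded, names
```

For the example values in the table, `NCCL_IB_HCA=^mlx5_bond_0` parses as an exclusion list with one entry, and `NCCL_SOCKET_IFNAME=bond0,eth0` parses as an inclusion list with two entries.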

| Variable | Description | Example |
| --- | --- | --- |
| S3_ENDPOINT | OBS endpoint | S3_ENDPOINT=https://obs.region.example.com |
| S3_VERIFY_SSL | Whether to use SSL to access OBS | S3_VERIFY_SSL=0 |
| S3_USE_HTTPS | Whether to use HTTPS to access OBS | S3_USE_HTTPS=1 |
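
A script that talks to OBS may want these settings as typed values rather than raw strings. The sketch below reads them into a dictionary. The helper names are hypothetical, and it assumes that `1` means enabled and `0` means disabled, as the examples suggest. The defaults used when a variable is missing are also assumptions.

```python
import os


def _flag(value, default):
    """Interpret '1'/'0'-style strings as booleans (assumed convention)."""
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes"}


def obs_connection_settings(env=os.environ):
    """Read the OBS-related variables into a plain dictionary."""
    return {
        "endpoint": env.get("S3_ENDPOINT", ""),
        "use_https": _flag(env.get("S3_USE_HTTPS"), default=True),
        "verify_ssl": _flag(env.get("S3_VERIFY_SSL"), default=True),
    }
```

The resulting dictionary can then be passed to whichever OBS or S3-compatible client the training code uses.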

| Variable | Description | Example |
| --- | --- | --- |
| MA_PIP_HOST | Domain name of the PIP source | MA_PIP_HOST=repo.example.com |
| MA_PIP_URL | Address of the PIP source | MA_PIP_URL=http://repo.example.com/repository/pypi/simple/ |
| MA_APIGW_ENDPOINT | ModelArts API Gateway address | MA_APIGW_ENDPOINT=https://modelarts.region.example.example.com |
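
If a training script needs to install an extra package at startup, `MA_PIP_URL` and `MA_PIP_HOST` can be passed to pip explicitly. The sketch below only builds the command line; `pip_install_command` is a hypothetical helper. Whether the container's pip is already configured to use this source is not stated here, so treat the explicit flags as an assumption.

```python
import os
import sys


def pip_install_command(package, env=os.environ):
    """Build a pip install command that targets the preset PIP source."""
    cmd = [sys.executable, "-m", "pip", "install"]
    url = env.get("MA_PIP_URL")
    host = env.get("MA_PIP_HOST")
    if url:
        cmd += ["--index-url", url]
    if host:
        # The example source uses plain HTTP, so pip needs the host trusted.
        cmd += ["--trusted-host", host]
    return cmd + [package]
```

The returned list can be handed to `subprocess.run(cmd, check=True)` to perform the installation.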

| Variable | Description | Example |
| --- | --- | --- |
| MA_CURRENT_INSTANCE_NAME | Name of the current node for multi-node training | MA_CURRENT_INSTANCE_NAME=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1 |

| Variable | Description | Example |
| --- | --- | --- |
| MA_SKIP_IMAGE_DETECT | Whether to enable the ModelArts pre-check. The default value is 1, which enables the pre-check; the value 0 disables it. Keeping the pre-check enabled is recommended so that node and driver faults are detected before they affect services. | MA_SKIP_IMAGE_DETECT=1 |