Managing Environment Variables of a Training Container

Updated on 2024-12-26 GMT+08:00

What Is an Environment Variable

This section describes environment variables preset in a training container. The environment variables include:

  • Path environment variables
  • Environment variables of a distributed training job
  • NVIDIA Collective Communications Library (NCCL) environment variables
  • OBS environment variables
  • Environment variables of the pip source
  • Environment variables of the API Gateway address
  • Environment variables of job metadata
  • Precheck environment variables
  • Suspension detection environment variables

Notes and Constraints

When defining custom environment variables, avoid using names that start with MA_ to prevent conflicts with system environment variables.
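The naming constraint above can be checked before a job is submitted. The following is a minimal sketch, assuming the custom variables are held in a plain dictionary; the function name find_reserved_names is hypothetical and is not part of any ModelArts SDK.

```python
def find_reserved_names(custom_env):
    """Return custom variable names that start with the reserved MA_ prefix."""
    return sorted(name for name in custom_env if name.startswith("MA_"))


if __name__ == "__main__":
    # Hypothetical custom variables, for illustration only.
    custom = {"BATCH_SIZE": "32", "MA_MY_FLAG": "1"}
    conflicts = find_reserved_names(custom)
    if conflicts:
        print("Rename these variables:", ", ".join(conflicts))
```

Any name the function returns should be renamed before the variables are added to the training job.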

Configuring Environment Variables

When you create a training job, you can add environment variables or modify environment variables preset in the training container.

NOTE:

To ensure data security, do not enter sensitive information, such as plaintext passwords.

Environment Variables Preset in a Training Container

Tables 1 through 8 list the environment variables preset in a training container.

The environment variable values are examples only.

Table 1 Path environment variables

Variable

Description

Example

PATH

Executable file paths

PATH=/usr/local/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

LD_LIBRARY_PATH

Paths of dynamically loaded libraries

LD_LIBRARY_PATH=/usr/local/seccomponent/lib:/usr/local/cuda/lib64:/usr/local/cuda/compat:/root/miniconda3/lib:/usr/local/lib:/usr/local/nvidia/lib64

LIBRARY_PATH

Static library paths

LIBRARY_PATH=/usr/local/cuda/lib64/stubs

MA_HOME

Main directory of a training job

MA_HOME=/home/ma-user

MA_JOB_DIR

Parent directory of the training algorithm folder

MA_JOB_DIR=/home/ma-user/modelarts/user-job-dir

MA_MOUNT_PATH

Path mounted to a ModelArts training container, which is used to temporarily store training algorithms, algorithm input, algorithm output, and logs

MA_MOUNT_PATH=/home/ma-user/modelarts

MA_LOG_DIR

Training log directory

MA_LOG_DIR=/home/ma-user/modelarts/log

MA_SCRIPT_INTERPRETER

Training script interpreter

MA_SCRIPT_INTERPRETER=

WORKSPACE

Training algorithm directory

WORKSPACE=/home/ma-user/modelarts/user-job-dir/code
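A training script can read the path variables in Table 1 instead of hard-coding directories. The following is a minimal sketch, assuming the variables are present in the process environment; the helper name training_paths is hypothetical.

```python
import os
from pathlib import Path


def training_paths(env=None):
    """Collect the path variables from Table 1 into a dictionary of Path objects.

    Variables that are unset or empty are skipped rather than guessed.
    """
    env = os.environ if env is None else env
    names = ("MA_HOME", "MA_JOB_DIR", "MA_MOUNT_PATH", "MA_LOG_DIR", "WORKSPACE")
    return {name: Path(env[name]) for name in names if env.get(name)}


if __name__ == "__main__":
    for name, path in training_paths().items():
        print(f"{name} -> {path}")
```

Skipping unset variables keeps the script usable outside a training container, where these variables are usually absent.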

Table 2 Environment variables of a distributed training job

Variable

Description

Example

MA_CURRENT_IP

IP address of a job container.

MA_CURRENT_IP=192.168.23.38

MA_NUM_GPUS

Number of accelerator cards in a job container.

MA_NUM_GPUS=8

MA_TASK_NAME

Name of a job container, for example:

  • worker in MindSpore and PyTorch
  • learner or worker in reinforcement learning engines
  • ps or worker in TensorFlow

MA_TASK_NAME=worker

MA_NUM_HOSTS

Number of instances, obtained automatically from the Compute Nodes setting.

MA_NUM_HOSTS=4

VC_TASK_INDEX

Container index, starting from 0. This variable has no effect in single-node training. In multi-node training jobs, you can use it to select the algorithm logic that each container runs.

VC_TASK_INDEX=0

VC_WORKER_NUM

Number of instances required for a training job.

VC_WORKER_NUM=4

VC_WORKER_HOSTS

Domain names of all nodes in multi-node training, separated by commas (,) and listed in index order. You can resolve a domain name to obtain the IP address of the node.

VC_WORKER_HOSTS=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-0.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-2.modelarts-job-a0978141-1712-4f9b-8a83-000000000000,modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-3.modelarts-job-a0978141-1712-4f9b-8a83-000000000000

${MA_VJ_NAME}-${MA_TASK_NAME}-N.${MA_VJ_NAME}

Communication domain name of a node. For example, the communication domain name of node 0 is ${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}.

N indicates the instance index, which ranges from 0 to the number of instances minus 1.

For example, if there are four instances, the communication domain names are as follows:

${MA_VJ_NAME}-${MA_TASK_NAME}-0.${MA_VJ_NAME}

${MA_VJ_NAME}-${MA_TASK_NAME}-1.${MA_VJ_NAME}

${MA_VJ_NAME}-${MA_TASK_NAME}-2.${MA_VJ_NAME}

${MA_VJ_NAME}-${MA_TASK_NAME}-3.${MA_VJ_NAME}
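The variables in Table 2 are typically combined to work out where a container sits in a multi-node job. The following is a minimal sketch with two assumptions that the source does not state: each node runs MA_NUM_GPUS processes, and the first entry of VC_WORKER_HOSTS is used as the rendezvous address. The helper name distributed_layout is hypothetical.

```python
import os


def distributed_layout(env=None):
    """Derive node rank, world size, and a rendezvous host from Table 2 variables.

    Assumptions (illustrative, not documented ModelArts behavior):
    every node runs MA_NUM_GPUS processes, and the first entry of
    VC_WORKER_HOSTS serves as the rendezvous address.
    """
    env = os.environ if env is None else env
    hosts = [h for h in env.get("VC_WORKER_HOSTS", "").split(",") if h]
    # VC_TASK_INDEX has no effect in single-node training, so default to 0.
    node_rank = int(env.get("VC_TASK_INDEX", "0"))
    gpus_per_node = int(env.get("MA_NUM_GPUS", "1"))
    num_nodes = int(env.get("MA_NUM_HOSTS", str(len(hosts) or 1)))
    return {
        "node_rank": node_rank,
        "num_nodes": num_nodes,
        "gpus_per_node": gpus_per_node,
        "world_size": num_nodes * gpus_per_node,
        "master_addr": hosts[0] if hosts else "127.0.0.1",
    }


if __name__ == "__main__":
    print(distributed_layout())
```

The returned values map onto the arguments that most distributed launchers expect, but the exact wiring depends on the training framework in use.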

Table 3 NCCL environment variables

Variable

Description

Example

NCCL_VERSION

NCCL version

NCCL_VERSION=2.7.8

NCCL_DEBUG

NCCL log level

NCCL_DEBUG=INFO

NCCL_IB_HCA

InfiniBand NIC to use for communication

NCCL_IB_HCA=^mlx5_bond_0

NCCL_SOCKET_IFNAME

IP interface to use for communication

NCCL_SOCKET_IFNAME=bond0,eth0

Table 4 OBS environment variables

Variable

Description

Example

S3_ENDPOINT

OBS endpoint

N/A

S3_VERIFY_SSL

Whether to use SSL to access OBS

S3_VERIFY_SSL=0

S3_USE_HTTPS

Whether to use HTTPS to access OBS

S3_USE_HTTPS=1
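The OBS variables in Table 4 can be read into a small settings dictionary. The following is a minimal sketch, assuming that 1 means enabled and 0 means disabled, as the example values suggest; the helper name obs_settings is hypothetical, and no default is invented for a variable that is unset.

```python
import os


def obs_settings(env=None):
    """Read the OBS variables from Table 4 without inventing defaults."""
    env = os.environ if env is None else env

    def flag(name):
        # Assumption: "1" means enabled and "0" means disabled, as in the examples.
        value = env.get(name)
        return None if value is None else value == "1"

    return {
        "endpoint": env.get("S3_ENDPOINT"),
        "use_https": flag("S3_USE_HTTPS"),
        "verify_ssl": flag("S3_VERIFY_SSL"),
    }


if __name__ == "__main__":
    print(obs_settings())
```

A value of None in the result means the variable was not set in the container.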

Table 5 Environment variables of the pip source and API Gateway address

Variable

Description

Example

MA_PIP_HOST

Domain name of the pip source

MA_PIP_HOST=repo.myhuaweicloud.com

MA_PIP_URL

Address of the pip source

MA_PIP_URL=http://repo.myhuaweicloud.com/repository/pypi/simple/

MA_APIGW_ENDPOINT

ModelArts API Gateway address

MA_APIGW_ENDPOINT=https://modelarts.region.cn-east-3.myhuaweicloud.com
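The pip source variables in Table 5 can be passed to pip when a training script installs an extra package. The following is a minimal sketch that only builds the command line; the helper name pip_install_command is hypothetical, and --index-url and --trusted-host are standard pip options.

```python
import os
import sys


def pip_install_command(package, env=None):
    """Build a pip install command that uses the preset pip source, if any."""
    env = os.environ if env is None else env
    cmd = [sys.executable, "-m", "pip", "install", package]
    url = env.get("MA_PIP_URL")
    host = env.get("MA_PIP_HOST")
    if url:
        cmd += ["--index-url", url]
    if host:
        # The example source uses plain HTTP, so pip needs the host to be trusted.
        cmd += ["--trusted-host", host]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; pass it to subprocess.run to execute.
    print(" ".join(pip_install_command("numpy")))
```

When neither variable is set, the command falls back to the default index configured for pip.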

Table 6 Environment variables of job metadata

Variable

Description

Example

MA_CURRENT_INSTANCE_NAME

Name of the current node for multi-node training

MA_CURRENT_INSTANCE_NAME=modelarts-job-a0978141-1712-4f9b-8a83-000000000000-worker-1

Table 7 Precheck environment variables

Variable

Description

Example

MA_SKIP_IMAGE_DETECT

Whether to enable the ModelArts precheck. The default value is 1, which enables the precheck. The value 0 disables it.

It is good practice to enable precheck to detect node and driver faults before they affect services.

MA_SKIP_IMAGE_DETECT=1

Table 8 Suspension detection environment variables

Variable

Description

Example

MA_HANG_DETECT_TIME

Suspension detection period. A job is considered suspended if its process I/O does not change for this period.

Value range: 10 to 720

Unit: minutes

Default value: 30

MA_HANG_DETECT_TIME=30
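Because the suspension detection period accepts only values from 10 to 720 minutes, a value can be validated before it is set on a job. The following is a minimal sketch; the function name validate_hang_detect_minutes is hypothetical.

```python
def validate_hang_detect_minutes(value):
    """Return the value as an integer if it is within the range 10 to 720 minutes."""
    minutes = int(value)
    if not 10 <= minutes <= 720:
        raise ValueError(
            f"MA_HANG_DETECT_TIME must be between 10 and 720 minutes, got {minutes}"
        )
    return minutes


if __name__ == "__main__":
    # 30 is the default value listed in the table.
    print(validate_hang_detect_minutes("30"))
```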

How Do I View Training Environment Variables?

When creating a training job, set the boot command to env and retain the default settings for the other parameters.

After the training job completes, open the Logs tab on the training job details page. The logs list all environment variables.

Figure 1 Viewing logs
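If the job needs to run a Python training script rather than the env command, the same listing can be written to the logs from the script itself. The following is a minimal sketch, assuming standard output is captured in the training logs; the helper name format_environment is hypothetical.

```python
import os


def format_environment(env=None):
    """Return NAME=value lines for every environment variable, sorted by name."""
    env = os.environ if env is None else env
    return [f"{name}={env[name]}" for name in sorted(env)]


if __name__ == "__main__":
    print("\n".join(format_environment()))
```

As with the env command, avoid this in jobs whose environment contains sensitive values, because every value is written to the logs.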
