Developing Code for Training Using a Custom Image

Updated on 2024-12-26 GMT+08:00

If the preset images offered by ModelArts Standard do not meet your needs, create custom images for model training.

Customizing an image requires a deep understanding of containers. Use this method only if the subscribed algorithms and preset images cannot meet your requirements. Custom images can be used to train models in ModelArts Standard only after they are uploaded to the Software Repository for Container (SWR).

Boot Command Specifications for Custom Images

Create an image based on the ModelArts image specifications. Then, when creating a training job, select your own image and configure the code directory (optional) and the boot command.

Figure 1 Selecting a custom image
NOTE:

When you use a custom image to create a training job, the boot command must be executed in the /home/ma-user directory. Otherwise, the training job may run abnormally.

Training jobs created from custom images are started by conda env rather than in a shell, so you cannot run the conda activate command to activate a specific Conda environment. Start training in another way instead. For example, suppose Conda in your custom image is installed in the /home/ma-user/anaconda3 directory, the Conda environment is named python-3.7.10, and the training script is stored in /home/ma-user/modelarts/user-job-dir/code/train.py. You can then start training in the specified Conda environment in any of the following ways:

  • Method 1: Configure the correct DEFAULT_CONDA_ENV_NAME and ANACONDA_DIR environment variables for the image.
    ANACONDA_DIR=/home/ma-user/anaconda3
    DEFAULT_CONDA_ENV_NAME=python-3.7.10
    Run the python command to start the training script. The following shows an example:
    python /home/ma-user/modelarts/user-job-dir/code/train.py
  • Method 2: Use the absolute path of Conda environment Python.
    Run the /home/ma-user/anaconda3/envs/python-3.7.10/bin/python command to start the training script. The following shows an example:
    /home/ma-user/anaconda3/envs/python-3.7.10/bin/python /home/ma-user/modelarts/user-job-dir/code/train.py
  • Method 3: Configure the PATH environment variable.
    Add the bin directory of the specified Conda environment to the PATH environment variable, then run the python command to start the training script. The following shows an example:
    export PATH=/home/ma-user/anaconda3/envs/python-3.7.10/bin:$PATH; python /home/ma-user/modelarts/user-job-dir/code/train.py
  • Method 4: Run the conda run -n command.
    Run the /home/ma-user/anaconda3/bin/conda run -n python-3.7.10 command to execute the training. The following shows an example:
    /home/ma-user/anaconda3/bin/conda run -n python-3.7.10 python /home/ma-user/modelarts/user-job-dir/code/train.py
NOTE:

If there is an error indicating that the .so file is unavailable in the $ANACONDA_DIR/envs/$DEFAULT_CONDA_ENV_NAME/lib directory, add the directory to LD_LIBRARY_PATH and place the following command before the preceding boot command:

export LD_LIBRARY_PATH=$ANACONDA_DIR/envs/$DEFAULT_CONDA_ENV_NAME/lib:$LD_LIBRARY_PATH;

For example, with this addition, the boot command for method 1 becomes:

export LD_LIBRARY_PATH=$ANACONDA_DIR/envs/$DEFAULT_CONDA_ENV_NAME/lib:$LD_LIBRARY_PATH; python /home/ma-user/modelarts/user-job-dir/code/train.py

Training Code Adaptation Specifications for Training Using an Ascend-powered Custom Image

When creating a training job that uses NPU resources, the system automatically generates the Ascend HCCL RANK_TABLE_FILE file in the training container. When using a preset image, Ascend HCCL RANK_TABLE_FILE is automatically parsed during training. When using a custom image, the training code must be modified to read and parse Ascend HCCL RANK_TABLE_FILE.

Ascend HCCL RANK_TABLE_FILE file description

Ascend HCCL RANK_TABLE_FILE describes the cluster used by an Ascend distributed training job. It enables distributed communication between Ascend chips and can be parsed by the Huawei Collective Communication Library (HCCL). The file has two format versions: template 1 and template 2.

  • ModelArts provides the template 2 format. The Ascend HCCL RANK_TABLE_FILE file in the ModelArts training environment is named jobstart_hccl.json. You can access this file using the preset RANK_TABLE_FILE environment variable.
    Table 1 RANK_TABLE_FILE environment variable

    Environment Variable   Description
    RANK_TABLE_FILE        Directory of the Ascend HCCL RANK_TABLE_FILE, which is /user/config. Obtain the file using ${RANK_TABLE_FILE}/jobstart_hccl.json.

    Example of the jobstart_hccl.json file content in the ModelArts training environment (template 2):
    {
    	"group_count": "1",
    	"group_list": [{
    		"device_count": "1",
    		"group_name": "job-trainjob",
    		"instance_count": "1",
    		"instance_list": [{
    			"devices": [{
    				"device_id": "4",
    				"device_ip": "192.1.10.254"
    			}],
    			"pod_name": "jobxxxxxxxx-job-trainjob-0",
    			"server_id": "192.168.0.25"
    		}]
    	}],
    	"status": "completed"
    }

    When the training script starts, the status value in jobstart_hccl.json may not yet be completed. In that case, wait until the status value changes to completed before reading the rest of the file.

  • After the status field changes to completed, the training script can convert the jobstart_hccl.json file from the template 2 format to the template 1 format.
    Format of the jobstart_hccl.json file after format conversion (template 1):
    {
    	"server_count": "1",
    	"server_list": [{
    		"device": [{
    			"device_id": "4",
    			"device_ip": "192.1.10.254",
    			"rank_id": "0"
    		}],
    		"server_id": "192.168.0.25"
    	}],
    	"status": "completed",
    	"version": "1.0"
    }
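The two steps above, waiting for status to become completed and then converting template 2 to template 1, can be sketched in the training code as follows. This is a minimal illustration, not an official ModelArts utility; the helper names and the sequential rank_id assignment across devices are assumptions of this sketch:

```python
import json
import os
import time

def wait_for_hccl_file(timeout=600, interval=5):
    """Poll jobstart_hccl.json until its status field is "completed".
    The file lives under the directory given by the RANK_TABLE_FILE
    environment variable (/user/config in ModelArts)."""
    path = os.path.join(os.environ.get("RANK_TABLE_FILE", "/user/config"),
                        "jobstart_hccl.json")
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(path) as f:
                data = json.load(f)
            if data.get("status") == "completed":
                return data
        except (OSError, ValueError):
            pass  # file not written yet, or only partially written
        time.sleep(interval)
    raise TimeoutError("jobstart_hccl.json did not reach 'completed'")

def template2_to_template1(t2):
    """Convert a template 2 rank table to template 1, assigning rank_id
    sequentially across all devices (an assumption of this sketch)."""
    servers, rank = [], 0
    for group in t2["group_list"]:
        for inst in group["instance_list"]:
            devices = []
            for dev in inst["devices"]:
                devices.append({"device_id": dev["device_id"],
                                "device_ip": dev["device_ip"],
                                "rank_id": str(rank)})
                rank += 1
            servers.append({"device": devices,
                            "server_id": inst["server_id"]})
    return {"server_count": str(len(servers)),
            "server_list": servers,
            "status": t2["status"],
            "version": "1.0"}
```

Applied to the template 2 example shown earlier, template2_to_template1 yields the template 1 structure shown above, with rank_id "0" assigned to the single device.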

Mount Points of a Training Job in a Container

When training a model with a custom image, the mount points in the container are shown in Table 2.

Table 2 Training job mount points

Mount Point              Read Only   Remarks
/xxx                     No          Directory where a dedicated resource pool mounts an SFS disk. You can specify this directory.
/home/ma-user/modelarts  No          This folder is empty. Use it as the main directory.
/cache                   No          Mounts the host's NVMe disk (available with bare metal specifications).
/dev/shm                 No          Used for PyTorch engine acceleration.
/usr/local/nvidia        Yes         NVIDIA library of the host machine.
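Because /cache is backed by the host's NVMe disk on supported specifications, a common pattern is to stage input data there before training for faster local reads. The sketch below illustrates this; the source and destination paths are illustrative assumptions, not fixed ModelArts paths:

```python
import os
import shutil

def stage_to_cache(src, dst="/cache/data"):
    """Copy a dataset directory to /cache for faster local reads, falling
    back to the original path if the destination is unavailable. Both
    paths here are illustrative assumptions."""
    if os.path.isdir(os.path.dirname(dst)) and not os.path.exists(dst):
        try:
            shutil.copytree(src, dst)
            return dst
        except OSError:
            pass  # e.g. no space or no permission; use the source as-is
    return src
```

The training script would then read data from whatever path stage_to_cache returns, so the same code works whether or not /cache is available.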
