Viewing Training Job Logs

Updated on 2024-12-26 GMT+08:00

Overview

Training logs record the runtime process and exception information of training jobs and provide useful details for fault location. The standard output and standard error information in your code are displayed in training logs. If you encounter an issue during the execution of a ModelArts training job, view logs first. In most scenarios, you can locate the issue based on the error information reported in logs.

Training logs include common training logs and Ascend logs.

  • Common Logs: When resources other than Ascend are used for training, only common training logs are generated. Common logs include the logs for pip-requirement.txt, training process, and ModelArts.
  • Ascend Logs: When Ascend resources are used for training, device logs, plog logs, proc logs (redirection files of single-card training logs), MindSpore logs, and common logs are generated.
Figure 1 ModelArts training logs
NOTE:

Separate MindSpore logs are generated only in the MindSpore+Ascend training scenario. Logs of other AI engines are contained in common logs.

Retention Period

Logs are classified into the following types based on the retention period:

  • Real-time logs: generated while a training job is running; you can view them on the ModelArts training job details page.
  • Historical logs: After a training job is completed, you can view its historical logs on the ModelArts training job details page. ModelArts automatically stores the logs for 30 days.
  • Permanent logs: These logs are dumped to your OBS bucket. When creating a training job, you can enable persistent log saving and set a job log path for dumping.
    Figure 2 Enabling persistent log saving

Real-time logs and historical logs are identical in content. In the Ascend training scenario, permanent logs also contain Ascend logs, which are not displayed on the ModelArts console.

Common Logs

Common logs include the logs for pip-requirement.txt, training process, and ModelArts Standard.

Table 1 Common log types

  • Training process log: standard output of your training code.
  • Installation log for pip-requirement.txt: if pip-requirement.txt is defined in the training code, pip package installation logs are generated.
  • ModelArts logs: used by O&M personnel to locate service faults.

The format of a common log file is as follows. task id is the node ID of a training job.

Unified log format: modelarts-job-[job id]-[task id].log
Example: log/modelarts-job-95f661bd-1527-41b8-971c-eca55e513254-worker-0.log
  • Single-node training jobs generate a single log file, and task id defaults to worker-0.
  • Distributed training generates multiple node log files, which are distinguished by task id, such as worker-0 and worker-1.
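
For example, a two-node distributed job produces two log files in the job log path:

log/modelarts-job-[job id]-worker-0.log
log/modelarts-job-[job id]-worker-1.log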


ModelArts logs can be filtered in the common log file modelarts-job-[job id]-[task id].log using the following keywords: [ModelArts Service Log] or Platform=ModelArts-Service.

  • Type 1: [ModelArts Service Log] xxx
    [ModelArts Service Log][init] download code_url: s3://dgg-test-user/snt9-test-cases/mindspore/lenet/
  • Type 2: time="xxx" level="xxx" msg="xxx" file="xxx" Command=xxx Component=xxx Platform=xxx
    time="2021-07-26T19:24:11+08:00" level=info msg="start the periodic upload task, upload period = 5 seconds " file="upload.go:46" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service

Ascend Logs

Ascend logs are generated when Ascend resources are used for training. In this scenario, device logs, plog logs, proc logs (redirection files of single-card training logs), MindSpore logs, and common logs are generated.

Common logs in the Ascend training scenario include the logs for pip-requirement.txt, ma-pre-start, davincirun, training process, and ModelArts.

The following is an example of the Ascend log structure:
obs://dgg-test-user/snt9-test-cases/log-out/                                      # Job log path
├── modelarts-job-9ccf15f2-6610-42f9-ab99-059ba049a41e
│   ├── ascend
│   │   └── process_log
│   │       └── rank_0
│   │           ├── plog                                                          # Plog logs
│   │           │   └── ...
│   │           └── device-0                                                      # Device logs
│   │               └── ...
│   └── mindspore                                                                 # MindSpore logs
├── modelarts-job-95f661bd-1527-41b8-971c-eca55e513254-worker-0.log               # Common logs
└── modelarts-job-95f661bd-1527-41b8-971c-eca55e513254-proc-rank-0-device-0.txt   # proc log (single-card training logs)
Table 2 Ascend log description


Device logs

User process AICPU and HCCP logs generated on the device and sent back to the host (training container).

If any of the following situations occur, device logs cannot be obtained:

  • The compute node restarts unexpectedly.
  • The compute node stops as expected.

After the training process ends, the log is generated in the training container. The device logs for training using the preset MindSpore image are automatically uploaded to OBS. To automatically upload device logs for training using other preset images or custom images to OBS, specify ASCEND_PROCESS_LOG_PATH in the code. For details, see the following sample code.

# Set the NPU plog environment.
# Derive the job name by replacing the "ma-job" prefix in MA_VJ_NAME with "modelarts-job".
ma_vj_name=`echo ${MA_VJ_NAME} | sed 's:ma-job:modelarts-job:g'`
# Node (task) name of this instance, for example, worker-0.
task_name="worker-${VC_TASK_INDEX}"
# Per-node log directory under the ModelArts log root.
task_plog_path=${MA_LOG_DIR}/${ma_vj_name}/${task_name}

mkdir -p ${task_plog_path}
# Ascend components write their process logs to this path, which is then uploaded to OBS.
export ASCEND_PROCESS_LOG_PATH=${task_plog_path}

File name format: ~/ascend/log/device-{device-id}/device-{pid}_{timestamp}.log

In the preceding file name, pid indicates the user process ID on the host.

Example:

device-166_20220718191853764.log

Plog logs

User process logs, for example, ACL/GE.

Plog logs are generated in the training container. The plog logs for training using the preset MindSpore image are automatically uploaded to OBS. To automatically upload plog logs for training using custom images to OBS, specify ASCEND_PROCESS_LOG_PATH in the code. For details, see the following sample code.

# set npu plog env
ma_vj_name=`echo ${MA_VJ_NAME} | sed 's:ma-job:modelarts-job:g'`
task_name="worker-${VC_TASK_INDEX}"
task_plog_path=${MA_LOG_DIR}/${ma_vj_name}/${task_name}

mkdir -p ${task_plog_path}
export ASCEND_PROCESS_LOG_PATH=${task_plog_path}

File name format: ~/ascend/log/plog/plog-{pid}_{timestamp}.log

In the preceding file name, pid indicates the user process ID on the host.

Example: plog-166_20220718191843620.log

proc log

The proc log is a redirection file of single-card training logs, helping you quickly obtain the logs of a compute node. Training jobs using custom images do not involve proc logs. The proc log for training using a preset image is generated in the training container and automatically saved to OBS.

File name format: [modelarts-job-uuid]-proc-rank-[rank id]-device-[device logic id].txt

  • device id indicates the ID of the NPU used in the training job. The value is 0 for a single NPU and 0 to 7 for eight NPUs.

    For example, if the Ascend specification is 8*Snt9, the value of device id ranges from 0 to 7. If the Ascend specification is 1*Snt9, the value of device id is 0.

  • rank id indicates the global NPU ID in the training job. The value ranges from 0 to (number of instances × NPUs per instance) − 1. For example, a job with two instances of eight NPUs each has rank id values 0 to 15. If a single instance is used, rank id is the same as device id.

Example:

modelarts-job-95f661bd-1527-41b8-971c-eca55e513254-proc-rank-0-device-0.txt

MindSpore logs

Separate MindSpore logs are generated in the MindSpore+Ascend training scenario.

MindSpore logs are generated in the training container. The MindSpore logs for training using the preset MindSpore image are automatically uploaded to OBS. To automatically upload MindSpore logs for training using custom images to OBS, specify ASCEND_PROCESS_LOG_PATH in the code. For details, see the following sample code.

# set npu plog env
ma_vj_name=`echo ${MA_VJ_NAME} | sed 's:ma-job:modelarts-job:g'`
task_name="worker-${VC_TASK_INDEX}"
task_plog_path=${MA_LOG_DIR}/${ma_vj_name}/${task_name}

mkdir -p ${task_plog_path}
export ASCEND_PROCESS_LOG_PATH=${task_plog_path}

For details about MindSpore logs, visit the MindSpore official website.
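
MindSpore log verbosity itself is controlled by GLOG-style environment variables rather than by ModelArts. The following is a hedged sketch; variable behavior may differ across MindSpore versions, so check the MindSpore documentation:

export GLOG_v=1                         # 0: DEBUG, 1: INFO, 2: WARNING, 3: ERROR
export GLOG_logtostderr=0               # 0: write logs to files instead of stderr
export GLOG_log_dir=/tmp/mindspore-log  # Log file directory (illustrative path).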

Common training logs

Common training logs are generated in the /home/ma-user/modelarts/log directory of the training container and automatically uploaded to OBS. The common training logs include these types:

  • Logs for ma-pre-start (specific to Ascend training): if the ma-pre-start script is defined, its execution log is generated.
  • Logs for davincirun (specific to Ascend training): generated when the Ascend training process is started using the davincirun.py file.
  • Training process logs: standard output of your training code.
  • Logs for pip-requirement.txt: if pip-requirement.txt is defined in the training code, pip package installation logs are generated.
  • ModelArts logs: used by O&M personnel to locate service faults.

These logs are contained in the modelarts-job-[job id]-[task id].log file.

task id indicates the instance ID. If a single node is used, the value is worker-0. If multiple nodes are used, the value is worker-0, worker-1, ..., or worker-{n-1}. n indicates the number of instances.

Example:

modelarts-job-95f661bd-1527-41b8-971c-eca55e513254-worker-0.log

In the Ascend training scenario, after the training process exits, ModelArts uploads the log files in the training container to the OBS directory specified by Job Log Path. On the job details page, you can obtain the job log path and click the OBS address to go to the OBS console to check logs.

Figure 3 Job Log Path

You can run the ma-pre-start script to modify the default environment variable configurations.

ASCEND_GLOBAL_LOG_LEVEL=3      # Log level: 0 for debug, 1 for info, 2 for warning, and 3 for error.
ASCEND_SLOG_PRINT_TO_STDOUT=1  # Whether to print plog output to stdout. The default value 1 enables printing.
ASCEND_GLOBAL_EVENT_ENABLE=1   # Event logging: 0 to disable, 1 to enable.

Place the ma-pre-start.sh or ma-pre-start.py script in the directory at the same level as the training boot file.

Before the training boot file is executed, the system executes the ma-pre-start script in /home/work/user-job-dir/. This method can be used to update the Ascend RUN package installed in the container image or set some additional global environment variables required for training.
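
The following is a minimal ma-pre-start.sh sketch that overrides these defaults (the chosen values are illustrative):

#!/bin/bash
# ma-pre-start.sh: executed by ModelArts before the training boot file.
export ASCEND_GLOBAL_LOG_LEVEL=1       # Raise verbosity from error (3) to info (1).
export ASCEND_SLOG_PRINT_TO_STDOUT=1   # Keep printing plog output to stdout.
export ASCEND_GLOBAL_EVENT_ENABLE=0    # Disable event logging.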

Viewing Training Job Logs

On the training job details page, you can preview logs, download logs, search for logs by keyword, and filter system logs in the log pane.

  • Previewing logs

    You can preview training logs on the system log pane. If multiple compute nodes are used, you can choose the target node from the drop-down list on the right.

    Figure 4 Viewing logs of different compute nodes

    If a log file is oversized, the system displays only the latest logs in the log pane. To view all logs, click the link in the upper part of the log pane; you will be redirected to a new page.

    Figure 5 Viewing all logs
    NOTE:
    • If the total size of all logs exceeds 500 MB, the log page may be frozen. In this case, download the logs to view them locally.
    • A log preview link can be accessed by anyone within one hour after it is generated. You can share the link with others.
    • Ensure that no privacy information is contained in the logs. Otherwise, information leakage may occur.
  • Downloading logs

    Training logs are retained for only 30 days. To permanently store logs, click the download icon in the upper right corner of the log pane. You can download the logs of multiple compute nodes in a batch. You can also enable Persistent Log Saving and set a log path when you create a training job. In this way, the logs will be automatically stored in the specified OBS path.

    If a training job is created on Ascend compute nodes, certain system logs cannot be downloaded in the training log pane. To obtain these logs, go to the Job Log Path you set when you created the training job. A command-line download sketch follows Figure 6.

    Figure 6 Downloading logs
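
    If you use the obsutil command-line tool, you can copy the whole job log directory from OBS for offline viewing. A minimal sketch follows; the bucket path is the example path used earlier and must be replaced with your own Job Log Path:

    obsutil cp obs://dgg-test-user/snt9-test-cases/log-out/ ./job-logs/ -r -f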
  • Searching for logs by keyword

    In the upper right corner of the log pane, enter a keyword in the search box to search for logs, as shown in Figure 7.

    Figure 7 Searching for logs by keyword

    The system will highlight the keyword and let you jump between search results. Only the logs already loaded in the log pane are searched. If the logs are not fully displayed (see the message displayed on the page), obtain all the logs by downloading them or clicking the full log link, and then search them. On the page opened through the full log link, press Ctrl+F to search.

  • Filtering system logs
    Figure 8 System logs

    If System logs is selected, system logs and user logs are displayed. If System logs is deselected, only user logs are displayed.
