Updated on 2024-10-29 GMT+08:00

Viewing Training Job Logs

Overview

Training logs record the runtime process and exception information of training jobs and provide useful details for fault location. The standard output and standard error information in your code are displayed in training logs. If you encounter an issue during the execution of a ModelArts training job, view logs first. In most scenarios, you can locate the issue based on the error information reported in logs.

Training logs include common training logs and Ascend logs.

  • Common Logs: When resources other than Ascend are used for training, only common training logs are generated. Common logs include the logs for pip-requirement.txt, training process, and ModelArts.
  • Ascend Logs: When Ascend resources are used for training, device logs, plog logs, proc logs for single-card training, MindSpore logs, and common logs are generated.
Figure 1 ModelArts training logs

Separate MindSpore logs are generated only in the MindSpore+Ascend training scenario. Logs of other AI engines are contained in common logs.

Retention Period

Logs are classified into the following types based on the retention period:

  • Real-time logs: generated while a training job is running. You can view them on the ModelArts training job details page.
  • Historical logs: After a training job is complete, you can view its historical logs on the ModelArts training job details page. ModelArts automatically retains the logs for 30 days.
  • Permanent logs: These logs are dumped to your OBS bucket. When creating a training job, you can enable persistent log saving and set a job log path for dumping. For Ascend training, you must configure an OBS path for storing training logs by default. For training jobs using other resources, you need to manually enable Persistent Log Saving.
    Figure 2 Enabling persistent log saving

Real-time logs and historical logs have no difference in content. In the Ascend training scenario, permanent logs contain Ascend logs, which are not displayed on ModelArts.

Common Logs

Common logs include the logs for pip-requirement.txt, training process, and ModelArts Standard.

Table 1 Common log types

  • Training process log: standard output of your training code.
  • Installation logs for pip-requirement.txt: If pip-requirement.txt is defined in the training code, pip package installation logs are generated.
  • ModelArts logs: used by O&M personnel to locate service faults.

The format of a common log file is as follows. task id is the node ID of a training job.

Unified log format: modelarts-job-[job id]-[task id].log
Example: log/modelarts-job-95f661bd-1527-41b8-971c-eca55e513254-worker-0.log
  • A single-node training job generates one log file, and task id defaults to worker-0.
  • A distributed training job generates multiple node log files, which are distinguished by task id, such as worker-0 and worker-1.

ModelArts logs can be filtered in the common log file modelarts-job-[job id]-[task id].log using the following keywords: [ModelArts Service Log] or Platform=ModelArts-Service.

  • Type 1: [ModelArts Service Log] xxx
    [ModelArts Service Log][init] download code_url: s3://dgg-test-user/snt9-test-cases/mindspore/lenet/
  • Type 2: time="xxx" level="xxx" msg="xxx" file="xxx" Command=xxx Component=xxx Platform=xxx
    time="2021-07-26T19:24:11+08:00" level=info msg="start the periodic upload task, upload period = 5 seconds " file="upload.go:46" Command=obs/upload Component=ma-training-toolkit Platform=ModelArts-Service

Ascend Logs

Ascend logs are generated when Ascend resources are used for training. In this scenario, device logs, plog logs, proc logs for single-card training, MindSpore logs, and common logs are generated.

Common logs in the Ascend training scenario include the logs for pip-requirement.txt, ma-pre-start, davincirun, training process, and ModelArts.

The following is an example of the Ascend log structure:
obs://dgg-test-user/snt9-test-cases/log-out/                                      # Job log path
├── modelarts-job-9ccf15f2-6610-42f9-ab99-059ba049a41e
│   ├── ascend
│   │   └── process_log
│   │       └── rank_0
│   │           ├── plog                                                          # Plog logs
│   │           │   └── ...
│   │           └── device-0                                                      # Device logs
│   │               └── ...
│   └── mindspore                                                                 # MindSpore logs
├── modelarts-job-95f661bd-1527-41b8-971c-eca55e513254-worker-0.log               # Common logs
└── modelarts-job-95f661bd-1527-41b8-971c-eca55e513254-proc-rank-0-device-0.txt   # proc log for single-card training
Table 2 Ascend log description

Each Ascend log type is described below, together with its file name format.

Device logs

User process AICPU and HCCP logs generated on the device and sent back to the host (training container).

Device logs cannot be obtained if either of the following occurs:

  • The compute node restarts unexpectedly.
  • The compute node stops as expected.

After the training process ends, the logs are generated in the training container. Device logs for training with the preset MindSpore image are automatically uploaded to OBS. To automatically upload device logs to OBS for training with other preset images or custom images, specify ASCEND_PROCESS_LOG_PATH in your code, as shown in the following sample code.

# set npu plog env
# Convert the job name in MA_VJ_NAME to the modelarts-job-<job id> form.
ma_vj_name=`echo ${MA_VJ_NAME} | sed 's:ma-job:modelarts-job:g'`
# Name of the current compute node, for example worker-0.
task_name="worker-${VC_TASK_INDEX}"
# Directory in the training container where the Ascend logs are collected.
task_plog_path=${MA_LOG_DIR}/${ma_vj_name}/${task_name}

mkdir -p ${task_plog_path}
# Ascend components write device and plog logs under this path.
export ASCEND_PROCESS_LOG_PATH=${task_plog_path}

File name format: ~/ascend/log/device-{device-id}/device-{pid}_{timestamp}.log

In the preceding path, pid indicates the ID of the user process on the host.

Example: device-166_20220718191853764.log

Plog logs

User process logs, for example, ACL/GE.

Plog logs are generated in the training container. Plog logs for training with the preset MindSpore image are automatically uploaded to OBS. To automatically upload plog logs to OBS for training with custom images, specify ASCEND_PROCESS_LOG_PATH in your code, as shown in the following sample code.

# set npu plog env
ma_vj_name=`echo ${MA_VJ_NAME} | sed 's:ma-job:modelarts-job:g'`
task_name="worker-${VC_TASK_INDEX}"
task_plog_path=${MA_LOG_DIR}/${ma_vj_name}/${task_name}

mkdir -p ${task_plog_path}
export ASCEND_PROCESS_LOG_PATH=${task_plog_path}

File name format: ~/ascend/log/plog/plog-{pid}_{timestamp}.log

In the preceding path, pid indicates the ID of the user process on the host.

Example: plog-166_20220718191843620.log

proc log

A proc log is a redirection file of a single-card training log, helping you quickly obtain the logs of a compute node. Training jobs using custom images do not generate proc logs. For training with a preset image, proc logs are generated in the training container and automatically saved to OBS.

File name format: [modelarts-job-uuid]-proc-rank-[rank id]-device-[device logic id].txt

  • device id indicates the ID of the NPU used in the training job. The value is 0 for a single NPU and 0 to 7 for eight NPUs.

    For example, if the Ascend specification is 8*Snt9, the value of device id ranges from 0 to 7. If the Ascend specification is 1*Snt9, the value of device id is 0.

  • rank id indicates the global NPU ID in the training job. The value ranges from 0 to (number of compute nodes × number of NPUs per node) minus 1. If a single compute node is used, rank id is the same as device id. For a concrete mapping, see the sketch after this table.

Example:

modelarts-job-95f661bd-1527-41b8-971c-eca55e513254-proc-rank-0-device-0.txt

MindSpore logs

Separate MindSpore logs are generated in the MindSpore+Ascend training scenario.

MindSpore logs are generated in the training container. MindSpore logs for training with the preset MindSpore image are automatically uploaded to OBS. To automatically upload MindSpore logs to OBS for training with custom images, specify ASCEND_PROCESS_LOG_PATH in your code, as shown in the following sample code.

# set npu plog env
ma_vj_name=`echo ${MA_VJ_NAME} | sed 's:ma-job:modelarts-job:g'`
task_name="worker-${VC_TASK_INDEX}"
task_plog_path=${MA_LOG_DIR}/${ma_vj_name}/${task_name}

mkdir -p ${task_plog_path}
export ASCEND_PROCESS_LOG_PATH=${task_plog_path}

For details about MindSpore logs, visit the MindSpore official website.

Common training logs

Common training logs are generated in the /home/ma-user/modelarts/log directory of the training container and automatically uploaded to OBS. The common training logs include these types:

  • Logs for ma-pre-start (specific to Ascend training): If an ma-pre-start script is defined, its execution log is generated.
  • Logs for davincirun (specific to Ascend training): generated when the Ascend training process is started using the davincirun.py file.
  • Training process logs: standard output of your training code.
  • Logs for pip-requirement.txt: If pip-requirement.txt is defined in the training code, pip package installation logs are generated.
  • ModelArts logs: used by O&M personnel to locate service faults.

Common training logs are contained in the modelarts-job-[job id]-[task id].log file.

task id indicates the compute node ID. If a single node is used, the value is worker-0. If multiple nodes are used, the value is worker-0, worker-1, ..., or worker-{n-1}, where n indicates the number of compute nodes.

Example:

modelarts-job-95f661bd-1527-41b8-971c-eca55e513254-worker-0.log
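
To make the mapping between rank id and device id in proc log names concrete, the following sketch lists the file names you would expect for a hypothetical job with 2 compute nodes and 8 NPUs each, assuming device IDs repeat 0 to 7 on each node. The job ID is the example used in this section:

# List the expected proc log names for a 2-node x 8-NPU job (illustration only).
job_id=modelarts-job-95f661bd-1527-41b8-971c-eca55e513254
for rank in $(seq 0 15); do
  device=$((rank % 8))   # device id restarts from 0 on each compute node
  echo "${job_id}-proc-rank-${rank}-device-${device}.txt"
done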

In the Ascend training scenario, after the training process exits, ModelArts uploads the log files in the training container to the OBS directory specified by Job Log Path. On the job details page, you can obtain the job log path and click the OBS address to go to the OBS console to check logs.

Figure 3 Job Log Path
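
If you prefer the command line, you can also download the dumped logs using obsutil, the OBS command-line tool, assuming it is installed and configured with your credentials. A minimal sketch, using the example job log path from this section:

# Download the dumped training logs from the job log path to a local directory.
obsutil cp -r -f obs://dgg-test-user/snt9-test-cases/log-out/ ./job-logs/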

You can run the ma-pre-start script to modify the default environment variable configurations.

ASCEND_GLOBAL_LOG_LEVEL=3      # Log level: 0 for debug, 1 for info, 2 for warning, and 3 for error.
ASCEND_SLOG_PRINT_TO_STDOUT=1  # Whether to print plog logs to standard output. The value 1 indicates that they are printed.
ASCEND_GLOBAL_EVENT_ENABLE=1   # Event log switch: 0 disables event logging and 1 enables it.

Place the ma-pre-start.sh or ma-pre-start.py script in the directory at the same level as the training boot file.

Before the training boot file is executed, the system executes the ma-pre-start script in /home/work/user-job-dir/. This method can be used to update the Ascend RUN package installed in the container image or set some additional global environment variables required for training.
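
For reference, a minimal ma-pre-start.sh might look like the following. The values shown and the MY_EXTRA_FLAG variable are illustrative assumptions rather than defaults:

#!/bin/bash
# ma-pre-start.sh: executed in /home/work/user-job-dir/ before the training boot file runs.

# Adjust Ascend log behavior (example values).
export ASCEND_GLOBAL_LOG_LEVEL=1       # 0: debug, 1: info, 2: warning, 3: error
export ASCEND_SLOG_PRINT_TO_STDOUT=0   # do not print plog logs to standard output
export ASCEND_GLOBAL_EVENT_ENABLE=0    # disable event logging

# Set any additional global environment variables required for training.
export MY_EXTRA_FLAG=1                 # hypothetical variable for illustration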

Viewing Training Job Logs

On the training job details page, you can preview logs, download logs, search for logs by keyword, and filter system logs in the log pane.

  • Previewing logs

    You can preview training logs on the system log pane. If multiple compute nodes are used, you can choose the target node from the drop-down list on the right.

    Figure 4 Viewing logs of different compute nodes

    If a log file is oversized, the system displays only the latest logs in the log pane. To view all logs, click the link in the upper part of the log pane. You will then be redirected to a new page.

    Figure 5 Viewing all logs
    • If the total size of all logs exceeds 500 MB, the log page may be frozen. In this case, download the logs to view them locally.
    • A log preview link can be accessed by anyone within one hour after it is generated. You can share the link with others.
    • Ensure that no privacy information is contained in the logs. Otherwise, information leakage may occur.
  • Downloading logs

    Training logs are retained for only 30 days. To permanently store logs, click the download icon in the upper right corner of the log pane. You can download the logs of multiple compute nodes in a batch. You can also enable Persistent Log Saving and set a log path when you create a training job. In this way, the logs will be automatically stored in the specified OBS path.

    If a training job is created on Ascend compute nodes, certain system logs cannot be downloaded in the training log pane. To obtain these logs, go to the Job Log Path you set when you created the training job.

    Figure 6 Downloading logs
  • Searching for logs by keyword

    In the upper right corner of the log pane, enter a keyword in the search box to search for logs, as shown in Figure 7.

    Figure 7 Searching for logs by keyword

    The system highlights the keyword and lets you switch between search results. Only the logs already loaded in the log pane can be searched. If the logs are not fully displayed (a message is shown on the page), download the full logs or click the full log link, and then search there. On the page opened from the full log link, press Ctrl+F to search.

  • Filtering system logs
    Figure 8 System logs

    If System logs is selected, system logs and user logs are displayed. If System logs is deselected, only user logs are displayed.