Collecting and Storing Logs

Viewing User Training Logs (Proc Log)

Proc logs contain the output of the user training code. In single-node or multi-node Snt training jobs that use multiple PUs, the training process on each Snt accelerator card prints its own Python log output to the screen. The output of all training processes on a node is merged into the training log file described in Viewing Logs and Weights. For troubleshooting, check the separate log file of each training process instead. To find these files, enter the container and go to the logs directory:

docker exec -it ${Container ID} bash
cd {container_work_dir}/{af_output_dir}/logs

{container_work_dir} is the working directory in the container when the container image is started.
{af_output_dir} is the directory for storing weights and logs configured in the YAML file of the training job.
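
As a rough sketch of the troubleshooting step above, the following helper lists the per-process log files under a directory that contain a Python traceback. The function name find_failed_logs and the search string are illustrative assumptions, not part of AscendFactory; the actual file names in the logs directory depend on the training job.

```shell
# Hypothetical helper: print the names of log files under a directory that
# contain a Python traceback. The search string is an assumption; adjust it
# to whatever error text your training code prints.
find_failed_logs() {
    log_dir="$1"
    # -r: recurse into subdirectories, -l: print matching file names only
    grep -rl "Traceback (most recent call last)" "$log_dir" 2>/dev/null | sort
}
```

Inside the container, call it with the logs directory as the only argument, for example find_failed_logs {container_work_dir}/{af_output_dir}/logs. Files that are not listed contain no traceback, which narrows the search to the processes that actually failed.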

CANN Application Logs

CANN application logs record the activity of applications running on CANN. There are two types: host application logs (plog-{pid}-{time}.log) and device application logs (device-{pid}-{time}.log). These logs mainly contain:

Logs printed by compiler components (such as GE, FE, AI CPU, TBE, and HCCL) and by runtime components (such as AscendCL, GE, and Runtime).
Logs printed by the AI CPUs and HCCP on the device.
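
The two file name patterns given above can be used to tell host logs from device logs and to recover the process ID. The helper below is a minimal sketch; the function name classify_cann_log is hypothetical, and it relies only on the plog-{pid}-{time}.log and device-{pid}-{time}.log patterns.

```shell
# Hypothetical helper: classify a CANN log file by its name, following the
# plog-{pid}-{time}.log and device-{pid}-{time}.log patterns.
# Prints "host <pid>" or "device <pid>", or "unknown" for any other name.
classify_cann_log() {
    name=$(basename "$1")
    case "$name" in
        plog-*-*.log)   kind="host" ;;
        device-*-*.log) kind="device" ;;
        *)              echo "unknown"; return 1 ;;
    esac
    # Drop everything up to the first "-", then everything from the next "-",
    # which leaves the {pid} field.
    rest=${name#*-}
    pid=${rest%%-*}
    echo "$kind $pid"
}
```

Matching the extracted process ID against the ID of a failed training process is one way to pick the relevant log file out of a directory that holds logs from many processes.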

docker exec -it ${Container ID} bash
cd {container_work_dir}/{af_output_dir}/plog

{container_work_dir} is the working directory in the container when the container image is started.
{af_output_dir} is the directory for storing weights and logs configured in the YAML file of the training job.
The run/plog directory contains run logs, and the debug/plog directory contains debug logs used to analyze and locate faults.
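
When analyzing faults, it often helps to see which log files report errors at all. The following sketch counts matching lines per file under a directory. The function name count_error_lines is hypothetical, and the default marker [ERROR] is an assumption about how log lines are tagged; pass a different marker as the second argument if your logs use another format.

```shell
# Hypothetical helper: print "<count> <file>" for every .log file under a
# directory that has at least one line containing the given marker.
# The default marker "[ERROR]" is an assumption about the log line format.
count_error_lines() {
    log_dir="$1"
    marker="${2:-[ERROR]}"
    find "$log_dir" -type f -name '*.log' | sort | while IFS= read -r f; do
        # -F: treat the marker as a fixed string, -c: count matching lines.
        # grep exits non-zero when the count is 0, so ignore its exit status.
        n=$(grep -F -c -- "$marker" "$f" || true)
        if [ "$n" -gt 0 ]; then
            echo "$n $f"
        fi
    done
}
```

Pass the debug/plog directory as the first argument to get a short list of files worth opening first.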
