Updated on 2024-06-12 GMT+08:00

Selecting a Training Mode

ModelArts provides several training modes for the MindSpore engine so that you can obtain the diagnosis information that suits your scenario.

On the training job creation page, you can set the training mode to General, High performance, or Fault diagnosis. The default value is General. For details about debugging information in General mode, see Training Log Details.

Use High performance and Fault diagnosis in the following scenarios:

  • High performance: Certain O&M functions are adjusted or disabled to maximize running speed, which makes faults harder to locate. This mode suits stable networks that require high performance.
  • Fault diagnosis: Certain O&M functions are enabled or adjusted to collect more information for locating faults. You can select a diagnosis type as required.
Figure 1 Mode selection

The following table details debugging information obtained in each mode.

Table 1 Debugging information obtained in each mode

| Debugging Information | General | High performance | Fault diagnosis | Description |
|---|---|---|---|---|
| MindSpore log levels | Info level | Error level | Info level | MindSpore framework runtime log |
| Running Data Recorder (RDR) | Disabled | Disabled | Enabled | If a running exception occurs, the recorded MindSpore data is automatically exported to help locate the exception cause. Different data is exported for different exceptions. For details about RDR, see MindSpore Documentation. |
| analyze_fail.dat | Enabled by default and uploaded to the training job log path | Enabled by default and uploaded to the training job log path | Enabled by default and uploaded to the training job log path | Graph build failure information is automatically exported for inference process analysis. |
| Dump data | Enabled by default and uploaded to the training job log path | Enabled by default and uploaded to the training job log path | Enabled by default and uploaded to the training job log path | Dump data is exported when an exception occurs during backend running. |
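
The mode differences in Table 1 can be captured as a small lookup, which may be handy when reproducing a mode's log level outside ModelArts. This is a minimal sketch: the dictionary layout and the names `MODE_DEBUG_INFO`, `debug_info`, and `local_env` are illustrative and not part of any ModelArts API. MindSpore itself reads its log level from the `GLOG_v` environment variable (1 = INFO, 3 = ERROR); whether ModelArts applies the level through that variable is an assumption.

```python
# Values transcribed from Table 1. The structure and names below are
# illustrative only; they are not part of any ModelArts API.
MODE_DEBUG_INFO = {
    "General": {"log_level": "INFO", "rdr": False},
    "High performance": {"log_level": "ERROR", "rdr": False},
    "Fault diagnosis": {"log_level": "INFO", "rdr": True},
}

# MindSpore's GLOG_v values for the two levels used in Table 1.
GLOG_V = {"INFO": "1", "ERROR": "3"}


def debug_info(mode):
    """Return the Table 1 settings for a training mode name."""
    try:
        return MODE_DEBUG_INFO[mode]
    except KeyError:
        raise ValueError(f"unknown training mode: {mode!r}") from None


def local_env(mode):
    """Environment variables that would mimic the mode's log level locally.

    Assumption: ModelArts does not document how it applies the level;
    GLOG_v is simply the variable MindSpore itself reads, and it has to
    be set before mindspore is imported.
    """
    return {"GLOG_v": GLOG_V[debug_info(mode)["log_level"]]}


if __name__ == "__main__":
    for mode in MODE_DEBUG_INFO:
        print(mode, debug_info(mode), local_env(mode))
```

analyze_fail.dat and dump data are left out of the lookup because Table 1 lists them as enabled by default in every mode.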

In fault diagnosis mode, the following fault diagnosis data is available after fault diagnosis is enabled. The data is stored in the OBS directory of the training log path.

The training output log files in fault diagnosis mode are organized as follows:

{obs-log-path}/
    modelarts-job-{job-id}-worker-{index}.log # Displayed log summary
    modelarts-job-{job-id}-proc-rank-{rank-id}-device-{device-id}.txt # Displayed logs of each device
    modelarts-job-{job-id}/
        ascend/
            npu_collect/rank_{id}/ # Output path for TFAdapter DUMP GRAPH and GE DUMP GRAPH, generated only for the TensorFlow framework
            process_log/rank_{id}/ # Plog log path
            msnpureport/{task-index}/ # msnpureport tool execution logs, which can be ignored
        mindspore/
            log/ # MindSpore framework logs and MindSpore fault diagnosis data
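
When locating a specific file, the placeholders in the layout above can be filled in programmatically. The following is a minimal sketch that only formats the path templates shown in the tree. The function names and the example values (`my-bucket/logs`, `abc123`) are hypothetical, and plain string joining is used so that an OBS-style prefix is passed through unchanged.

```python
def _join(*parts):
    """Join path segments with '/', tolerating a trailing slash on the prefix."""
    first, rest = parts[0].rstrip("/"), parts[1:]
    return "/".join([first, *rest])


def worker_log(log_path, job_id, index):
    """Path of the log summary for one worker."""
    return _join(log_path, f"modelarts-job-{job_id}-worker-{index}.log")


def device_log(log_path, job_id, rank_id, device_id):
    """Path of the per-device log."""
    name = f"modelarts-job-{job_id}-proc-rank-{rank_id}-device-{device_id}.txt"
    return _join(log_path, name)


def diagnosis_dirs(log_path, job_id, rank_id):
    """Directories that hold Plog logs and MindSpore logs for one rank."""
    root = _join(log_path, f"modelarts-job-{job_id}")
    return {
        "plog": _join(root, "ascend", "process_log", f"rank_{rank_id}"),
        "mindspore_log": _join(root, "mindspore", "log"),
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(worker_log("my-bucket/logs", "abc123", 0))
    print(device_log("my-bucket/logs", "abc123", 0, 0))
    print(diagnosis_dirs("my-bucket/logs", "abc123", 0))
```
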
Table 2 Fault diagnosis data of MindSpore

| Category | Description |
|---|---|
| CANN framework logs and fault diagnosis data | Host logs of the INFO or higher levels, including CANN software stack logs and driver logs. |
| MindSpore framework logs and fault diagnosis data | MindSpore framework logs of the INFO or higher levels. RDR file: if a running exception occurs, the recorded MindSpore data is automatically exported to help locate the exception cause. Different data is exported for different exceptions. analyze_fail.dat: graph build failure information is automatically exported for inference process analysis. Dump data, which is exported when an exception occurs during backend running. |
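
After downloading the `modelarts-job-{job-id}/` directory from OBS, a quick way to see which category of diagnosis data was actually produced is to count the files under each subtree. This is a rough sketch under one assumption: that the CANN data of Table 2 corresponds to the `ascend/` subtree and the MindSpore data to the `mindspore/` subtree, as the directory layout suggests. The function name `count_diagnosis_files` is illustrative.

```python
import os


def count_diagnosis_files(job_dir):
    """Count files per category in a local copy of modelarts-job-{job-id}/.

    Assumed mapping (inferred from the directory layout, not documented):
    ascend/ holds the CANN data and mindspore/ holds the MindSpore data.
    A missing subtree simply yields a count of 0.
    """
    counts = {"CANN": 0, "MindSpore": 0}
    for category, subdir in (("CANN", "ascend"), ("MindSpore", "mindspore")):
        for _dirpath, _dirnames, filenames in os.walk(os.path.join(job_dir, subdir)):
            counts[category] += len(filenames)
    return counts


if __name__ == "__main__":
    # Pass the path of the downloaded job directory; "." is a placeholder.
    print(count_diagnosis_files("."))
```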

To enable fault diagnosis, select a MindSpore algorithm on the training job creation page and set Resource Type to Ascend.

Figure 2 Resource Type
Figure 3 Enabling fault diagnosis