Selecting a Training Mode

ModelArts provides several training modes for the MindSpore engine. Each mode collects a different amount of diagnostic information, so you can choose the one that fits your scenario.

On the training job creation page, you can set the training mode to General, High performance, or Fault diagnosis. The default is General. For details about the debugging information available in General mode, see Training Log Details.

Use High performance and Fault diagnosis in the following scenarios:

High performance: Certain O&M functions are adjusted or disabled to maximize running speed, which makes faults harder to locate. This mode suits stable networks that require high performance.
Fault diagnosis: Certain O&M functions are enabled or adjusted to collect more information for locating faults. You can select a diagnosis type as required.

Figure 1 Mode selection

The following table details debugging information obtained in each mode.

**Table 1** Debugging information obtained in each mode
Debugging Information	General	High performance	Fault diagnosis	Description
MindSpore log levels	Info level	Error level	Info level	MindSpore framework runtime log
Running Data Recorder (RDR)	Disabled	Disabled	Enabled	If a running exception occurs, the recorded MindSpore data is automatically exported to help locate the exception cause. Different data is exported for different exceptions. For details about RDR, see MindSpore Documentation.
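analyze_fail.dat	Enabled by default and uploaded to the training job log path	Enabled by default and uploaded to the training job log path	Enabled by default and uploaded to the training job log path	Graph build failure information is automatically exported for inference process analysis.
Dump data	Enabled by default and uploaded to the training job log path	Enabled by default and uploaded to the training job log path	Enabled by default and uploaded to the training job log path	Dump data is exported when an exception occurs during backend running.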
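
The per-mode settings in Table 1 can also be written as a small lookup structure, which is convenient when a script needs to check what a given mode collects. The following is a minimal sketch: the values are transcribed from Table 1, while the dictionary layout and the names TRAINING_MODES, modes_with_rdr, and describe are hypothetical and not part of any ModelArts API.

```python
# Settings transcribed from Table 1. The structure and all names here are
# illustrative assumptions, not a ModelArts interface.
TRAINING_MODES = {
    "General": {"mindspore_log_level": "INFO", "rdr_enabled": False},
    "High performance": {"mindspore_log_level": "ERROR", "rdr_enabled": False},
    "Fault diagnosis": {"mindspore_log_level": "INFO", "rdr_enabled": True},
}


def modes_with_rdr(modes=TRAINING_MODES):
    """Return the names of the modes in which RDR is enabled."""
    return [name for name, cfg in modes.items() if cfg["rdr_enabled"]]


def describe(mode):
    """Return a one-line summary of the Table 1 settings for a mode."""
    cfg = TRAINING_MODES[mode]
    rdr = "enabled" if cfg["rdr_enabled"] else "disabled"
    return f"{mode}: MindSpore log level {cfg['mindspore_log_level']}, RDR {rdr}"


if __name__ == "__main__":
    for mode_name in TRAINING_MODES:
        print(describe(mode_name))
```

analyze_fail.dat and dump data are omitted from the sketch because Table 1 lists them as enabled by default rather than as a per-mode setting.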

In fault diagnosis mode, once fault diagnosis is enabled, the following fault diagnosis data is stored in the OBS directory under the training log path.

Structure of the training output logs in fault diagnosis mode:

{obs-log-path}/
    modelarts-job-{job-id}-worker-{index}.log # Displayed log summary
    modelarts-job-{job-id}-proc-rank-{rank-id}-device-{device-id}.txt # Displayed logs of each device
    modelarts-job-{job-id}/
        ascend/
            npu_collect/rank_{id}/ # Output path for TFAdapter DUMP GRAPH and GE DUMP GRAPH, generated only for the TensorFlow framework
            process_log/rank_{id}/ # Plog log path
            msnpureport/{task-index}/ # msnpureport tool execution logs, which you can ignore
        mindspore/
            log/ # MindSpore framework logs and MindSpore fault diagnosis data
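
When locating logs for a particular job, it can help to expand the placeholders in the layout above into concrete paths. The following is a minimal sketch that only performs string substitution on the templates shown. The function name fault_diagnosis_paths, its parameters, and the sample values are hypothetical, and the sketch assumes that {id} in rank_{id} refers to the rank ID. It does not call OBS or ModelArts.

```python
import posixpath


def fault_diagnosis_paths(obs_log_path, job_id, worker_index, rank_id, device_id):
    """Expand the documented layout into concrete paths (string handling only).

    Assumption: {id} in rank_{id} is the rank ID.
    """
    root = obs_log_path.rstrip("/")
    job_dir = posixpath.join(root, f"modelarts-job-{job_id}")
    return {
        # Log summary for one worker
        "worker_log": posixpath.join(
            root, f"modelarts-job-{job_id}-worker-{worker_index}.log"
        ),
        # Logs of one device
        "device_log": posixpath.join(
            root, f"modelarts-job-{job_id}-proc-rank-{rank_id}-device-{device_id}.txt"
        ),
        # Plog log directory for one rank
        "plog_dir": posixpath.join(job_dir, "ascend", "process_log", f"rank_{rank_id}") + "/",
        # MindSpore framework logs and fault diagnosis data
        "mindspore_log_dir": posixpath.join(job_dir, "mindspore", "log") + "/",
    }


if __name__ == "__main__":
    # All values below are made-up placeholders.
    paths = fault_diagnosis_paths("obs://example-bucket/logs/", "abc123", 0, 0, 0)
    for name, path in paths.items():
        print(f"{name}: {path}")
```

posixpath is used instead of os.path so that the separators stay as forward slashes regardless of the operating system the script runs on.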

**Table 2** Fault diagnosis data of MindSpore
Category	Description
CANN framework logs and fault diagnosis data	Host logs at the INFO level or higher, including CANN software stack logs and driver logs.
MindSpore framework logs and fault diagnosis data	MindSpore framework logs at the INFO level or higher.
	RDR file. If a running exception occurs, the recorded MindSpore data is automatically exported to help locate the exception cause. Different data is exported for different exceptions.
	analyze_fail.dat. Graph build failure information is automatically exported for inference process analysis.
	Dump data, which is exported when an exception occurs during backend running.