Selecting a Training Mode
ModelArts provides multiple training modes for the MindSpore engine, letting you obtain different diagnosis information depending on your scenario.
On the training job creation page, you can select General, High performance, or Fault diagnosis for training mode. The default value is General. For details about debugging information in General mode, see Training Log Details.
Use High performance or Fault diagnosis in the following scenarios:
- High performance: Certain O&M functions are adjusted or even disabled to maximize running speed, at the cost of fault locating capability. This mode is suitable for stable networks that require high performance.
- Fault diagnosis: Certain O&M functions are enabled or adjusted to collect additional information for locating faults. You can select a diagnosis type as required.
The following table details debugging information obtained in each mode.
| Debugging Information | General | High performance | Fault diagnosis | Description |
| --- | --- | --- | --- | --- |
| MindSpore log level | Info level | Error level | Info level | MindSpore framework runtime logs. |
| Running Data Recorder (RDR) | Disabled | Disabled | Enabled | If a running exception occurs, the recorded MindSpore data is automatically exported to help locate the exception cause. Different data is exported for different exceptions. For details about RDR, see the MindSpore documentation. |
| analyze_fail.dat | Enabled by default | Enabled by default | Enabled by default | Graph build failure information is automatically exported for inference process analysis. The file is uploaded to the training job log path. |
| Dump data | Enabled by default | Enabled by default | Enabled by default | Dump data is exported when an exception occurs during backend running. The data is uploaded to the training job log path. |
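The mode differences in the table above correspond to standard MindSpore switches. As a minimal sketch, assuming MindSpore's documented `GLOG_v` log-level variable and the `MS_RDR_ENABLE`/`MS_RDR_PATH` RDR switches (exact variable names can vary across MindSpore versions), the per-mode settings could be reproduced by hand like this; ModelArts applies equivalent settings automatically when you select a mode:

```python
import os

# Sketch only: mirror the per-mode settings from the table above using
# MindSpore's documented environment variables. These must be set BEFORE
# MindSpore is imported.

MODE = "Fault diagnosis"  # one of: "General", "High performance", "Fault diagnosis"

# GLOG_v controls the MindSpore log level: 0=DEBUG, 1=INFO, 2=WARNING, 3=ERROR.
GLOG_LEVELS = {"General": "1", "High performance": "3", "Fault diagnosis": "1"}
os.environ["GLOG_v"] = GLOG_LEVELS[MODE]

# The Running Data Recorder (RDR) is enabled only in fault diagnosis mode.
if MODE == "Fault diagnosis":
    os.environ["MS_RDR_ENABLE"] = "1"
    os.environ["MS_RDR_PATH"] = "/cache/rdr"  # hypothetical local output path

import mindspore  # the settings above now apply to this process
```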
In fault diagnosis mode, after the fault diagnosis function is enabled, you can view the following fault diagnosis data, which is stored in the OBS directory specified as the training log path.
The training output log files in fault diagnosis mode are organized as follows:
```
{obs-log-path}/
    modelarts-job-{job-id}-worker-{index}.log                           # Displayed log summary
    modelarts-job-{job-id}-proc-rank-{rank-id}-device-{device-id}.txt   # Displayed logs of each device
    modelarts-job-{job-id}/
        ascend/
            npu_collect/rank_{id}/      # Output path for TFAdapter DUMP GRAPH and GE DUMP GRAPH; generated only for the TensorFlow framework
            process_log/rank_{id}/      # Plog log path
            msnpureport/{task-index}/   # msnpureport tool execution logs, which you do not need to pay attention to
        mindspore/
            log/                        # MindSpore framework logs and MindSpore fault diagnosis data
```
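Because this data lands in OBS, you typically copy it to a local directory before inspecting it. A minimal sketch, assuming the MoXing file API preinstalled in ModelArts environments and a placeholder bucket path:

```python
# Sketch only: pull the fault-diagnosis output from the training log path in
# OBS to a local directory for inspection. The OBS path is a placeholder.
import os

import moxing as mox

obs_log_path = "obs://my-bucket/my-job-logs/"   # hypothetical {obs-log-path}
local_dir = "/tmp/fault_diagnosis_logs"

# Recursively copy everything under the log path from OBS.
mox.file.copy_parallel(obs_log_path, local_dir)

# List what was downloaded, e.g. the per-device .txt logs and the
# modelarts-job-{job-id}/ directory shown above.
for root, _dirs, files in os.walk(local_dir):
    for name in files:
        print(os.path.join(root, name))
```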
| Category | Description |
| --- | --- |
| CANN framework logs and fault diagnosis data | Host logs of the INFO or higher levels, including CANN software stack logs and driver logs. |
| MindSpore framework logs and fault diagnosis data | MindSpore framework logs of the INFO or higher levels.<br>RDR files: if a running exception occurs, the recorded MindSpore data is automatically exported to help locate the exception cause; different data is exported for different exceptions.<br>analyze_fail.dat: graph build failure information is automatically exported for inference process analysis.<br>Dump data: exported when an exception occurs during backend running. |
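For comparison, outside ModelArts the Dump data listed above is normally enabled in MindSpore through a JSON configuration referenced by the `MINDSPORE_DUMP_CONFIG` environment variable. A minimal sketch follows; the field set is based on the MindSpore dump documentation but varies across versions, and the paths and net name are placeholders:

```python
# Sketch only: write a MindSpore dump configuration and point
# MINDSPORE_DUMP_CONFIG at it before MindSpore is imported.
import json
import os

dump_config = {
    "common_dump_settings": {
        "dump_mode": 0,                # 0: dump all kernels
        "path": "/cache/dump_output",  # hypothetical absolute output path
        "net_name": "MyNet",           # placeholder network name
        "iteration": "all",
        "input_output": 0,             # 0: dump both inputs and outputs
        "kernels": [],
        "support_device": [0, 1, 2, 3, 4, 5, 6, 7],
        "op_debug_mode": 0,
    }
}

config_path = "/tmp/dump_config.json"
with open(config_path, "w") as f:
    json.dump(dump_config, f, indent=2)

os.environ["MINDSPORE_DUMP_CONFIG"] = config_path  # must be set before import

import mindspore  # dump takes effect for subsequent graph execution
```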
On the training job creation page, fault diagnosis can be enabled only when you select a MindSpore algorithm and set Resource Type to Ascend.