Viewing Training Job Events
Key events of a training job are automatically recorded in the backend. You can view them on the training job details page.
Events help you follow the execution of a training job and locate faults more accurately when an exception occurs. The following job events are supported:
- Training job created.
- Training job failures:
  - Preparations timed out. The possible cause is that cross-region algorithm synchronization or shared storage creation timed out.
- The training job is queuing and awaiting resource allocation.
- Failed to be queued.
- The training job starts to run.
- Training job executed.
- Failed to run the training job.
- The training job is preempted.
- The system detects that your training job may be suspended. Go to the job details page to view the cause and handle the issue.
- The training job has been restarted.
- The training job has been manually stopped.
- The training job has been stopped. (Maximum running duration: x hours)
- The training job has been manually deleted.
- Billing information synchronized.
- [worker-0] [Duration: second] Environment pre-check completed.
- [worker-0] [Duration: second] Pre-check failed. Exception:
- [worker-0] [Duration: second] Pre-check failed. Error:
- [worker-0] [Duration: second] Training code downloaded.
- [worker-0] [Duration: second] Failed to download the training code. Failure cause:
- [worker-0] [Duration: second] Training input (parameter: xxx) downloaded.
- [worker-0] [Duration: second] Failed to download the training input (parameter: xxx). Failure cause:
- [worker-0] [Duration: second] Training output (parameter: xxx) uploaded.
- [worker-0] [Duration: second] Training output () prefetched.
- [worker-0] [Duration: second] Training pre-startup script executed.
- [worker-0] [Duration: second] Python dependency packages installed.
- [worker-0] rankTable training acceleration library installed.
- [worker-0] modelarts-turbo training acceleration library installed.
- [worker-0] turbo training acceleration library installed.
- [worker-0] Training discover library installed.
- [worker-0] The training output and log upload processes exit, and files will not be synchronized.
- [worker-0] Training container heartbeat detection timed out.
- [worker-0] The training job starts to run.
- [worker-0] Training started.
- [worker-0] Training completed. Exit code
- [worker-0] [Duration: second] Training output (parameter: xxx) uploaded.
While a training job is running, the event list can be refreshed manually or automatically.
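If you copy event text out of the console for troubleshooting, the worker-level events listed above can be split into their parts with a small script. The following is a minimal sketch, not an official ModelArts tool. It assumes that each worker event starts with a `[worker-N]` tag, optionally followed by a `[Duration: X second]` tag in which `X` is a number. The names `parse_worker_event`, `WorkerEvent`, and `looks_like_failure` are hypothetical, and the failure keywords are taken only from the wording of the events in the list.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: "[worker-N] [Duration: X second] message".
# The duration tag is optional, and the numeric form of X is an assumption.
EVENT_PATTERN = re.compile(
    r"^\[worker-(?P<worker>\d+)\]\s*"
    r"(?:\[Duration:\s*(?P<duration>\d+(?:\.\d+)?)?\s*seconds?\]\s*)?"
    r"(?P<message>.*)$"
)


class WorkerEvent(NamedTuple):
    worker: int                        # index from the [worker-N] tag
    duration_seconds: Optional[float]  # None if no duration value is present
    message: str                       # remaining event text


def parse_worker_event(text: str) -> Optional[WorkerEvent]:
    """Parse one worker-level event line; return None for job-level events."""
    match = EVENT_PATTERN.match(text.strip())
    if match is None:
        return None
    duration = match.group("duration")
    return WorkerEvent(
        worker=int(match.group("worker")),
        duration_seconds=float(duration) if duration is not None else None,
        message=match.group("message"),
    )


def looks_like_failure(event: WorkerEvent) -> bool:
    """Heuristic based on the event wording: 'failed' or 'timed out'."""
    lowered = event.message.lower()
    return "failed" in lowered or "timed out" in lowered


if __name__ == "__main__":
    sample = "[worker-0] [Duration: 12 second] Training code downloaded."
    print(parse_worker_event(sample))
```

Job-level events such as "Training job created." carry no worker tag, so the sketch returns `None` for them instead of guessing a worker index.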
Notes and Constraints
The system automatically stores training job events for 30 days, and any expired events will be deleted.
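To decide whether events need to be exported before they are deleted, the retention window can be worked out from an event's timestamp. The sketch below is illustrative only. It assumes that the 30 days are counted from the time the event was recorded and that timestamps are timezone-aware, neither of which is confirmed here. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Retention period stated in the constraint above.
RETENTION = timedelta(days=30)


def expires_at(event_time: datetime) -> datetime:
    """Return when an event leaves the retention window.

    Assumption: the window starts at the event's own timestamp.
    """
    return event_time + RETENTION


def is_expired(event_time: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if the event is older than the retention period."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expires_at(event_time)


if __name__ == "__main__":
    recorded = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(expires_at(recorded).isoformat())
```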
Procedure
- On the ModelArts console, choose Model Training > Training Jobs from the navigation pane.
- In the training job list, click the name of the target job to go to the training job details page.
- Click Events to view events.
Figure 1 Events