Viewing Training Job Events
Scenario
Throughout the lifecycle of a training job, beginning with the first stage visible to you, the system backend records every key event. You can view these records at any time on the details page of the training job. They show the job's progress and status, which keeps the job transparent and traceable.
Event List
Job events help you understand how a training job runs and locate faults more accurately when an exception occurs. The following job events are supported:
- Training job created.
- Training job failures.
- Preparations timed out. The possible cause is that the cross-region algorithm synchronization or creating shared storage timed out.
- The training job is queuing and awaiting resource allocation.
- Failed to be queued.
- The training job starts to run.
- Training job executed.
- Failed to run the training job.
- The training job is preempted.
- The system detects that your training job may be suspended. Go to the job details page to view the cause and handle the issue.
- The training job has been restarted.
- The training job has been manually stopped.
- The training job has been stopped. (Maximum running duration: x hours)
- The training job has been manually deleted.
- Billing information synchronized.
- [worker-0] [Duration: second] Environment pre-check completed.
- [worker-0] [Duration: second] Pre-check failed. Exception:
- [worker-0] [Duration: second] Pre-check failed. Error:
- [worker-0] [Duration: second] Training code downloaded.
- [worker-0] [Duration: second] Failed to download the training code. Failure cause:
- [worker-0] [Duration: second] Training input (parameter: xxx) downloaded.
- [worker-0] [Duration: second] Failed to download the training input (parameter: xxx). Failure cause:
- [worker-0] [Duration: second] Training output (parameter: xxx) uploaded.
- [worker-0] [Duration: second] Training output () prefetched.
- [worker-0] [Duration: second] Training pre-startup script executed.
- [worker-0] [Duration: second] Python dependency packages installed.
- [worker-0] rankTable training acceleration library installed.
- [worker-0] modelarts-turbo training acceleration library installed.
- [worker-0] turbo training acceleration library installed.
- [worker-0] Training discover library installed.
- [worker-0] The training output and log upload processes exit, and files will not be synchronized.
- [worker-0] Training container heartbeat detection timed out.
- [worker-0] The training job starts to run.
- [worker-0] Training started.
- [worker-0] Training completed, exit code.
While a training job is running, key events can be refreshed manually or automatically.
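If you copy event messages out of the console for troubleshooting, it can help to split the worker and duration tags from the message text. The following is a minimal sketch, not a ModelArts API. It assumes that messages follow the layout shown in the list above: an optional `[worker-N]` tag, an optional `[Duration: ...]` tag, then free text. The names `EVENT_PATTERN`, `ParsedEvent`, and `parse_event` are hypothetical, and the exact text rendered inside the duration tag may differ from the placeholder used in the list.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout (taken from the event list, not from an official schema):
#   [worker-N] [Duration: <value>] <message text>
# Both tags are optional; job-level events carry neither.
EVENT_PATTERN = re.compile(
    r"^(?:\[(?P<worker>worker-\d+)\]\s*)?"
    r"(?:\[Duration:\s*(?P<duration>[^\]]*)\]\s*)?"
    r"(?P<message>.*)$",
    re.DOTALL,
)


class ParsedEvent(NamedTuple):
    worker: Optional[str]    # e.g. "worker-0", or None for job-level events
    duration: Optional[str]  # raw text inside the Duration tag, or None
    message: str             # remaining event text


def parse_event(text: str) -> ParsedEvent:
    """Split one event message into its worker tag, duration tag, and text."""
    match = EVENT_PATTERN.match(text.strip())
    # The message group accepts any text, so the pattern always matches.
    assert match is not None
    return ParsedEvent(
        worker=match.group("worker"),
        duration=match.group("duration"),
        message=match.group("message"),
    )


def is_worker_event(text: str) -> bool:
    """Return True if the message carries a [worker-N] tag."""
    return parse_event(text).worker is not None
```

For example, `parse_event("[worker-0] Training started.")` yields the worker `worker-0`, no duration, and the message `Training started.`, while a job-level message such as `Training job created.` yields no worker and no duration.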
Notes and Constraints
The system automatically stores training job events for 30 days. Events older than that are deleted, so export any event information you need to keep before it expires.
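The retention window can be checked with simple date arithmetic. The sketch below assumes that the 30 days are counted from the event occurrence time and that an event counts as expired once the full 30 days have elapsed. The source does not state the exact boundary, and the names `RETENTION`, `expires_at`, and `is_expired` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Retention period stated in this section.
RETENTION = timedelta(days=30)


def expires_at(event_time: datetime) -> datetime:
    """Return the assumed time at which an event leaves the retention window."""
    return event_time + RETENTION


def is_expired(event_time: datetime, now: datetime) -> bool:
    """Return True if the event is older than the retention period (assumed boundary)."""
    return now >= expires_at(event_time)


# Example: an event recorded on 1 January 2024 (UTC).
event_time = datetime(2024, 1, 1, tzinfo=timezone.utc)
deadline = expires_at(event_time)
```

Under these assumptions, an event recorded on 1 January 2024 would be kept until 31 January 2024.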
Procedure
- On the ModelArts console, choose Model Build > Training in the navigation pane. (On the old console, choose Model Training > Training Jobs.)
- In the training job list, click the name of the target job to go to the training job details page.
- On the training job details page, click the Events tab to view the event type, event source, event information, and event occurrence time.
Figure 1 Events