Updated on 2024-05-07 GMT+08:00

Viewing Training Job Events

Any key event of a training job will be recorded at the backend after the training job is displayed for you. You can check events on the training job details page.

This helps you better understand the running process of a training job and locate faults more accurately when a task exception occurs. The following job events are supported:

  • Training job created.
  • Training job failures:
  • Preparations timed out. The possible cause is that the cross-region algorithm synchronization or creating shared storage timed out.
  • The training job is queuing and awaiting resource allocation.
  • Failed to be queued.
  • The training job starts to run.
  • Training job executed.
  • Failed to run the training job.
  • The training job is preempted.
  • The system detects that your training job may be suspended. Go to the job details page to view the cause and handle the issue.
  • The training job has been restarted.
  • The training job has been manually stopped.
  • The training job has been stopped. (Maximum running duration: 1 hour)
  • The training job has been stopped. (Maximum running duration: 3 hours)
  • The training job has been manually deleted.
  • Billing information synchronized.
  • [worker-0] The training environment is being pre-checked.
  • [worker-0] [Duration: second] Pre-check completed.
  • [worker-0] [Duration: second] Pre-check failed. Error: xxx
  • [worker-0] [Duration: second] Pre-check failed. Error: xxx
  • [worker-0] The training code is being downloaded.
  • [worker-0] [Duration: second] Training code downloaded.
  • [worker-0] [Duration: second] Failed to download the training code. Failure cause:
  • [worker-0] The training input is being downloaded.
  • [worker-0] [Duration: second] Training input (parameter: xxx) downloaded.
  • [worker-0] [Duration: second] Failed to download the training input (parameter: xxx). Failure cause:
  • [worker-0] Python dependency packages are being installed. Import the following files:
  • [worker-0] [Duration: second] Python dependency packages installed. Import the following files:
  • [worker-0] The training job starts to run.
  • [worker-0] Training job executed.
  • [worker-0] The training input is being uploaded.
  • [worker-0] [Duration: second] Training output (parameter: xxx) uploaded.

During the training process, key events can be manually or automatically refreshed.

Procedure

  1. On the ModelArts console, choose Training Management > Training Jobs from the navigation pane.
  2. In the training job list, click the name of the target job to go to the training job details page.
  3. Click Events to view events.
    Figure 1 Events