Viewing Training Jobs and Details (New Console)
Description
This topic is specific to CN Southwest-Guiyang1. The console uses the new UI version.
When using ModelArts model training, you may need to view the list of ongoing training jobs and their details to ensure the training process is proceeding smoothly.
Through the job list page, you can easily monitor the status of each job, customize the displayed content, filter by required attributes, and quickly locate specific training jobs.
Click the target job name to view its details.
Viewing Training Jobs
- Log in to the ModelArts console.
- In the navigation pane, choose Model Build > Training.
- On the displayed page, you can view key information such as the job name/ID, status, priority, training mode, progress, duration, tags, creation time, creator, and description. Some columns support filtering or sorting. Click the icon on the right of the job search box to set and adjust the content displayed in the job list.
- You can select All, Mine, or a custom time range to quickly filter the visible job list.
- In the search box above the training job list, you can filter training jobs by attribute type, such as name, ID, status, and training mode.
- In the Operation column, you can clone, terminate, or delete a specific job.
- In the training job list, filter jobs by Training Mode (e.g., Fine-tuning or Custom). Click the target job name to view its details.
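The attribute filtering described in the steps above can be pictured with a small sketch. It does not call any ModelArts API: `TrainingJob`, `filter_jobs`, and the sample records are hypothetical names introduced only to illustrate how combining filters such as status and training mode narrows the list.

```python
from dataclasses import dataclass


@dataclass
class TrainingJob:
    # Hypothetical record mirroring a few columns of the job list.
    name: str
    job_id: str
    status: str
    training_mode: str
    creator: str


def filter_jobs(jobs, **criteria):
    """Return the jobs whose attributes equal every given criterion."""
    return [
        job
        for job in jobs
        if all(getattr(job, field) == value for field, value in criteria.items())
    ]


# Made-up sample data, for illustration only.
jobs = [
    TrainingJob("ft-demo", "job-001", "Running", "Fine-tuning", "alice"),
    TrainingJob("custom-demo", "job-002", "Completed", "Custom", "bob"),
    TrainingJob("ft-old", "job-003", "Failed", "Fine-tuning", "alice"),
]

running_fine_tuning = filter_jobs(jobs, training_mode="Fine-tuning", status="Running")
print([job.name for job in running_fine_tuning])  # prints ['ft-demo']
```

Each additional criterion behaves like an extra filter in the search box: a job stays in the result only if it matches all of them.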
Viewing Details About a Fine-Tuning Job
In the training job list, set the Training Mode filter to Fine-tuning and click a job name to open the fine-tuning job details page. The details page includes tabs for model output, task details, events, logs, resource usage, and tags. In the upper right corner, you can perform actions such as Clone, Delete, or Retry, depending on the current status.
The following sections describe the features available on each tab:
- Model output
The following model parameters can be viewed:
Final model asset: Name of the model generated after training. If the model name changes, this asset name will be updated accordingly.
Training type and objectives: Shows the specific fine-tuning method and objectives used for the job.
Training loss: Displays a loss curve graph. This visualization helps you evaluate whether the model meets your requirements.
Model asset information: Once fine-tuning is complete, the model must be published as an asset to be used in downstream tasks. You can click View assets to go to the details page of the fine-tuned model. For different versions of the same model, you can view the versions that have been published as assets on the model details page.
- Task details
This tab displays the configuration settings used for the fine-tuning job, including basic information, dataset information, training configuration, resource configuration, HA configuration, system configuration, and workflow node details. For details, see Step 2: Configuring Fine-Tuning Parameters.
Figure 1 Task details
- Events
This tab provides a timeline of events reported during job execution. This helps you monitor the progress of different training stages and detect potential anomalies early.
Figure 2 Events
- Logs
This tab displays critical runtime output. Use these logs to monitor the model fine-tuning process in detail and to troubleshoot or debug issues during execution. You can select different nodes and view their logs.
Figure 3 Logs
- Resource usage
This tab provides monitoring of metrics during the job run, including the usage of CPU, NPU, and file systems. You can select different nodes and view their resource usage.
Figure 4 Resource usage
- Tags
Displays the tags set for the fine-tuning job. Tags can also be edited.
Figure 5 Tags
Viewing Details About a Custom Job
In the training job list, set the Training Mode filter to Custom and click a job name to open the custom job details page.
The following sections describe the features available on each tab.
| Category | Description |
|---|---|
| Basic information | Basic job information. |
| Event | |
| Logs | Training logs record the execution process and error information of training jobs, helping you quickly locate issues that occur during operation. Both standard output and standard error from your code are displayed in the training logs. |
| Cloud Shell | ModelArts provides Cloud Shell, which allows you to log in to a running container to debug training jobs in the production environment. |
| Monitoring | You can use the monitoring feature to view the resource usage and metrics of training jobs and quickly learn about their status. |
| | ModelArts monitors training jobs in real time for smooth operations. The training job details page includes intelligent O&M tools for easy monitoring and maintenance. |
| | After a training job has been executed, ModelArts evaluates your model and provides optimization diagnosis and suggestions. |
| Tags | You can add tags to a training job for quick search. |
Training Job Details
The job details page displays the basic information about a job.
| Category | Parameter | Description |
|---|---|---|
| Basic Information | Name | Name of the training job. |
| | ID | Unique ID of the training job. |
| | Status | |
| | Created | Time when the training job was created. |
| | Duration | Duration of the training job, which is the total duration of Kubernetes resources across the entire lifecycle of the job. |
| | Description | Description of the training job. If no description is set, -- is displayed. |
| Training configuration | Image Type | Type of the image selected for the training job. Preset images and custom images are supported. |
| | Image | Name of the image selected for the training job. |
| | Image Address | SWR address of the image. |
| | Code Directory | OBS path to the code directory of the training job. If this parameter is not configured, -- is displayed. |
| | Code Backup Directory | OBS backup directory containing the contents of the training job code directory. |
| | Local Code Directory | Path to the training code in the training container. |
| | Boot Command | Command for booting an image. |
| | Environment Variable | Environment variables set for the training job. |
| Resource configuration | Resource Type | Type of the resource pool selected for the training job, either a dedicated resource pool or a public resource pool. |
| | Resource Pool | Name of the resource pool selected for the training job. This parameter is only available when the training job uses a dedicated resource pool. |
| | Specifications | Instance specifications used by the training job, both those allocated to the training containers and those chosen during job creation. |
| | Compute Nodes | Number of instances for the training job. |
| | Compute Node ID | By default, the number of current compute nodes is displayed. Click to display names and IP addresses of the compute nodes used by the training job. This parameter is only displayed when the training job uses a dedicated resource pool. |
| | Job Scheduling Priority | |
| | Preemption | |
| Data Configuration | Training Dataset | Name of the dataset used for the training job. If it is not configured or is not enabled, Disabled is displayed. |
| HA Settings | Fault Tolerance and Recovery | |
| | Unconditional Auto Restart | Displayed when Fault Tolerance and Recovery is enabled. When Unconditional Auto Restart is enabled during job creation, Open is displayed. If it is not configured or is not enabled, Disabled is displayed. |
| | Restart Upon Suspension | Displayed when Fault Tolerance and Recovery is enabled. When Restart Upon Suspension is enabled during job creation, Open is displayed. If it is not configured or is not enabled, Disabled is displayed. |
| Publish to Assets | Publish to Assets | When it is enabled, Enabled is displayed. The system will automatically publish model artifacts as assets, enabling operations such as inference and evaluation on the platform. If it is not configured or is not enabled, Disabled is displayed. |
| | Model Output Path | The storage path and cloud mount path are separated by a vertical bar (\|). |
| | Auto-publish to Assets | When it is enabled, Enabled is displayed. The trained model will be automatically uploaded to the Asset Management > Models > My Models page. If it is not configured or is not enabled, Disabled is displayed. |
| | Model Name | Displayed when Auto-publish to Assets is enabled. Name of the new model. Enter 2 to 128 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. The name must start with a letter and end with a letter or digit. |
| | Model Type | Displayed when Auto-publish to Assets is enabled. Model type of the published model. |
| | Model Brand | Displayed when Auto-publish to Assets is enabled. Model brand. |
| | Model Version | Displayed when Auto-publish to Assets is enabled. If the model is published as a new model, the version number is V1. If the model is published as a new version of an existing model, the version number is automatically incremented by 1 based on the previous version number of the model. Note: The model version number cannot be modified and is automatically generated by the system. |
| | Description | Displayed when Auto-publish to Assets is enabled. Description of the trained model. This field is optional and can contain a maximum of 256 characters. |
| Access Settings | JupyterLab | JupyterLab address of the training job. This parameter is available only when JupyterLab is selected for Training Application. If it is not configured, Disabled is displayed. |
| | Remote SSH | Key pair and SSH address for SSH remote development of the training job. This parameter is available only for training jobs with remote SSH enabled. If it is not configured, Disabled is displayed. |
| | Password-free SSH Between Instances | Information about the password-free SSH file configured for the training job. If it is not configured, Disabled is displayed. |
| Observability Settings | TensorBoard | TensorBoard is the visualization toolkit of TensorFlow. It provides the visualization functions and tools required for machine learning experiments, and displays the computational graph, metric trends, and data used during training. For details about TensorBoard, see the official website. This parameter is not displayed when a public resource pool is used. If enabled, the configured storage path is displayed. If it is not configured, Disabled is displayed. |
| | MindStudio Insight | MindStudio Insight visualizes information such as scalars, images, computational graphs, and model hyperparameters during training. It supports training jobs based on the MindSpore engine. For details about MindStudio Insight, see the MindSpore official website. This parameter is not displayed when a public resource pool is used. If enabled, the configured storage path is displayed. If it is not configured, Disabled is displayed. |
| | Interconnect Metrics with AOM | If it is not configured, Disabled is displayed. |
| More Configurations | Persistent Log Saving | After a log path is configured, the configured path is displayed. If it is not configured, Disabled is displayed. |
| | Job Visibility | The options are Workspace and Creator. |
| | Auto Stop | The auto stop time configured for the training job is displayed. The options are 1 hour, 2 hours, 4 hours, 6 hours, or custom. The custom value ranges from 1 to 720 hours. When you enable this function, the training stops automatically when the time limit is reached. The time limit does not count down when the training is paused. If it is not configured, Disabled is displayed. |
| | Retention Period | The retention period configured for the training job is displayed. The options are 1 hour, 2 hours, 4 hours, 6 hours, or custom. The custom value ranges from 1 to 720 hours. When you enable this parameter and set the duration, the training environment stays active for that time after the job succeeds or fails. The waiting time in the queue does not count toward this duration. If it is not configured, Disabled is displayed. |
| | Event Notification | Topic and events set for event notification during training job creation. If it is not configured, Disabled is displayed. |
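As an illustration of the Model Name and Model Version rules listed in the table above, the following sketch checks a candidate name and derives the next version label. It is not ModelArts code: the helper names are hypothetical, "letters" is assumed to mean ASCII letters, and version labels are assumed to have the form `V<number>`.

```python
import re
from typing import Optional

# 2 to 128 characters; only letters, digits, hyphens, and underscores;
# must start with a letter and end with a letter or digit.
# Assumption: "letters" means ASCII letters.
MODEL_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_-]{0,126}[A-Za-z0-9]")


def is_valid_model_name(name: str) -> bool:
    """Return True if the name satisfies the naming rule in the table."""
    return MODEL_NAME_PATTERN.fullmatch(name) is not None


def next_model_version(previous: Optional[str]) -> str:
    """Return V1 for a new model, otherwise the previous version plus 1.

    Assumption: version labels look like "V3".
    """
    if previous is None:
        return "V1"
    return f"V{int(previous[1:]) + 1}"


print(is_valid_model_name("my-model_1"))  # prints True
print(is_valid_model_name("model-"))      # prints False
print(next_model_version("V3"))           # prints V4
```

Because the version number is generated by the system and cannot be modified, a function like `next_model_version` would only be useful for predicting the label, not for setting it.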
- On the training details page, manage event notifications of the training job.
- Event notifications cannot be configured for training jobs in the Completed, Failed, Abnormal, or Terminated state.
- To set up event notifications, you need permission to view jobs.
- For modification events, only the updated training status is included in the notification.
After event notification is enabled, you will be notified of a specific event, such as a job status change or suspected suspension, through an SMS message or email. Notifications will be billed based on SMN pricing. For details, see Billing.
- If event notification has been enabled for a training job, you can click the icon next to Enabled to modify or disable event notification.
- If event notification has not been enabled for a training job, you can click the icon next to Disabled to enable event notification.
Table 3 Event notification parameters
| Parameter | Description |
|---|---|
| Topic | Topic name of event notification. You can select a topic from the drop-down list or click Create now to create a topic on the SMN console. NOTE: You can create a topic on the SMN console, add a subscription to it, and confirm the subscription status. Once these steps are completed, you will be notified of the event. |
| Event | Select events you want to subscribe to. Examples: JobStarted, JobCompleted, JobFailed, JobTerminated, and JobHanged. NOTE: Only training jobs using GPUs or NPUs support JobHanged events. |
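The constraints on event notification stated in this section (no configuration for jobs in the Completed, Failed, Abnormal, or Terminated state, and JobHanged only for GPU or NPU jobs) can be summarized in a short sketch. This is a local illustration under assumptions, not a ModelArts or SMN API: `check_notification_request` and its parameters are hypothetical, and the accelerator is assumed to be reported as the string "GPU" or "NPU", or None for other jobs.

```python
# Job states in which event notifications cannot be configured.
LOCKED_STATES = frozenset({"Completed", "Failed", "Abnormal", "Terminated"})

# Accelerator types whose jobs support the JobHanged event.
HANG_DETECTION_ACCELERATORS = frozenset({"GPU", "NPU"})


def check_notification_request(job_status, accelerator, events):
    """Return a list of problems with the request; an empty list means none found.

    job_status:  current state of the training job, e.g. "Running".
    accelerator: "GPU", "NPU", or None (hypothetical representation).
    events:      iterable of event names to subscribe to.
    """
    problems = []
    if job_status in LOCKED_STATES:
        problems.append(
            f"event notifications cannot be configured in the {job_status} state"
        )
    if "JobHanged" in events and accelerator not in HANG_DETECTION_ACCELERATORS:
        problems.append("JobHanged is only supported for GPU or NPU training jobs")
    return problems


print(check_notification_request("Running", "NPU", ["JobFailed", "JobHanged"]))
# prints []
print(check_notification_request("Completed", None, ["JobHanged"]))
```

The sketch does not reject unknown event names, because the event list in Table 3 is given only as a set of examples.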