Viewing Training Jobs and Details (New Console)

Description

This topic is specific to CN Southwest-Guiyang1. The console uses the new UI version.

When using ModelArts model training, you may need to view the list of ongoing training jobs and their details to ensure the training process is proceeding smoothly.

Through the job list page, you can easily monitor the status of each job, customize the displayed content, filter by required attributes, and quickly locate specific training jobs.

Click the target job name to view its details.

Viewing Training Jobs

1. Log in to the ModelArts console.
2. In the navigation pane, choose Model Build > Training.
   - On the displayed page, you can view key information such as the job name/ID, status, priority, training mode, progress, duration, tags, creation time, creator, and description. Some columns support filtering or sorting. Click the icon to the right of the job search box to choose which columns the job list displays.
   - You can select All, Mine, or a custom time range to quickly filter the visible job list.
   - In the search box above the training job list, you can filter training jobs by attribute type, such as name, ID, status, and training mode.
   - In the Operation column, you can clone, terminate, or delete a specific job.
3. In the training job list, filter jobs by Training Mode (for example, Fine-tuning or Custom). Click the target job name to view its details.

Viewing Details About a Fine-Tuning Job

In the training job list, set the Training Mode filter to Fine-tuning and click a job name to open the fine-tuning job details page. The details page includes the model output, task details, events, logs, resource usage, and tags. In the upper right corner, you can perform actions such as Clone, Delete, or Retry, depending on the current status.

The following sections describe the features available on each tab:

Model output
The following model parameters can be viewed:

Final model asset: Name of the model generated after training. If the model name changes, this asset name will be updated accordingly.

Training type and objectives: Shows the specific fine-tuning method and objectives used for the job.

Training loss: Displays a loss curve graph. This visualization helps you evaluate whether the model meets your requirements.

Model asset information: Once fine-tuning is complete, the model must be published as an asset to be used in downstream tasks. You can click View assets to go to the details page of the fine-tuned model. For different versions of the same model, you can view the versions that have been published as assets on the model details page.
Task details
This tab displays the configuration settings used for the fine-tuning job, including basic information, dataset information, training configuration, resource configuration, HA configuration, system configuration, and workflow node details. For details, see Step 2: Configuring Fine-Tuning Parameters.
Figure 1 Task details
Events
This tab provides a timeline of events reported during job execution. This helps you monitor the progress of different training stages and detect potential anomalies early.

Figure 2 Events
Logs
This tab displays critical runtime output. Use these logs to monitor the model fine-tuning process in detail and to troubleshoot or debug issues during execution. You can select different nodes and view their logs.

Figure 3 Logs
Resource usage
This tab provides monitoring of metrics during the job run, including the usage of CPU, NPU, and file systems. You can select different nodes and view their resource usage.
Figure 4 Resource usage
Tags
Displays the tags set for the fine-tuning job. Tags can also be edited.

Figure 5 Tags

Viewing Details About a Custom Job

In the training job list, set the Training Mode filter to Custom and click a job name to open the custom job details page.

The following table describes the features available on each tab.

**Table 1** Features available on each tab
Category	Description
Train Details	Basic job information.
Event	View the events of a training job and manage its event notifications on the training details page. Throughout the lifecycle of a training job, starting from the stage visible to you, the system records every key event. You can view these records at any time on the details page of the training job to track its progress and status.
Log	Training logs record the execution process and error information of training jobs, helping you quickly locate issues that occur during operation. Both standard output and standard error from your code are displayed in the training logs.
Cloud Shell	ModelArts provides Cloud Shell, which allows you to log in to a running container to debug training jobs in the production environment.
Monitoring	View the resource usage and monitoring metrics of a training job to quickly learn about its status. Resource usage: On the Monitoring tab of the training job details page, you can view the CPU, GPU, and NPU usage for the job or for a single node. Monitoring metrics: Receiving and promptly addressing alarms during a training job (for example, abnormal loss values) saves time and resources by preventing invalid job runs. Metric monitoring also lets you track the job's progress in real time and the model's training status across different phases.
Intelligent O&M	ModelArts monitors training jobs in real time for smooth operations. The training job details page includes intelligent O&M tools for easy monitoring and maintenance.
Evaluation Results	After a training job has been executed, ModelArts evaluates your model and provides optimization diagnosis and suggestions.
Tags	You can add tags to a training job for quick search.

Training Job Details

The job details page displays the basic information about a job.

**Table 2** Task details
Category	Parameter	Description
Basic Information	Name	Name of the training job.
	ID	Unique ID of the training job.
	Status	Status of the training job. The statuses include Completed, Completed (retained), Pending, Running, Creating, Terminating, Terminated, Failed, Failed (retained), Abnormal, and Deleting.
	Created	Time when the training job is created.
	Duration	Duration of a training job, which is the total duration of Kubernetes resources in the entire lifecycle of a training job.
	Description	Description of the training job. When left unset, -- appears. Click to edit the training job's description.
Training configuration	Image Type	Type of the image selected for the training job. Preset images and custom images are supported.
	Image	Name of the image selected for the training job.
	Image Address	SWR address of the image.
	Code Directory	OBS path to the code directory of the training job. If this parameter is not configured, -- appears. You can click to update the code.
	Code Backup Directory	OBS backup directory containing the contents of the training job code directory.
	Local Code Directory	Path to the training code in the training container.
	Boot Command	Command for booting an image.
	Environment Variable	Environment variables set for a training job.
Resource configuration	Resource Type	Type of the resource pool selected for the training job. You can select a dedicated resource pool or a public resource pool.
	Resource Pool	Name of the resource pool selected for the training job. This parameter is only available when the training job uses a dedicated resource pool.
	Specifications	Instance specifications of the training job, shown both as chosen during job creation and as allocated to the training containers. Target Specifications: The compute resources, such as CPUs and memory, selected when the job was created. Actual Specifications: The resources the platform assigns to the training container while the job runs. The platform reserves part of the target specifications for the OS, Kubernetes system components, and resource pool plugins. Consequently, the resources actually available to you are typically less than the target specifications.
	Compute Nodes	Number of instances for the training job.
	Compute Node ID	By default, the number of current compute nodes is displayed. Click to display names and IP addresses of the compute nodes used by the training job. This parameter is only displayed when the training job uses a dedicated resource pool.
	Job Scheduling Priority	Priority of a training job created using a dedicated resource pool. This parameter is not displayed for training jobs created using a public resource pool. The platform schedules jobs from the highest priority to the lowest. Jobs with the same priority are scheduled in the order they were submitted, so when resources become available, the earliest-submitted job is processed first. The priority can be set to 1, 2, or 3. A larger number indicates a higher priority. The default priority is 1, and the highest priority is 3. If a training job stays in the Pending state for a long time, you can raise its priority to reduce the queuing duration. For details, see Priority of a Training Job.
	Preemption	When using a dedicated resource pool, you can set this parameter. This parameter is not displayed when a public resource pool is used. When enabled, jobs that allow preemption may be terminated and re-queued if resource pool capacity is insufficient. To avoid losing training progress, configure resumable training before enabling this function. For details, see Resumable Training. Disabled is displayed when it is not set.
Data Configuration	Training Dataset	Name of the dataset used for the training job. If it is not configured or is not enabled, Disabled is displayed.
HA Settings	Fault Tolerance and Recovery	Number of times that the training job automatically restarts upon a fault. This parameter is only available when Fault Tolerance and Recovery is enabled during training job creation. The maximum number of restarts and the number of restarts are displayed. If it is not configured or is not enabled, Disabled is displayed.
	Unconditional Auto Restart	Displays when Fault Tolerance and Recovery is enabled. When Unconditional Auto Restart is enabled during job creation, Open is displayed. If it is not configured or is not enabled, Disabled is displayed.
	Restart Upon Suspension	Displays when Fault Tolerance and Recovery is enabled. When Restart Upon Suspension is enabled during job creation, Open is displayed. If it is not configured or is not enabled, Disabled is displayed.
Publish to Assets	Publish to Assets	When it is enabled, Enabled is displayed. The system will automatically publish model artifacts as assets, enabling operations such as inference and evaluation on the platform. If it is not configured or is not enabled, Disabled is displayed.
	Model Output Path	The storage path and cloud mount path are separated by a vertical bar (|). Directory: location where the model is stored after training is complete. Mount Path: This parameter is displayed when you use a dedicated resource pool. The system mounts the file directory in the storage location to the specified path in the training container. You can customize this path, but system directories such as /home/, /home/ma-user/, and /home/ma-user/modelarts/ are not supported. If this parameter is not configured, -- is displayed.
	Auto-publish to Assets	When it is enabled, Enabled is displayed. The trained model will be automatically uploaded to the Asset Management > Models > My Models page. If it is not configured or is not enabled, Disabled is displayed.
	Model Name	Displays when Auto-publish to Assets is enabled. Name of the new model. Enter 2 to 128 characters. Only letters, digits, hyphens (-), and underscores (_) are allowed. The name must start with a letter and end with a letter or digit.
	Model Type	Displays when Auto-publish to Assets is enabled. Model type of the published model.
	Model Brand	Displays when Auto-publish to Assets is enabled. Model brand.
	Model Version	Displays when Auto-publish to Assets is enabled. If the model is published as a new model, the version number is V1. If the model is published as a new version of an existing model, the version number is automatically incremented by 1 based on the previous version number of the model. Note: The model version number cannot be modified and is automatically generated by the system.
	Description	Displays when Auto-publish to Assets is enabled. Description of the trained model. This field is optional and can contain a maximum of 256 characters.
Access Settings	JupyterLab	JupyterLab address of the training job. This parameter is available only when JupyterLab is selected for Training Application. If it is not configured, Disabled is displayed.
	Remote SSH	Key pair and SSH address for SSH remote development of the training job. This parameter is available only for training jobs with remote SSH enabled. If it is not configured, Disabled is displayed.
	Password-free SSH Between Instances	Information about the password-free SSH file configured for the training job. If it is not configured, Disabled is displayed.
Observability Settings	TensorBoard	TensorBoard is the visualization toolkit of TensorFlow. It provides the visualization functions and tools required for machine learning experiments, and displays the computational graph, metric trends, and data used during training. For details about TensorBoard, see the official website. This parameter is not displayed when a public resource pool is used. If enabled, the configured storage path is displayed. If it is not configured, Disabled is displayed.
	MindStudio Insight	MindStudio Insight visualizes information such as scalars, images, computational graphs, and model hyperparameters during training. It supports training jobs based on the MindSpore engine. For details about MindStudio Insight, see the MindSpore official website. This parameter is not displayed when a public resource pool is used. If enabled, the configured storage path is displayed. If it is not configured, Disabled is displayed.
	Interconnect Metrics with AOM	HTTP-based metrics: For training jobs with this parameter configured, the configured collection URL and port will be displayed. CLI-based metrics: For training jobs with this parameter configured, the configured execution command and command parameters will be displayed. If it is not configured, Disabled is displayed.
More Configurations	Persistent Log Saving	After a log path is configured, the configured path is displayed. If it is not configured, Disabled is displayed.
	Job Visibility	The options are Workspace and Creator. Workspace: The created training job is visible to all users in the current workspace. Creator: Only the creator can view the job by default. To access it, other users must request the modelarts:trainJob:listAll permission, which allows them to view all training jobs, including those limited to the creator.
	Auto Stop	The auto stop time configured for the training job is displayed. The options are 1 hour, 2 hours, 4 hours, 6 hours, or custom. The custom value ranges from 1 to 720 hours. When you enable this function, the training stops automatically when the time limit is reached. The time limit does not count down when the training is paused. If it is not configured, Disabled is displayed.
	Retention Period	The retention period configured for the training job is displayed. The options are 1 hour, 2 hours, 4 hours, 6 hours, or custom. The custom value ranges from 1 to 720 hours. When you enable this parameter and set the duration, the training environment stays active for that time after the job succeeds or fails. The waiting time in the queue does not count toward this duration. If it is not configured, Disabled is displayed.
	Event Notification	Topic and events set for event notification during training job creation. If it is not configured, Disabled is displayed.
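
Two of the rules in Table 2 may be easier to follow as code: the order in which queued jobs are scheduled (Job Scheduling Priority) and the naming rule for a published model (Model Name). The following is a minimal sketch under stated assumptions, not ModelArts code. QueuedJob, scheduling_order, and is_valid_model_name are hypothetical names introduced for illustration, "letters" is assumed to mean ASCII letters, and the ordering ignores resource availability, which also affects when a job actually starts.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedJob:
    """Hypothetical record for a pending job; not a ModelArts type."""
    name: str
    priority: int      # 1, 2, or 3; a larger number means a higher priority
    submit_order: int  # a smaller number means the job was submitted earlier


def scheduling_order(jobs):
    """Sort by priority (highest first), then by submission order."""
    for job in jobs:
        if job.priority not in (1, 2, 3):
            raise ValueError(f"priority must be 1, 2, or 3: {job.name}")
    return sorted(jobs, key=lambda j: (-j.priority, j.submit_order))


# 2 to 128 characters: the first is a letter, the last is a letter or digit,
# and only letters, digits, hyphens, and underscores appear in between.
# Assumption: "letters" means ASCII letters.
MODEL_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]{0,126}[A-Za-z0-9]")


def is_valid_model_name(name):
    """Check a name against the Model Name rule quoted in Table 2."""
    return MODEL_NAME_RE.fullmatch(name) is not None


queue = [
    QueuedJob("job-a", priority=1, submit_order=1),
    QueuedJob("job-b", priority=3, submit_order=2),
    QueuedJob("job-c", priority=3, submit_order=3),
    QueuedJob("job-d", priority=2, submit_order=4),
]
print([job.name for job in scheduling_order(queue)])
print(is_valid_model_name("llm_ft-v2"), is_valid_model_name("_draft"))
```

In this sketch, job-b and job-c share priority 3, so the earlier submission (job-b) comes first, and job-a waits behind every higher-priority job even though it was submitted earliest.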

You can manage the event notifications of a training job on the training details page.

Event notifications cannot be configured for training jobs in the Completed, Failed, Abnormal, or Terminated state.
To set up event notifications, you need permission to view jobs.
For modification events, the notification reports only the updated training status.

After event notification is enabled, you will be notified of a specific event, such as a job status change or suspected suspension, through an SMS message or email. Notifications will be billed based on SMN pricing. For details, see Billing.

If event notification has been enabled for a training job, click the icon next to Enabled to modify or disable event notification.
If event notification has not been enabled for a training job, click the icon next to Disabled to enable event notification.

**Table 3** Event notification parameters
Parameter	Description
Topic	Topic name of event notification. You can select a topic from the drop-down list or click Create now to create a topic on the SMN console. NOTE: You can create a topic on the SMN console, add a subscription to it, and confirm the subscription status. Once these steps are completed, you will be notified of the event.
Event	Select events you want to subscribe to. Examples: JobStarted, JobCompleted, JobFailed, JobTerminated, and JobHanged. NOTE: Only training jobs using GPUs or NPUs support JobHanged events.
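
The constraints stated in this section (no configuration for jobs in the Completed, Failed, Abnormal, or Terminated state, and JobHanged only for jobs using GPUs or NPUs) can be sketched as a small pre-check. This is an illustrative sketch only: check_subscription and its parameters are hypothetical and are not part of ModelArts or SMN.

```python
# States in which event notifications cannot be configured, as listed above.
FINAL_STATES = {"Completed", "Failed", "Abnormal", "Terminated"}


def check_subscription(job_status, uses_gpu_or_npu, events):
    """Return the reasons a subscription would not apply (empty if none)."""
    problems = []
    if job_status in FINAL_STATES:
        problems.append(
            f"notifications cannot be configured in the {job_status} state"
        )
    if "JobHanged" in events and not uses_gpu_or_npu:
        problems.append("JobHanged is supported only for jobs using GPUs or NPUs")
    return problems


# A running CPU-only job that subscribes to JobHanged triggers one warning.
print(check_subscription("Running", False, ["JobFailed", "JobHanged"]))
# The same subscription on a job that uses GPUs or NPUs passes the check.
print(check_subscription("Running", True, ["JobFailed", "JobHanged"]))
```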

