Creating a Custom Training Job (New Console)
Description
This topic is specific to CN Southwest-Guiyang1. The console uses the new UI version.
Developing models involves optimizing their performance effectively. Traditional methods require repeatedly testing various model structures, datasets, and hyperparameters, which takes significant time and effort but may still fail to deliver good results. ModelArts simplifies this process by offering tools for creating training jobs, tracking progress in real time, and managing versions. With ModelArts, you can test different configurations easily and identify the best-performing setup faster.
Create a production training job in either of the following ways:
- Create a production training job on the ModelArts console. This section describes the procedure for the new UI. For the procedure on the old console, see Creating a Training Job (Old Console).
- Use the ModelArts API to create a production training job. For details, see Using PyTorch to Create a Training Job (New-Version Training).
Constraints
- Supported region: This feature is only available in the CN Southwest-Guiyang1 region.
- Job quota: By default, you can create up to 10,000 training jobs.
- Storage: ModelArts does not support OBS buckets with bucket encryption enabled. Ensure this option is disabled when creating your OBS bucket.
Prerequisites
- Account not in arrears (paid resources required for training jobs).
- Data for training uploaded to an OBS directory.
- At least one empty folder in OBS for storing training output. ModelArts does not support encrypted OBS buckets, so do not enable bucket encryption when creating the bucket.
- OBS directory and ModelArts in the same region.
- Access authorization configured. If you have not yet configured access, follow the instructions in Configuring Agency Authorization for ModelArts with One Click.
Billing
Model training in ModelArts uses compute and storage resources, which are billed. Compute resources are billed for running training jobs. Storage resources are billed for storing data in OBS or SFS. For details, see Model Training Billing Items.
Procedure
To create a training job, follow these steps:
Step 1: Accessing the Creation Page: Log in to the console and navigate to the training job list.
Step 2: Choosing the Training Mode: Configure the training mode.
Step 3: Setting Basic Information: Define the job name, description, and other basic details.
Step 4: Defining Training Configuration: Configure parameters such as the image, boot command, and environment variables.
Step 5: Configuring Resources: Specify the resource pool type, specifications, number of instances, storage mounts, job priority, and preemption settings.
Step 6: Configuring Data: Configure a dataset.
Step 7: Publishing Models to Assets: Configure whether to publish the trained model to assets.
Step 8: Configuring HA: Configure automatic restart policies (including unconditional restarts and restarts upon job suspensions).
Step 9: Managing Access Configuration: Configure debugging options, SSH remote development, and password-free SSH between nodes.
Step 10: Enabling Observability: Configure TensorBoard, MindStudio Insight, and Prometheus metric collection.
Step 11: Adjusting Additional Configurations: Configure logging, job visibility, automatic stop, event notifications, and tags.
Step 12: Submitting and Viewing the Job: Submit the job and view the training job details.
Step 1: Accessing the Creation Page
- Log in to the ModelArts console.
- In the navigation pane, choose Model Build > Training.
- Click Create Training Job. The new UI is displayed by default. The following describes how to create a training job on the new UI.
Step 2: Choosing the Training Mode
| Training Mode | Description |
|---|---|
| Fine-Tuning | Ideal for scenarios where you need to fine-tune existing pre-trained models, such as Pangu models or ResNet. |
| Custom Job | Designed for scenarios requiring full control over the training workflow, such as training with proprietary code, custom algorithms, or custom Docker images. For this example, select Custom Job. |
Step 3: Setting Basic Information
| Parameter | Description |
|---|---|
| Name | Job name, which is mandatory. The system automatically generates a name, which you can change according to the console's naming rules. |
| Description (Optional) | Job description, which helps you learn about the job information in the training job list. Enter 0 to 256 characters. Only letters, digits, spaces, hyphens (-), underscores (_), commas (,), and periods (.) are supported. |
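The description rule in the table above (0 to 256 characters from a fixed character set) can be checked before submitting a job. The following is a minimal sketch, not part of ModelArts: the function name `is_valid_description` is hypothetical, and it assumes that "letters" means ASCII letters, which the source does not state.

```python
import re

# Allowed characters per the table above: letters, digits, spaces,
# hyphens (-), underscores (_), commas (,), and periods (.).
# Assumption: "letters" is read as ASCII A-Z and a-z.
_DESCRIPTION_RE = re.compile(r"[A-Za-z0-9 \-_,.]{0,256}")


def is_valid_description(text: str) -> bool:
    """Return True if text has 0 to 256 characters, all from the allowed set."""
    return _DESCRIPTION_RE.fullmatch(text) is not None
```

An empty string passes because the description is optional and the minimum length is 0.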
Step 4: Defining Training Configuration
| Parameter | Description |
|---|---|
| Preset Template (Optional) | Click Select Preset Template to filter templates by type (currently text generation and image understanding) or brand (currently Qwen). Some preset templates automatically fill in the description (optional), image, boot command, code directory, local code directory, and environment variables of the current job. The fields that are filled in depend on the template, so refer to the actual GUI. You can adjust the configuration as needed. |
| Select Image | Specifies the container image used to run the training code. The following options are available: Preset Images: Ready-to-use images provided by ModelArts that include popular frameworks (e.g., PyTorch 1.8, TensorFlow 2.1). Ideal for most standard scenarios. Custom Images: Select an image that you have created and pushed to the SWR image repository or a registered image. Custom images must be registered in ModelArts Image Management before use. This option is recommended when preset base images do not meet specific dependency requirements. |
| Boot Command | Defines the command executed upon container startup to launch your training script. The boot command supports multiple commands concatenated with ; or &&. Note that demo-code represents the leaf directory of the OBS path where your code is stored; adjust this according to your actual project structure. NOTE: To ensure data security, do not include sensitive information such as plaintext passwords. |
| Code Directory (Optional) | Specifies the OBS directory containing the training code. This parameter is required when using a preset image and optional when using a custom image. You can choose your own OBS bucket or enter a path. The path must start with obs:// and end with a slash (/), like this: obs://bucketname/path/. For shared buckets from other users, you must enter the path. In the OBS bucket, files with the .txt, .py, .sh, and .yaml extensions can be edited online, and files with the .log, .json, and .md extensions can be viewed online. |
| Code Backup Directory (Optional) | Specifies the OBS directory where you want to back up the training code file. |
| Local Code Directory | Specifies the local directory within the training container where the code will be downloaded. The default path is /home/ma-user/modelarts/user-job-dir. The path cannot be set to /home/ma-user or any subdirectory under /home/ma-user/modelarts/*, /home/ma-user/modelarts-dev/*, or /home/ma-user/infer/*. Click Preview Runtime Environment to view the actual working directory of the training job. |
| Environment Variable | Allows you to add custom environment variables based on service requirements. For predefined environment variables in the training container, see Managing Environment Variables of a Training Container. NOTE: To ensure data security, do not include sensitive information such as plaintext passwords. |
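The path rules for Code Directory and Local Code Directory in the table above can be checked programmatically. The sketch below is illustrative only: the function names are hypothetical, and two points are assumptions rather than documented behavior. First, the documented default path is accepted even though it sits under /home/ma-user/modelarts/. Second, only paths strictly below the three restricted directories are rejected.

```python
from pathlib import PurePosixPath

# Default local code directory stated in the table above.
DEFAULT_LOCAL_CODE_DIR = PurePosixPath("/home/ma-user/modelarts/user-job-dir")

# Directories whose subdirectories cannot be used, per the table above.
RESTRICTED_PARENTS = (
    PurePosixPath("/home/ma-user/modelarts"),
    PurePosixPath("/home/ma-user/modelarts-dev"),
    PurePosixPath("/home/ma-user/infer"),
)


def is_valid_obs_dir(path: str) -> bool:
    """Check that path looks like obs://bucketname/path/ (prefix, bucket, trailing slash)."""
    if not (path.startswith("obs://") and path.endswith("/")):
        return False
    bucket = path[len("obs://"):].split("/", 1)[0]
    return bool(bucket)


def is_allowed_local_code_dir(path: str) -> bool:
    """Check a container path against the Local Code Directory restrictions."""
    p = PurePosixPath(path)
    if not p.is_absolute():
        return False
    if p == DEFAULT_LOCAL_CODE_DIR:
        return True  # assumption: the documented default is always acceptable
    if p == PurePosixPath("/home/ma-user"):
        return False
    # Reject anything strictly below a restricted directory.
    return not any(parent in p.parents for parent in RESTRICTED_PARENTS)
```

The same OBS check applies to any other field that follows the obs://bucketname/path/ convention.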
Step 5: Configuring Resources
| Parameter | Description |
|---|---|
| Source of resources | |
| Resource Pool | This parameter appears only for dedicated resource pools. In the Resource Pool section, click Select Resource Pool and choose your desired dedicated resource pool or logical subpool from the menu on the right. Click OK. You can view the dedicated resource pool name, node pool specifications, number of available nodes/maximum number of nodes, number of available NPU/GPUs, available CPUs (vCPUs), available memory (GiB), and resource fragments. Hover over View in the Resource Fragment column to check fragment details and check whether the resource pool meets the training requirements. Once you choose a resource pool, its details appear. To choose a different one, click Reselect. |
| Specification Type | Displayed when you select a dedicated resource pool. The supported specification types are shown in Figure 1 Specifications. |
| Specifications | Determines the hardware specifications for the training instances. For a dedicated resource pool, you must select a pool first. For a public resource pool, select a specification directly from the list. |
| Compute Nodes | Select the number of instances as required. The default value is 1. |
| Specify Affinity Nodes | Supported only for dedicated resource pools. It allows you to configure supernode and node affinity for training jobs. Select the checkbox to enable it. When enabled, it allows fine-grained control over pod deployment strategies, including: strict placement (strong affinity), preferred placement (weak affinity), prohibited placement (strong anti-affinity), and avoided placement (weak anti-affinity). Affinity Type: Node affinity: Requires all instances of a training job to be scheduled on selected nodes, either strictly or preferentially. Node anti-affinity: Requires all instances of a training job to be avoided or strictly excluded from selected nodes. Strength: The degree of affinity. Weak: The system will try to place the pod on the specified node, but it is not guaranteed. Strong: The pod must be scheduled onto the specified node; otherwise, scheduling will not proceed. Supernode Affinity Method: Supported only for supernode resource pools. At the supernode level, currently only scenarios where all instances of a training job belong to one affinity group are supported. This is suitable for training jobs where traffic must not cross supernodes. Random child nodes: The system randomly schedules tasks to child nodes within the target supernode. Specify child nodes: The system schedules tasks to the specified child nodes. Select Supernode: Choose the supernode(s) to be configured. Supported only for supernode resource pools. Select Node: Choose the node(s) to be configured. |
| Storage Mounting | Enables mounting high-performance storage to improve data access efficiency. Dedicated resource pools support multiple options. For details, see Table 1. |
| Job Scheduling Priority | |
| Preemption | |
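The four placement strategies described under Specify Affinity Nodes (strong or weak, affinity or anti-affinity) can be pictured as a filter and an ordering over candidate nodes. The following is a conceptual sketch only, not the ModelArts scheduler: `rank_nodes`, the enum names, and the node names in the usage note are all hypothetical.

```python
from enum import Enum
from typing import Iterable, List


class AffinityType(Enum):
    AFFINITY = "node affinity"
    ANTI_AFFINITY = "node anti-affinity"


class Strength(Enum):
    STRONG = "strong"
    WEAK = "weak"


def rank_nodes(
    available: Iterable[str],
    selected: Iterable[str],
    affinity: AffinityType,
    strength: Strength,
) -> List[str]:
    """Return the nodes a scheduler could use, most preferred first."""
    chosen = set(selected)
    available = list(available)
    inside = [n for n in available if n in chosen]
    outside = [n for n in available if n not in chosen]
    if affinity is AffinityType.AFFINITY:
        preferred, fallback = inside, outside
    else:
        preferred, fallback = outside, inside
    if strength is Strength.STRONG:
        # Strong: only the preferred group is eligible. An empty result
        # corresponds to scheduling not proceeding.
        return preferred
    # Weak: the preferred group is tried first, the rest remain eligible.
    return preferred + fallback
```

For example, with available nodes n1, n2, n3 and n2 selected, strong affinity yields only n2, while weak anti-affinity yields n1 and n3 first and keeps n2 as a last resort.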
Step 6: Configuring Data
| Parameter | Description |
|---|---|
| Training Dataset | Training datasets are used to improve model performance on specific tasks. You can select up to three datasets for training. You can select both Preset Data and My Data. Preset Data: template data officially provided by the platform. My Data: personal data uploaded by you. If existing data assets cannot be selected, go to Asset Management > Data to publish them. |
Step 7: Publishing Models to Assets
Step 8: Configuring HA
| Parameter | Description |
|---|---|
| Fault Tolerance and Recovery | Specifies whether to enable automatic restart for the training job. |
| Maximum Restarts | This parameter is available when Fault Tolerance and Recovery is selected. The training job stops if it is still abnormal after the maximum number of automatic restarts. The value cannot be changed once the training job is created. Set this parameter based on your needs. |
| Unconditional Auto Restart | This parameter is available when Fault Tolerance and Recovery is selected. If Unconditional auto restart is selected, the training job will be restarted unconditionally once the system detects a training exception. To prevent invalid restarts, it supports a maximum of three consecutive unconditional restarts. |
| Restart Upon Suspension | This parameter is available when Fault Tolerance and Recovery is selected. ModelArts continuously monitors job processes to detect suspension and optimize resource usage. When this feature is enabled, suspended jobs can be automatically restarted at the process level. CPU specifications do not support restarts upon suspension. ModelArts does not verify code logic, and suspension detection is periodic, so false reports are possible; by enabling this feature, you accept that possibility. To prevent unnecessary restarts, ModelArts limits consecutive restarts to three. |
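The effect of Maximum Restarts can be illustrated with a small loop: the job is retried until it succeeds or the restart budget is used up. This is a conceptual sketch, not how ModelArts implements fault recovery. `run_with_restarts` and `run_once` are hypothetical names, and the separate limit of three consecutive unconditional or suspension restarts is not modeled.

```python
from typing import Callable, Tuple


def run_with_restarts(run_once: Callable[[], bool], max_restarts: int) -> Tuple[bool, int]:
    """Run a job attempt repeatedly within a restart budget.

    run_once returns True when the attempt finishes normally and False
    when it ends abnormally. Returns (succeeded, restarts_used).
    """
    restarts = 0
    while True:
        if run_once():
            return True, restarts
        if restarts >= max_restarts:
            # Still abnormal after the maximum number of restarts: stop.
            return False, restarts
        restarts += 1
```

With a budget of 3, a job that fails twice and then succeeds ends with two restarts used; a job that never succeeds stops after the third restart.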
Step 9: Managing Access Configuration
| Parameter | Description |
|---|---|
| JupyterLab | Enables online debugging and development via integrated tools such as JupyterLab. |
| Remote SSH | Allows remote connection to training job instances from a local IDE for real-time debugging and execution. The system automatically starts the SSHD service for each instance and configures SSH passwordless login between instances to facilitate cross-node collaboration. Requires a key pair to be created. If enabled, Password-free SSH Between Instances will be unavailable. |
| Password-free SSH Between Instances | Specifies whether to generate SSH passwordless mutual trust files between instances. |
Step 10: Enabling Observability
| Parameter | Description |
|---|---|
| TensorBoard | TensorBoard is the visualization toolkit of TensorFlow. It provides the visualization functions and tools required for machine learning experiments, displaying the computational graph, metric trends, and data used during training. For details about TensorBoard, see the official website. This parameter is not supported when a public resource pool is used. The configured storage location holds the results generated by TensorBoard. |
| MindStudio Insight | MindStudio Insight visualizes information such as scalars, images, computational graphs, and model hyperparameters during training. It supports training jobs based on the MindSpore engine. For details about MindStudio Insight, see the MindSpore official website. This parameter is not supported when a public resource pool is used. The configured storage location holds the results generated by MindStudio Insight. |
| Interconnect Metrics with AOM | Specifies whether to enable Prometheus metrics collection. Configure parameters in your training container to collect Prometheus metrics. Once set up, the system periodically gathers metric data during training and uploads it to AOM, allowing you to monitor custom Prometheus metrics via the AOM console. ModelArts provides two configuration methods. |
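Custom Prometheus metrics are usually published in the Prometheus text exposition format. The sketch below only shows how a training script could render one gauge in that format. It does not cover either of the ModelArts configuration methods, which are not detailed here, and the function name `render_gauge` and the metric name in the usage note are hypothetical.

```python
from typing import Dict, Optional


def _escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(
    name: str,
    help_text: str,
    value: float,
    labels: Optional[Dict[str, str]] = None,
) -> str:
    """Render a single gauge sample in the Prometheus text exposition format."""
    label_part = ""
    if labels:
        pairs = ",".join(
            f'{key}="{_escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        label_part = "{" + pairs + "}"
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name}{label_part} {value}\n"
    )
```

For instance, `render_gauge("train_loss", "Current training loss", 0.25, {"rank": "0"})` produces a HELP line, a TYPE line, and the sample line `train_loss{rank="0"} 0.25`.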
Step 11: Adjusting Additional Configurations
| Parameter | Description |
|---|---|
| Persistent Log Saving | Enabled by default when Ascend specifications are selected, and available as an option when CPU or GPU specifications are selected. |
| Log Path | When Persistent Log Saving is enabled, you must configure a log path to store log files generated by the training job. Ensure that you have read and write permissions to the selected OBS directory. You can choose your own OBS bucket or enter a path. The path must start with obs:// and end with a slash (/), like this: obs://bucketname/path/. For shared buckets from other users, you must enter the path. |
| Job Visibility | The options are Workspace and Creator. |
| Auto Stop | Choose whether to enable Auto Stop. |
| Retention Period | Choose whether to keep the training container environment after the job succeeds or fails, and set how long to keep it. Note: You will still be charged for the training environment during the retention period. Set the retention time based on your needs. |
| Event Notification | Choose whether to enable event notification for the training job. |
| Tags | TMS's predefined tags are recommended for adding the same tag to different cloud resources. For details about how to use tags, see Using TMS Tags to Manage Resources by Group. You can add up to 20 tags to a training job. |
Step 12: Submitting and Viewing the Job
After setting the parameters, click Submit.
A training job runs for a period of time. You can go to the training job list to view the basic information about the training job.
- In the training job list, Status of a newly created training job is Pending.
- Once the training job shows Completed, it has finished. The system saves the created model in model assets for later access.
- If the status is Failed or Abnormal, click the job name to go to the job details page and view logs for troubleshooting.
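Checking the job status can also be scripted as a polling loop that stops at a terminal status. The following is a minimal sketch under stated assumptions: `wait_for_job` is a hypothetical helper, and `get_status` stands in for whatever call returns the status string, since the ModelArts API is not covered here. Only the status names mentioned above (Pending, Completed, Failed, Abnormal) are used.

```python
import time
from typing import Callable

# Statuses after which the job will not change state again.
TERMINAL_STATUSES = {"Completed", "Failed", "Abnormal"}


def wait_for_job(
    get_status: Callable[[], str],
    interval_s: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until a terminal status appears or max_polls is reached.

    Returns the last status seen, which may be non-terminal on timeout.
    """
    status = get_status()
    polls = 1
    while status not in TERMINAL_STATUSES and polls < max_polls:
        sleep(interval_s)
        status = get_status()
        polls += 1
    return status
```

If the returned status is Failed or Abnormal, the next step is the same as on the console: open the job details and inspect the logs.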
