Viewing Monitoring Metrics of a Training Job
Receiving and promptly addressing alarms during a training job (for example, abnormal loss values) can save significant time and resources, preventing the waste caused by invalid job runs. Additionally, metric monitoring allows you to track the training job's progress in real time and the model's training status across different phases.
In the Monitoring tab of the training job details page of ModelArts, you can view the CPU, GPU, or NPU usage of training jobs. For details, see Viewable Metrics on the ModelArts Console.
For more metrics, go to the AOM console. For details, see Viewing All ModelArts Monitoring Metrics on the AOM Console.
With ModelArts, you can track custom metrics in AOM, such as loss values, step durations, and GPU throughput. These metrics are displayed in your training logs, making it easy to monitor trends and compare results across different jobs.
You can report custom metrics to AOM in the following ways:
- Collecting Prometheus Metrics for AOM Over HTTP
  - This method helps you simplify operations and reduce code changes.
  - This method works well for users who are familiar with the Prometheus ecosystem and already have collection tools.
- Collecting Prometheus Metrics for AOM Using Commands
  - This method works for users who do not want to show metric parsing.
  - This method works well for users who are familiar with the Prometheus ecosystem and already have collection tools.
- Reporting Custom Metrics to AOM Using SDK
  - This method works well for users who need to add custom metrics, like adding custom dimensions or complex calculations.
  - This method works well for complex service logic and when you need to control metric reporting in your code.
  - This method works well for users who have experience with Huawei Cloud SDKs and know about cloud services.
Viewable Metrics on the ModelArts Console
Table 1 shows the metrics that can be viewed on the training job details page of the ModelArts console.
Table 1 shows the training metrics that can be viewed on the Overview page of the ModelArts console.
| Metric | Description | How to View |
|---|---|---|
| Training job resource usage | CPU, GPU, or NPU usage of a training job. | ModelArts console > Overview > Resource Usage of Training Jobs |
| Card-hours | Running duration and number of cards used by a training job. | ModelArts console > Overview > Resource Usage of Training Jobs |
Collecting Prometheus Metrics for AOM Over HTTP
Configure the collection process and HTTP API to let the platform obtain Prometheus metric data and send it to the AOM console automatically. This solution lets you set up metric collection and exposure without manual uploads. It is ideal for monitoring training job performance in real time.
Follow these restrictions when using this solution:
- Metric format: Custom metrics must follow the standard format of cloud native exporters. Otherwise, the data might not be parsed correctly. Metric format example:

      mspti_marker_range_cost_time{name="Step: 0", source_kind="host", process_id="1887317", thread_id="1887317"} 261981960 1732357956859823920
      mspti_marker_range_cost_time{name="Step matmul: 0", source_kind="host", process_id="1887317", thread_id="1887757"} 2366640 1732357956863054960

- Data volume: If more than 32 KB of metric data is reported within 10 seconds, data may be lost.
- Metric reporting frequency: If metrics are reported more than once per millisecond, some of them might be lost because of repeated timestamps.
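The format and data volume restrictions above can be checked in code before any data is exposed. The following is a minimal sketch, not part of ModelArts: the function names `format_sample` and `build_payload` are hypothetical, and treating the 32 KB figure as a ceiling on a single payload is a simplifying assumption (the restriction is stated per 10 seconds).

```python
MAX_PAYLOAD_BYTES = 32 * 1024  # Assumption: apply the 32 KB limit to each payload.


def escape_label_value(value):
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_sample(name, labels, value, timestamp=None):
    """Render one sample as: name{label="value",...} value [timestamp]."""
    label_text = ",".join(
        f'{key}="{escape_label_value(str(val))}"' for key, val in labels.items()
    )
    line = f"{name}{{{label_text}}} {value}" if label_text else f"{name} {value}"
    if timestamp is not None:
        line += f" {timestamp}"
    return line


def build_payload(lines):
    """Join samples into one text payload and fail early if it is too large."""
    payload = "\n".join(lines) + "\n"
    size = len(payload.encode("utf-8"))
    if size > MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload is {size} bytes, above the {MAX_PAYLOAD_BYTES}-byte limit")
    return payload


line = format_sample(
    "mspti_marker_range_cost_time",
    {"name": "Step: 0", "source_kind": "host"},
    261981960,
    1732357956859823920,
)
print(build_payload([line]), end="")
```

Failing early on an oversized payload makes the problem visible in the training logs instead of showing up later as silently missing data points.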
- Start the metric collection process in the training container. Start a separate process to collect custom metrics outside the training process. For example, you can use the Flask framework to create a simple HTTP server.

      from flask import Flask

      app = Flask(__name__)

      @app.route('/metrics')
      def get_metrics():
          # Generate or obtain custom metric data.
          metrics = """
      mspti_marker_range_cost_time{name="Step: 0", source_kind="host", process_id="1887317", thread_id="1887317"} 261981960 1732357956859823920
      mspti_marker_range_cost_time{name="Step matmul: 0", source_kind="host", process_id="1887317", thread_id="1887757"} 2366640 1732357956863054960
      """
          return metrics

      if __name__ == '__main__':
          app.run(host='0.0.0.0', port=8000)

- Configure an HTTP API. Make sure that the metric collection process exposes an HTTP API (for example, /metrics) so that the training platform can periodically obtain metric data.
After each HTTP API call, promptly clear the collected metrics to avoid memory leaks or performance issues.
- Enable Prometheus metric collection.
When creating a training job, enable Prometheus Metrics Collection, set Metrics Collection Method to HTTP, and configure the collection URL and port. The parameter settings must meet the following requirements:
- Make sure the URL and port number match those of the metric collection process. Otherwise, metrics might not be reported correctly.
- Make sure the network environment of the training job allows access to the configured URL and port number. Otherwise, metric collection might fail due to network issues.
- View custom monitoring metrics on AOM.
When a training job is running, log in to the AOM console and view the custom Prometheus metrics on the Metric Browsing page.
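The advice to clear collected metrics after each HTTP API call can be sketched as a small buffer that hands out pending samples exactly once. This is an illustration under stated assumptions: the class name `MetricBuffer` is hypothetical, and how the training process passes samples to the collection process is not covered here.

```python
import threading


class MetricBuffer:
    """Thread-safe buffer that returns pending metric lines exactly once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._lines = []

    def record(self, line):
        """Store one sample line until the next collection call."""
        with self._lock:
            self._lines.append(line)

    def drain(self):
        """Return all pending lines as one text payload and clear the buffer."""
        with self._lock:
            pending, self._lines = self._lines, []
        return "".join(f"{line}\n" for line in pending)


buffer = MetricBuffer()
buffer.record('step_time{rank="0"} 135.572 1686660980680')
first = buffer.drain()   # Contains the recorded sample.
second = buffer.drain()  # Empty, because the first call cleared the buffer.
```

In a handler such as the `/metrics` route of the Flask example, returning `buffer.drain()` would keep memory use bounded, because every sample is dropped as soon as it has been served once.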
Collecting Prometheus Metrics for AOM Using Commands
Configure the command and its parameters to let the platform obtain Prometheus metric data and send it to the AOM console automatically. This solution lets you set up metric collection and exposure without manual uploads. You can fully customize the metric parsing process. It is ideal for monitoring training job performance in real time.
Follow these restrictions when using this solution:
- Metric format: The custom metric data must be in text format. Each metric should follow this format: <Metric name>{<Label name>=<Label value>,...} <Sampling value> [Millisecond timestamp]. Example:

      # HELP http_requests_total The total number of HTTP requests.
      # TYPE http_requests_total gauge
      html_http_requests_total{method="post",code="200"} 1656 1686660980680
      html_http_requests_total{method="post",code="400"} 2 1686660980681

- Data volume: If more than 32 KB of metric data is reported within 10 seconds, data may be lost.
- Metric reporting frequency: If metrics are reported more than once per millisecond, some of them might be lost because of repeated timestamps.
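One way to stay clear of the repeated-timestamp restriction is to hand out timestamps from a single source that never returns the same millisecond twice. The sketch below is only one possible approach: the class name `UniqueMillisClock` is hypothetical, it is not thread-safe, and bumping a timestamp forward (rather than dropping the sample) is a design choice made here, not a platform requirement.

```python
import time


class UniqueMillisClock:
    """Hands out millisecond timestamps that never repeat within one process."""

    def __init__(self, now_ms=None):
        # now_ms can be injected for testing; it defaults to wall-clock milliseconds.
        self._now_ms = now_ms or (lambda: time.time_ns() // 1_000_000)
        self._last = 0

    def stamp(self):
        """Return the current time in ms, moved forward if it would repeat."""
        current = self._now_ms()
        if current <= self._last:
            current = self._last + 1  # Bump forward to avoid a repeated timestamp.
        self._last = current
        return current


clock = UniqueMillisClock()
sample = f'step_time{{rank="0"}} 135.572 {clock.stamp()}'
print(sample)
```

If samples are produced far faster than once per millisecond for a long time, the bumped timestamps drift ahead of the real time, so aggregating several steps into one sample may be the better fix in that case.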
- Create a custom metric file, like test.prom. Store this file in your training image or code, ensuring it is accessible to the training container.
- Enable Prometheus metric collection.
When creating a training job, enable Prometheus Metric Collection, set Metrics Collection Method to Command Line, and configure the following parameters:
- Command: Enter the Linux command for reading metrics, for example, cat.
- Command Parameters: Enter the path to the custom metric file, for example, /XXX/a.prom.
Make sure the command and its parameters return metrics reliably and within seconds. Otherwise, metric collection might fail.
- View custom monitoring metrics on AOM.
When a training job is running, log in to the AOM console and view the custom Prometheus metrics on the Metric Browsing page.
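Because the platform reads the metric file by running a command such as cat, the training code needs to refresh that file while the job runs. The sketch below shows one way to do so without ever exposing a half-written file. It rests on assumptions: the function name `write_metrics_atomically` is hypothetical, label values are assumed to contain no quotes, backslashes, or newlines, and the write-then-rename pattern is a suggestion rather than a documented requirement.

```python
import os
import tempfile
import time


def write_metrics_atomically(path, samples):
    """Write samples to path so that a concurrent read never sees a partial file.

    samples: iterable of (metric_name, labels_dict, value) tuples.
    """
    timestamp_ms = time.time_ns() // 1_000_000
    lines = []
    for name, labels, value in samples:
        label_text = ",".join(f'{k}="{v}"' for k, v in labels.items())
        lines.append(f"{name}{{{label_text}}} {value} {timestamp_ms}\n")

    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.writelines(lines)
        os.chmod(tmp_path, 0o644)   # mkstemp creates the file as owner-only.
        os.replace(tmp_path, path)  # Atomic rename on POSIX file systems.
    except BaseException:
        os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "test.prom")
    write_metrics_atomically(demo_path, [("loss", {"rank": "0"}, 0.6932)])
    with open(demo_path, encoding="utf-8") as f:
        print(f.read(), end="")
```

Calling the function once per logging interval from the training loop keeps the file current, and the configured command always reads either the previous complete file or the new one.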
Reporting Custom Metrics to AOM Using SDK
Integrate the SDK into your code to upload metric data to AOM manually.
This method works well for users who need to add custom metrics, like adding custom dimensions or complex calculations. It also works well for complex service logic and when you need to control metric reporting in your code.
- Add the metric monitoring code to the training code. You can refer to the sample code below. For details about other requirements for preparing training code, see Preparing Model Training Code.
Replace the region value cn-southwest-2 in the second-to-last line of the code with the actual region. For details about the region values, see Endpoints.
Add monitoring metrics to the code. For details about the parameters, see the AOM documentation.

      # coding: utf-8
      import os
      import time

      from huaweicloudsdkaom.v2 import *
      from huaweicloudsdkaom.v2.region.aom_region import AomRegion
      from huaweicloudsdkcore.auth.credentials import BasicCredentials
      from huaweicloudsdkcore.exceptions import exceptions
      from moxing.framework import cloud_utils


      def report2Aom(request, region):
          auth = cloud_utils.get_auth()
          # AK, SK, and temporary token, which are automatically obtained by the system.
          ak = auth.AK
          sk = auth.SK
          securityToken = auth.TOKEN
          projectId = os.environ.get("MA_IAM_PROJECT_ID")
          credentials = BasicCredentials(ak, sk, projectId).with_security_token(securityToken)
          client = AomClient.new_builder() \
              .with_credentials(credentials) \
              .with_region(AomRegion.value_of(region)) \
              .build()
          try:
              response = client.add_metric_data(request)
              print(response)
          except Exception as e:
              print(e)


      if __name__ == "__main__":
          request = AddMetricDataRequest()
          listValuesBody = [
              # Enter the metric name, type, unit, and value, such as step_time and loss value.
              ValueData(
                  metric_name="step_time",  # Monitoring metric name, for example, step_time
                  type="float",  # Data type of the metric. The value can only be int or float.
                  unit="ms",  # Data unit, for example, ms. The value contains a maximum of 32 characters.
                  value=135.572  # Value of the metric data. The value must be of a valid numeric type. The minimum value is 0.
              ),
              ValueData(
                  metric_name="loss",
                  type="float",
                  value=0.6932
              )
          ]
          listDimensionsMetric = [
              # Enter the metric dimensions you want to view, such as thread and host.
              Dimension2(
                  name="cluster_name",  # This is only an example. Replace it with the actual metric dimension you want to view.
                  value="fab2c5cf438b4f0c851fdcdf"  # This is only an example. Replace it with the actual parameter value.
              ),
              Dimension2(
                  name="user_name",
                  value="modelarts_02"  # This is only an example. Replace it with the actual parameter value.
              ),
              Dimension2(
                  name="user_id",
                  value="04f258c8fb00d42a1f6xxx"  # This is only an example. Replace it with the actual parameter value.
              )
          ]
          metricBody = MetricItemInfo(
              dimensions=listDimensionsMetric,
              namespace="NOPAAS.ESC"  # Retain the default value.
          )
          listBodybody = [
              MetricDataItem(
                  collect_time=int(round(time.time() * 1000)),  # Time when monitoring metric data is collected. The value is the latest timestamp, in milliseconds.
                  metric=metricBody,
                  values=listValuesBody
              )
          ]
          request.body = listBodybody
          region = "cn-southwest-2"  # Replace the value with the actual region.
          response = report2Aom(request, region)

- Add the commands below to the training code to load the required dependency packages. If a custom image is used, you can also install the dependencies during image creation. For details, see Developing Code for Training Using a Custom Image.

      pip install huaweicloudsdkaom
      pip install huaweicloudsdkcore

- Create and run a training job. For details, see Creating a Training Job.
- Log in to the AOM console. On the Metric Browsing page, view the reported metric data by specifying metrics.
- Configure AOM alarm and notification rules. For details, see Configuring Alarm Settings.
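The constraints noted in the sample code comments (type is int or float, unit has at most 32 characters, value is numeric and at least 0) can be checked locally before a request is sent. The following is a minimal sketch that does not call the SDK: the function name `check_value_data` is hypothetical, and rejecting empty names and NaN or infinite values is an assumption about what counts as a valid numeric value.

```python
import math


def check_value_data(metric_name, value_type, value, unit=None):
    """Return a list of problems with one metric value; an empty list means none were found."""
    problems = []
    if not metric_name:
        problems.append("metric_name is empty")
    if value_type not in ("int", "float"):
        problems.append(f"type must be 'int' or 'float', got {value_type!r}")
    if unit is not None and len(unit) > 32:
        problems.append("unit is longer than 32 characters")
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        problems.append("value is not numeric")
    elif isinstance(value, float) and not math.isfinite(value):
        problems.append("value is NaN or infinite")
    elif value < 0:
        problems.append("value is below the minimum of 0")
    return problems


print(check_value_data("step_time", "float", 135.572, unit="ms"))
print(check_value_data("loss", "float", float("nan")))
```

A check like this is most useful for loss values, which can become NaN when training diverges. Logging the problem list and skipping the upload keeps one bad sample from hiding the cause of a failed report.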
