Viewing Monitoring Metrics of a Training Job
Receiving and promptly addressing alarms during a training job (for example, abnormal loss values) can save significant time and resources, preventing the waste caused by invalid job runs. Additionally, metric monitoring allows you to track the training job's progress in real time and the model's training status across different phases.
In the Monitoring tab of the training job details page of ModelArts Standard, you can view the CPU, GPU, or NPU usage of training jobs. For details, see Viewable Metrics on the ModelArts Console.
For more metrics, go to the AOM console. For details, see Viewing All ModelArts Monitoring Metrics on the AOM Console.
- Reporting Custom Metrics to AOM Using SDK: Integrate the SDK into your code to upload metric data to AOM manually.
  - This method works well for users who need to add custom metrics, such as custom dimensions or complex calculations.
  - This method works well for complex service logic and when you need to control metric reporting in your code.
  - This method works well for users who have experience with Huawei Cloud SDKs and are familiar with cloud services.
Viewable Metrics on the ModelArts Console
Table 1 lists the training job metrics that can be viewed on the ModelArts console and where to find them.
Metric | Description | How to View |
---|---|---|
Training job resource usage | CPU, GPU, or NPU usage of a training job. | ModelArts console > Overview > Resource Usage of Training Jobs |
Card-hours | Running duration and number of cards used by a training job. | ModelArts console > Overview > Resource Usage of Training Jobs |
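Table 1 does not spell out how card-hours are calculated. As a rough illustration only, the sketch below assumes that card-hours are the running duration in hours multiplied by the number of cards. The function name `card_hours` is hypothetical, and the formula is an assumption rather than the console's documented calculation.

```python
def card_hours(running_seconds, num_cards):
    """Estimate card-hours for a job.

    Assumption: card-hours = running duration (hours) x number of cards.
    The console may round or aggregate differently.
    """
    if running_seconds < 0 or num_cards < 0:
        raise ValueError("running_seconds and num_cards must be non-negative")
    return running_seconds / 3600 * num_cards


if __name__ == "__main__":
    # A job that ran for 1.5 hours on 8 cards.
    print(card_hours(5400, 8))
```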
Reporting Custom Metrics to AOM Using SDK
Integrate the SDK into your code to upload metric data to AOM manually.
This method works well for users who need to add custom metrics, like adding custom dimensions or complex calculations. It also works well for complex service logic and when you need to control metric reporting in your code.
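One way to control metric reporting in your code is to report only on selected training steps rather than on every step. The sketch below is a minimal illustration of that idea in plain Python. It does not call the AOM SDK, and the names `ReportThrottle` and `steps_to_report` are hypothetical helpers, not part of any Huawei Cloud package.

```python
class ReportThrottle:
    """Decide whether metrics should be reported at a given training step.

    Hypothetical helper for illustration; not part of the AOM SDK.
    """

    def __init__(self, every_n_steps):
        if every_n_steps < 1:
            raise ValueError("every_n_steps must be at least 1")
        self.every_n_steps = every_n_steps

    def should_report(self, step):
        # Report on step 0 and on every N-th step after it.
        return step % self.every_n_steps == 0


def steps_to_report(total_steps, every_n_steps):
    """Return the step indices at which metrics would be reported."""
    throttle = ReportThrottle(every_n_steps)
    return [s for s in range(total_steps) if throttle.should_report(s)]


if __name__ == "__main__":
    print(steps_to_report(10, 4))
```

In a training loop, the call that uploads metric data would be placed behind `should_report(step)`, which keeps the number of requests sent to AOM proportional to the chosen interval instead of the total step count.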
- Add the metric monitoring code to the training code. You can refer to the sample code below. For details about other requirements for preparing training code, see Preparing Model Training Code.
Replace the region value cn-southwest-2, which is assigned to the region variable in the second-to-last line of the sample code, with your actual region. For details about the region values, see Endpoints.
Add monitoring metrics to the code. For details about the parameters, see the AOM documentation.
    # coding: utf-8
    import os
    import time  # Required for the collect_time timestamp below.

    from huaweicloudsdkaom.v2 import *
    from huaweicloudsdkaom.v2.region.aom_region import AomRegion
    from huaweicloudsdkcore.auth.credentials import BasicCredentials
    from huaweicloudsdkcore.exceptions import exceptions
    from moxing.framework import cloud_utils


    def report2Aom(request, region):
        auth = cloud_utils.get_auth()
        # AK, SK, and temporary token, which are automatically obtained by the system.
        ak = auth.AK
        sk = auth.SK
        securityToken = auth.TOKEN
        projectId = os.environ.get("MA_IAM_PROJECT_ID")
        credentials = BasicCredentials(ak, sk, projectId).with_security_token(securityToken)
        client = AomClient.new_builder() \
            .with_credentials(credentials) \
            .with_region(AomRegion.value_of(region)) \
            .build()

        try:
            response = client.add_metric_data(request)
            print(response)
            return response
        except Exception as e:
            # Print the error and return None so that a reporting failure does not stop training.
            print(e)
            return None


    if __name__ == "__main__":
        request = AddMetricDataRequest()
        listValuesBody = [
            # Enter the metric name, type, unit, and value, such as step_time and loss value.
            ValueData(
                metric_name="step_time",  # Monitoring metric name, for example, step_time.
                type="float",  # Data type of the metric. The value can only be int or float.
                unit="ms",  # Data unit, for example, ms. The value contains a maximum of 32 characters.
                value=135.572  # Value of the metric data. The value must be of a valid numeric type. The minimum value is 0.
            ),
            ValueData(
                metric_name="loss",
                type="float",
                value=0.6932
            )
        ]
        listDimensionsMetric = [
            # Enter the metric dimensions you want to view, such as thread and host.
            Dimension2(
                name="cluster_name",  # This is only an example. Replace it with the actual metric dimension you want to view.
                value="fab2c5cf438b4f0c851fdcdf"  # This is only an example. Replace it with the actual parameter value.
            ),
            Dimension2(
                name="user_name",
                value="modelarts_02"  # This is only an example. Replace it with the actual parameter value.
            ),
            Dimension2(
                name="user_id",
                value="04f258c8fb00d42a1f6xxx"  # This is only an example. Replace it with the actual parameter value.
            )
        ]
        metricBody = MetricItemInfo(
            dimensions=listDimensionsMetric,
            namespace="NOPAAS.ESC"  # Retain the default value.
        )
        listBodybody = [
            MetricDataItem(
                # Time when monitoring metric data is collected. The value is the latest timestamp, in milliseconds.
                collect_time=int(round(time.time() * 1000)),
                metric=metricBody,
                values=listValuesBody
            )
        ]
        request.body = listBodybody
        region = "cn-southwest-2"  # Replace the value with the actual region.
        response = report2Aom(request, region)
- Add the commands below to the training code to load the required dependency packages. If a custom image is used, you can also install the dependencies during image creation. For details, see Developing Code for Training Using a Custom Image.
    pip install huaweicloudsdkaom
    pip install huaweicloudsdkcore
- Create and run a training job. For details, see Creating a Production Training Job.
- Log in to the AOM console. On the Metric Browsing page, view the reported metric data by specifying metrics.
- Set the AOM alarm and notification mechanisms by referring to Configuring Alarm Settings.
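The comments in the sample code state a few constraints on reported values: the type can only be int or float, the unit contains at most 32 characters, and the minimum value is 0. The sketch below shows one possible way to check these constraints locally before a value is passed to ValueData. It is plain Python and does not call the AOM SDK. The function name `check_metric` is hypothetical, and the rejection of NaN and infinite values is an added assumption (useful for catching abnormal loss values), not a documented AOM rule.

```python
import math

ALLOWED_TYPES = ("int", "float")  # From the sample code comments.
MAX_UNIT_LENGTH = 32              # From the sample code comments.


def check_metric(metric_name, value_type, value, unit=None):
    """Return a list of problems found in one metric value (empty if none).

    Hypothetical pre-check for illustration; not part of the AOM SDK.
    """
    problems = []
    if not metric_name:
        problems.append("metric_name is empty")
    if value_type not in ALLOWED_TYPES:
        problems.append("type must be int or float")
    if unit is not None and len(unit) > MAX_UNIT_LENGTH:
        problems.append("unit is longer than 32 characters")
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        problems.append("value is not numeric")
    elif math.isnan(value) or math.isinf(value):
        # Assumption: treat NaN and infinity as invalid, for example a diverged loss.
        problems.append("value is NaN or infinite")
    elif value < 0:
        problems.append("value is below the minimum of 0")
    return problems


if __name__ == "__main__":
    print(check_metric("step_time", "float", 135.572, "ms"))
    print(check_metric("loss", "float", float("nan")))
```

Returning a list of problems instead of raising an exception lets the training loop decide whether to skip the report, log the issue, or stop the job.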