# Viewing Lite Cluster Monitoring Metrics on AOM
## Monitoring Existing Metrics
ModelArts periodically collects the usage of key resources (such as GPUs, NPUs, CPUs, and memory) on each node in the resource pool and reports the data to AOM. You can view the default basic metrics on AOM as follows:
- Log in to the console and search for AOM to go to the AOM console.
- Choose Monitoring > Metric Monitoring. On the displayed page, click Add Metric.

Figure 1 Example

- Add the metric to be queried.
- Click Confirm. The metric information is displayed.
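Besides the console procedure above, AOM also exposes a Prometheus-compatible query API, so the metrics listed below can be retrieved programmatically. The sketch below is a minimal illustration, not the official SDK: the query path follows the standard Prometheus HTTP API, while the base URL and the X-Auth-Token header are assumptions that you should verify against the AOM API reference. The response-parsing helper reflects the standard Prometheus instant-query response shape.

```python
import json
import urllib.parse
import urllib.request

def parse_instant_vector(resp: dict) -> list:
    """Extract (labels, value) pairs from a Prometheus instant-query response."""
    if resp.get("status") != "success":
        raise RuntimeError(f"query failed: {resp}")
    return [(r["metric"], float(r["value"][1]))
            for r in resp["data"]["result"]]

def query_metric(base_url: str, promql: str, token: str) -> list:
    """Run an instant PromQL query against a Prometheus-compatible endpoint.

    base_url and the X-Auth-Token header are assumptions; consult the AOM
    API reference for the actual endpoint and authentication scheme.
    """
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    req = urllib.request.Request(url, headers={"X-Auth-Token": token})
    with urllib.request.urlopen(req) as f:
        return parse_instant_vector(json.load(f))

# A canned response, so the parsing step can be seen without network access.
sample = {
    "status": "success",
    "data": {"result": [
        {"metric": {"pod_name": "train-0"}, "value": [1686660980.0, "87.5"]},
    ]},
}
print(parse_instant_vector(sample))  # [({'pod_name': 'train-0'}, 87.5)]
```

A real call would then be, for example, `query_metric(base_url, 'ma_container_gpu_util', token)` with your own endpoint and token.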
Table 1 Container metrics

| Classification | Name | Metric | Description | Unit | Value Range |
|---|---|---|---|---|---|
| CPU | CPU Usage | ma_container_cpu_util | CPU usage of a measured object | % | 0%–100% |
| CPU | Used CPU Cores | ma_container_cpu_used_core | Number of CPU cores used by a measured object | Core | ≥0 |
| CPU | Total CPU Cores | ma_container_cpu_limit_core | Total number of CPU cores that have been applied for a measured object | Core | ≥1 |
| Memory | Total Physical Memory | ma_container_memory_capacity_megabytes | Total physical memory that has been applied for a measured object | MB | ≥0 |
| Memory | Physical Memory Usage | ma_container_memory_util | Percentage of the used physical memory to the total physical memory | % | 0%–100% |
| Memory | Used Physical Memory | ma_container_memory_used_megabytes | Physical memory that has been used by a measured object (container_memory_working_set_bytes in the current working set; working-set usage = active anonymous pages and cache, and file-backed pages ≤ container_memory_usage_bytes) | MB | ≥0 |
| Storage | Disk Read Rate | ma_container_disk_read_kilobytes | Volume of data read from a disk per second | KB/s | ≥0 |
| Storage | Disk Write Rate | ma_container_disk_write_kilobytes | Volume of data written into a disk per second | KB/s | ≥0 |
| GPU memory | Total GPU Memory | ma_container_gpu_mem_total_megabytes | Total GPU memory of a training job | MB | >0 |
| GPU memory | GPU Memory Usage | ma_container_gpu_mem_util | Percentage of the used GPU memory to the total GPU memory | % | 0%–100% |
| GPU memory | Used GPU Memory | ma_container_gpu_mem_used_megabytes | GPU memory used by a measured object | MB | ≥0 |
| GPU | GPU Usage | ma_container_gpu_util | GPU usage of a measured object | % | 0%–100% |
| GPU | GPU Memory Bandwidth Usage | ma_container_gpu_mem_copy_util | GPU memory bandwidth usage of a measured object. For example, the maximum memory bandwidth of an NVIDIA GP Vnt1 GPU is 900 GB/s; if the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%. | % | 0%–100% |
| GPU | GPU Encoder Usage | ma_container_gpu_enc_util | GPU encoder usage of a measured object | % | 0%–100% |
| GPU | GPU Decoder Usage | ma_container_gpu_dec_util | GPU decoder usage of a measured object | % | 0%–100% |
| GPU | GPU Temperature | DCGM_FI_DEV_GPU_TEMP | GPU temperature | °C | Natural number |
| GPU | GPU Power | DCGM_FI_DEV_POWER_USAGE | GPU power | Watt (W) | >0 |
| GPU | GPU Memory Temperature | DCGM_FI_DEV_MEMORY_TEMP | GPU memory temperature | °C | Natural number |
| Network I/O | Downlink Rate | ma_container_network_receive_bytes | Inbound traffic rate of a measured object | Bytes/s | ≥0 |
| Network I/O | Packet Receive Rate | ma_container_network_receive_packets | Number of data packets received by a NIC per second | Packets/s | ≥0 |
| Network I/O | Downlink Error Rate | ma_container_network_receive_error_packets | Number of error packets received by a NIC per second | Packets/s | ≥0 |
| Network I/O | Uplink Rate | ma_container_network_transmit_bytes | Outbound traffic rate of a measured object | Bytes/s | ≥0 |
| Network I/O | Uplink Error Rate | ma_container_network_transmit_error_packets | Number of error packets sent by a NIC per second | Packets/s | ≥0 |
| Network I/O | Packet Send Rate | ma_container_network_transmit_packets | Number of data packets sent by a NIC per second | Packets/s | ≥0 |
| NPU | NPU Usage | ma_container_npu_util | NPU usage of a measured object (to be replaced by ma_container_npu_ai_core_util) | % | 0%–100% |
| NPU | NPU Memory Usage | ma_container_npu_memory_util | Percentage of the used NPU memory to the total NPU memory (to be replaced by ma_container_npu_ddr_memory_util for the snt3 series and ma_container_npu_hbm_util for the snt9 series) | % | 0%–100% |
| NPU | Used NPU Memory | ma_container_npu_memory_used_megabytes | NPU memory used by a measured object (to be replaced by ma_container_npu_ddr_memory_usage_bytes for the snt3 series and ma_container_npu_hbm_usage_bytes for the snt9 series) | MB | ≥0 |
| NPU | Total NPU Memory | ma_container_npu_memory_total_megabytes | Total NPU memory of a measured object (to be replaced by ma_container_npu_ddr_memory_bytes for the snt3 series and ma_container_npu_hbm_bytes for the snt9 series) | MB | >0 |
| NPU | AI Processor Error Codes | ma_container_npu_ai_core_error_code | Error codes of Ascend AI processors | N/A | N/A |
| NPU | AI Processor Health Status | ma_container_npu_ai_core_health_status | Health status of Ascend AI processors | N/A | N/A |
| NPU | AI Processor Power Consumption | ma_container_npu_ai_core_power_usage_watts | Power consumption of Ascend AI processors (processor power consumption for snt9 and snt3, and card power consumption for snt3P) | Watt (W) | >0 |
| NPU | AI Processor Temperature | ma_container_npu_ai_core_temperature_celsius | Temperature of Ascend AI processors | °C | Natural number |
| NPU | AI Core Usage | ma_container_npu_ai_core_util | AI core usage of Ascend AI processors | % | 0%–100% |
| NPU | AI Core Clock Frequency | ma_container_npu_ai_core_frequency_hertz | AI core clock frequency of Ascend AI processors | Hertz (Hz) | >0 |
| NPU | AI Processor Voltage | ma_container_npu_ai_core_voltage_volts | Voltage of Ascend AI processors | Volt (V) | Natural number |
| NPU | AI Processor DDR Memory | ma_container_npu_ddr_memory_bytes | Total DDR memory capacity of Ascend AI processors | Byte | >0 |
| NPU | AI Processor DDR Usage | ma_container_npu_ddr_memory_usage_bytes | DDR memory usage of Ascend AI processors | Byte | >0 |
| NPU | AI Processor DDR Memory Utilization | ma_container_npu_ddr_memory_util | DDR memory utilization of Ascend AI processors | % | 0%–100% |
| NPU | AI Processor HBM Memory | ma_container_npu_hbm_bytes | Total HBM memory of Ascend AI processors (dedicated for Ascend snt9 processors) | Byte | >0 |
| NPU | AI Processor HBM Memory Usage | ma_container_npu_hbm_usage_bytes | HBM memory usage of Ascend AI processors (dedicated for Ascend snt9 processors) | Byte | >0 |
| NPU | AI Processor HBM Memory Utilization | ma_container_npu_hbm_util | HBM memory utilization of Ascend AI processors (dedicated for Ascend snt9 processors) | % | 0%–100% |
| NPU | AI Processor HBM Memory Bandwidth Utilization | ma_container_npu_hbm_bandwidth_util | HBM memory bandwidth utilization of Ascend AI processors (dedicated for Ascend snt9 processors) | % | 0%–100% |
| NPU | AI Processor HBM Memory Clock Frequency | ma_container_npu_hbm_frequency_hertz | HBM memory clock frequency of Ascend AI processors (dedicated for Ascend snt9 processors) | Hertz (Hz) | >0 |
| NPU | AI Processor HBM Memory Temperature | ma_container_npu_hbm_temperature_celsius | HBM memory temperature of Ascend AI processors (dedicated for Ascend snt9 processors) | °C | Natural number |
| NPU | AI CPU Utilization | ma_container_npu_ai_cpu_util | AI CPU utilization of Ascend AI processors | % | 0%–100% |
| NPU | AI Processor Control CPU Utilization | ma_container_npu_ctrl_cpu_util | Control CPU utilization of Ascend AI processors | % | 0%–100% |
Table 2 Node metrics

| Classification | Name | Metric | Description | Unit | Value Range |
|---|---|---|---|---|---|
| CPU | Total CPU Cores | ma_node_cpu_limit_core | Total number of CPU cores that have been applied for a measured object | Core | ≥1 |
| CPU | Used CPU Cores | ma_node_cpu_used_core | Number of CPU cores used by a measured object | Core | ≥0 |
| CPU | CPU Usage | ma_node_cpu_util | CPU usage of a measured object | % | 0%–100% |
| CPU | CPU I/O Wait Time | ma_node_cpu_iowait_counter | Disk I/O wait time accumulated since system startup | jiffies | ≥0 |
| Memory | Physical Memory Usage | ma_node_memory_util | Percentage of the used physical memory to the total physical memory | % | 0%–100% |
| Memory | Total Physical Memory | ma_node_memory_total_megabytes | Total physical memory that has been applied for a measured object | MB | ≥0 |
| Network I/O | Downlink Rate (BPS) | ma_node_network_receive_rate_bytes_seconds | Inbound traffic rate of a measured object | Bytes/s | ≥0 |
| Network I/O | Uplink Rate (BPS) | ma_node_network_transmit_rate_bytes_seconds | Outbound traffic rate of a measured object | Bytes/s | ≥0 |
| Storage | Disk Read Rate | ma_node_disk_read_rate_kilobytes_seconds | Volume of data read from a disk per second (only data disks used by containers are collected) | KB/s | ≥0 |
| Storage | Disk Write Rate | ma_node_disk_write_rate_kilobytes_seconds | Volume of data written into a disk per second (only data disks used by containers are collected) | KB/s | ≥0 |
| Storage | Total Cache | ma_node_cache_space_capacity_megabytes | Total cache of the Kubernetes space | MB | ≥0 |
| Storage | Used Cache | ma_node_cache_space_used_capacity_megabytes | Used cache of the Kubernetes space | MB | ≥0 |
| Storage | Total Container Space | ma_node_container_space_capacity_megabytes | Total container space | MB | ≥0 |
| Storage | Used Container Space | ma_node_container_space_used_capacity_megabytes | Used container space | MB | ≥0 |
| GPU | GPU Usage | ma_node_gpu_util | GPU usage of a measured object | % | 0%–100% |
| GPU | Total GPU Memory | ma_node_gpu_mem_total_megabytes | Total GPU memory of a measured object | MB | >0 |
| GPU | GPU Memory Usage | ma_node_gpu_mem_util | Percentage of the used GPU memory to the total GPU memory | % | 0%–100% |
| GPU | Used GPU Memory | ma_node_gpu_mem_used_megabytes | GPU memory used by a measured object | MB | ≥0 |
| GPU | Tasks on a Shared GPU | node_gpu_share_job_count | Number of tasks running on a shared GPU | Number | ≥0 |
| GPU | GPU Temperature | DCGM_FI_DEV_GPU_TEMP | GPU temperature | °C | Natural number |
| GPU | GPU Power | DCGM_FI_DEV_POWER_USAGE | GPU power | Watt (W) | >0 |
| GPU | GPU Memory Temperature | DCGM_FI_DEV_MEMORY_TEMP | GPU memory temperature | °C | Natural number |
| NPU | NPU Usage | ma_node_npu_util | NPU usage of a measured object (to be replaced by ma_node_npu_ai_core_util) | % | 0%–100% |
| NPU | NPU Memory Usage | ma_node_npu_memory_util | Percentage of the used NPU memory to the total NPU memory (to be replaced by ma_node_npu_ddr_memory_util for the snt3 series and ma_node_npu_hbm_util for the snt9 series) | % | 0%–100% |
| NPU | Used NPU Memory | ma_node_npu_memory_used_megabytes | NPU memory used by a measured object (to be replaced by ma_node_npu_ddr_memory_usage_bytes for the snt3 series and ma_node_npu_hbm_usage_bytes for the snt9 series) | MB | ≥0 |
| NPU | Total NPU Memory | ma_node_npu_memory_total_megabytes | Total NPU memory of a measured object (to be replaced by ma_node_npu_ddr_memory_bytes for the snt3 series and ma_node_npu_hbm_bytes for the snt9 series) | MB | >0 |
| NPU | AI Processor Error Codes | ma_node_npu_ai_core_error_code | Error codes of Ascend AI processors | N/A | N/A |
| NPU | AI Processor Health Status | ma_node_npu_ai_core_health_status | Health status of Ascend AI processors | N/A | N/A |
| NPU | AI Processor Power Consumption | ma_node_npu_ai_core_power_usage_watts | Power consumption of Ascend AI processors (processor power consumption for snt9 and snt3, and card power consumption for snt3P) | Watt (W) | >0 |
| NPU | AI Processor Temperature | ma_node_npu_ai_core_temperature_celsius | Temperature of Ascend AI processors | °C | Natural number |
| NPU | AI Processor Fan Speed | ma_node_npu_fan_speed_rpm | Fan speed of Ascend AI processors | RPM | Natural number |
| NPU | AI Core Usage | ma_node_npu_ai_core_util | AI core usage of Ascend AI processors | % | 0%–100% |
| NPU | AI Core Clock Frequency | ma_node_npu_ai_core_frequency_hertz | AI core clock frequency of Ascend AI processors | Hertz (Hz) | >0 |
| NPU | AI Processor Voltage | ma_node_npu_ai_core_voltage_volts | Voltage of Ascend AI processors | Volt (V) | Natural number |
| NPU | AI Processor DDR Memory | ma_node_npu_ddr_memory_bytes | Total DDR memory capacity of Ascend AI processors | Byte | >0 |
| NPU | AI Processor DDR Usage | ma_node_npu_ddr_memory_usage_bytes | DDR memory usage of Ascend AI processors | Byte | >0 |
| NPU | AI Processor DDR Memory Utilization | ma_node_npu_ddr_memory_util | DDR memory utilization of Ascend AI processors | % | 0%–100% |
| NPU | AI Processor HBM Memory | ma_node_npu_hbm_bytes | Total HBM memory of Ascend AI processors (dedicated for Ascend snt9 processors) | Byte | >0 |
| NPU | AI Processor HBM Memory Usage | ma_node_npu_hbm_usage_bytes | HBM memory usage of Ascend AI processors (dedicated for Ascend snt9 processors) | Byte | >0 |
| NPU | AI Processor HBM Memory Utilization | ma_node_npu_hbm_util | HBM memory utilization of Ascend AI processors (dedicated for Ascend snt9 processors) | % | 0%–100% |
| NPU | AI Processor HBM Memory Bandwidth Utilization | ma_node_npu_hbm_bandwidth_util | HBM memory bandwidth utilization of Ascend AI processors (dedicated for Ascend snt9 processors) | % | 0%–100% |
| NPU | AI Processor HBM Memory Clock Frequency | ma_node_npu_hbm_frequency_hertz | HBM memory clock frequency of Ascend AI processors (dedicated for Ascend snt9 processors) | Hertz (Hz) | >0 |
| NPU | AI Processor HBM Memory Temperature | ma_node_npu_hbm_temperature_celsius | HBM memory temperature of Ascend AI processors (dedicated for Ascend snt9 processors) | °C | Natural number |
| NPU | AI CPU Utilization | ma_node_npu_ai_cpu_util | AI CPU utilization of Ascend AI processors | % | 0%–100% |
| NPU | AI Processor Control CPU Utilization | ma_node_npu_ctrl_cpu_util | Control CPU utilization of Ascend AI processors | % | 0%–100% |
| InfiniBand or RoCE network | Total Amount of Data Received by a NIC | ma_node_infiniband_port_received_data_bytes_total | Total number of data octets, divided by 4 (counted in 32-bit double words), received on all VLs of the port | Double words (32-bit) | ≥0 |
| InfiniBand or RoCE network | Total Amount of Data Sent by a NIC | ma_node_infiniband_port_transmitted_data_bytes_total | Total number of data octets, divided by 4 (counted in 32-bit double words), transmitted on all VLs of the port | Double words (32-bit) | ≥0 |
| NFS mounting status | NFS Getattr Congestion Time | ma_node_mountstats_getattr_backlog_wait | Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥0 |
| NFS mounting status | NFS Getattr Round Trip Time | ma_node_mountstats_getattr_rtt | Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥0 |
| NFS mounting status | NFS Access Congestion Time | ma_node_mountstats_access_backlog_wait | Access is an NFS operation that checks the access permissions of a file or directory for a given user. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥0 |
| NFS mounting status | NFS Access Round Trip Time | ma_node_mountstats_access_rtt | Access is an NFS operation that checks the access permissions of a file or directory for a given user. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥0 |
| NFS mounting status | NFS Lookup Congestion Time | ma_node_mountstats_lookup_backlog_wait | Lookup is an NFS operation that resolves a file name in a directory to a file handle. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥0 |
| NFS mounting status | NFS Lookup Round Trip Time | ma_node_mountstats_lookup_rtt | Lookup is an NFS operation that resolves a file name in a directory to a file handle. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥0 |
| NFS mounting status | NFS Read Congestion Time | ma_node_mountstats_read_backlog_wait | Read is an NFS operation that reads data from a file. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥0 |
| NFS mounting status | NFS Read Round Trip Time | ma_node_mountstats_read_rtt | Read is an NFS operation that reads data from a file. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥0 |
| NFS mounting status | NFS Write Congestion Time | ma_node_mountstats_write_backlog_wait | Write is an NFS operation that writes data to a file. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥0 |
| NFS mounting status | NFS Write Round Trip Time | ma_node_mountstats_write_rtt | Write is an NFS operation that writes data to a file. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥0 |
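Several NPU metrics in the tables above are marked "to be replaced". A small lookup helper, built only from those table notes, can translate a legacy metric name into its replacement for a given Ascend series; the series strings `snt3` and `snt9` follow the table wording, and nothing here is an official API:

```python
# Replacements listed in the tables above: legacy metric -> replacement,
# keyed by Ascend series where the replacement differs (snt3 uses DDR
# metrics, snt9 uses HBM metrics). "*" means series-independent.
NPU_METRIC_REPLACEMENTS = {
    "ma_node_npu_util": {"*": "ma_node_npu_ai_core_util"},
    "ma_node_npu_memory_util": {
        "snt3": "ma_node_npu_ddr_memory_util",
        "snt9": "ma_node_npu_hbm_util",
    },
    "ma_node_npu_memory_used_megabytes": {
        "snt3": "ma_node_npu_ddr_memory_usage_bytes",
        "snt9": "ma_node_npu_hbm_usage_bytes",
    },
    "ma_node_npu_memory_total_megabytes": {
        "snt3": "ma_node_npu_ddr_memory_bytes",
        "snt9": "ma_node_npu_hbm_bytes",
    },
}

def current_metric(name: str, series: str) -> str:
    """Return the replacement metric name, or the name itself if not deprecated.

    Works for the container-level names too, since they differ from the
    node-level names only by the ma_container_/ma_node_ prefix.
    """
    prefix = "ma_container_" if name.startswith("ma_container_") else "ma_node_"
    key = "ma_node_" + name.removeprefix(prefix)
    repl = NPU_METRIC_REPLACEMENTS.get(key)
    if repl is None:
        return name  # not deprecated
    new = repl.get(series, repl.get("*"))
    if new is None:
        return name  # no replacement defined for this series
    return prefix + new.removeprefix("ma_node_")

print(current_metric("ma_container_npu_memory_util", "snt9"))  # ma_container_npu_hbm_util
print(current_metric("ma_node_npu_util", "snt3"))              # ma_node_npu_ai_core_util
```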
Table 3 Metric labels

| Classification | Label | Description |
|---|---|---|
| Container metrics | pod_name | Name of the pod to which the container belongs |
| Container metrics | pod_id | ID of the pod to which the container belongs |
| Container metrics | node_ip | IP address of the node to which the container belongs |
| Container metrics | container_id | Container ID |
| Container metrics | cluster_id | Cluster ID |
| Container metrics | cluster_name | Cluster name |
| Container metrics | container_name | Name of the container |
| Container metrics | namespace | Namespace of the pod created by the user |
| Container metrics | app_kind | Obtained from the kind field in the first ownerReferences |
| Container metrics | app_id | Obtained from the uid field in the first ownerReferences |
| Container metrics | app_name | Obtained from the name field in the first ownerReferences |
| Container metrics | npu_id | Ascend card ID, for example, davinci0 (to be deprecated) |
| Container metrics | device_id | Physical ID of Ascend AI processors |
| Container metrics | device_type | Type of Ascend AI processors |
| Container metrics | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool |
| Container metrics | pool_name | Name of the resource pool corresponding to a physical dedicated resource pool |
| Container metrics | gpu_uuid | UUID of the GPU used by the container |
| Container metrics | gpu_index | Index of the GPU used by the container |
| Container metrics | gpu_type | Type of the GPU used by the container |
| Node metrics | cluster_id | ID of the CCE cluster to which the node belongs |
| Node metrics | node_ip | IP address of the node |
| Node metrics | host_name | Hostname of the node |
| Node metrics | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool |
| Node metrics | project_id | Project ID of the user in a physical dedicated resource pool |
| Node metrics | npu_id | Ascend card ID, for example, davinci0 (to be deprecated) |
| Node metrics | device_id | Physical ID of Ascend AI processors |
| Node metrics | device_type | Type of Ascend AI processors |
| Node metrics | gpu_uuid | UUID of a node GPU |
| Node metrics | gpu_index | Index of a node GPU |
| Node metrics | gpu_type | Type of a node GPU |
| Node metrics | device_name | Device name of an InfiniBand or RoCE network NIC |
| Node metrics | port | Port number of the IB NIC |
| Node metrics | physical_state | Status of each port on the IB NIC |
| Node metrics | firmware_version | Firmware version of the InfiniBand NIC |
| Node metrics | filesystem | NFS-mounted file system |
| Node metrics | mount_point | NFS mount point |
| Diagnosis | cluster_id | ID of the CCE cluster to which the node equipped with the GPU belongs |
| Diagnosis | node_ip | IP address of the node where the GPU resides |
| Diagnosis | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool |
| Diagnosis | project_id | Project ID of the user in a physical dedicated resource pool |
| Diagnosis | gpu_uuid | GPU UUID |
| Diagnosis | gpu_index | Index of a node GPU |
| Diagnosis | gpu_type | Type of a node GPU |
| Diagnosis | device_name | Device name of an InfiniBand or RoCE network NIC |
| Diagnosis | port | Port number of the IB NIC |
| Diagnosis | physical_state | Status of each port on the IB NIC |
| Diagnosis | firmware_version | Firmware version of the InfiniBand NIC |
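When consuming these metrics programmatically, the labels above arrive as key-value pairs attached to each sample, so a common task is selecting the samples for one resource pool, device, or pod. A minimal filter sketch (the sample data below is invented for illustration):

```python
def filter_samples(samples, **labels):
    """Keep samples whose label sets contain every given key-value pair."""
    return [s for s in samples
            if all(s.get("labels", {}).get(k) == v for k, v in labels.items())]

# Hypothetical samples carrying the pool_id and device_id labels from Table 3.
samples = [
    {"metric": "ma_container_npu_util",
     "labels": {"pool_id": "pool-a", "device_id": "0"}, "value": 73.0},
    {"metric": "ma_container_npu_util",
     "labels": {"pool_id": "pool-b", "device_id": "1"}, "value": 12.0},
]
print(filter_samples(samples, pool_id="pool-a"))  # -> the pool-a sample only
```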
## Monitoring Custom Metrics
ModelArts allows you to run commands to save custom metrics to AOM.
### Constraints
- ModelArts invokes the commands or HTTP APIs specified in the custom configuration every 10 seconds to retrieve metric data.
- The size of the metric data text returned by these commands or HTTP APIs must not exceed 8 KB.
### Collecting Custom Metric Data Using Commands
The following is an example of the YAML file for creating a pod that collects custom metrics:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-task
  annotations:
    # Set containerName to the container from which metric data is obtained,
    # and command to the command used to obtain it.
    ei.huaweicloud.com/metrics: '{"customMetrics":[{"containerName":"my-task","exec":{"command":["cat","/metrics/task.prom"]}}]}'
spec:
  containers:
  - name: my-task
    image: my-task-image:latest  # Replace with the actual image.
```
Note: The service workload and custom metric collection can share the same container. Alternatively, use a sidecar container to collect metric data and designate it as the custom metric collection container, so that the resources of the service workload container remain unaffected.
### Data Format of Custom Metrics
The custom metric data must comply with the open metrics specification, that is, each metric must use the following format:

```
<Metric name>{<Tag name>=<Tag value>, ...} <Sampled value> [Millisecond timestamp]
```
The following is an example (lines starting with # are optional comments):

```
# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total gauge
http_requests_total{method="post",code="200"} 1656 1686660980680
http_requests_total{method="post",code="400"} 2 1686660980681
```
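Putting the pieces together, a collection container only needs to keep a file such as /metrics/task.prom (the path read by the cat command in the pod example above) up to date in this format. The sketch below renders metric lines, enforces the 8 KB limit from the constraints, and writes the file atomically so the collector never reads a partially written file; the metric names and values are invented for illustration:

```python
import os
import tempfile
import time

MAX_BYTES = 8 * 1024  # the metric text returned to AOM must not exceed 8 KB

def render_metrics(metrics):
    """Render (name, labels, value) triples in the exposition format:
    <name>{<k>="<v>",...} <value> <millisecond timestamp>"""
    ts = int(time.time() * 1000)
    lines = []
    for name, labels, value in metrics:
        label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
        lines.append(f"{name}{{{label_str}}} {value} {ts}")
    text = "\n".join(lines) + "\n"
    if len(text.encode("utf-8")) > MAX_BYTES:
        raise ValueError("metric payload exceeds the 8 KB limit")
    return text

def write_atomically(path, text):
    """Write via a temp file + rename so readers never see a partial file."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(text)
    os.replace(tmp, path)

# Example: one hypothetical gauge with two label sets.
text = render_metrics([
    ("task_queue_depth", {"stage": "preprocess"}, 4),
    ("task_queue_depth", {"stage": "train"}, 1),
])
write_atomically("task.prom", text)
```

In a real sidecar this pair of calls would run in a loop; since ModelArts polls every 10 seconds, refreshing the file at roughly that interval is sufficient.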