Viewing All ModelArts Monitoring Metrics on the AOM Console
ModelArts periodically collects usage data for key metrics (such as GPU, NPU, CPU, and memory) on each node in a resource pool, as well as key metrics of development environments, training jobs, and inference services, and reports the data to AOM. You can view this data on the AOM console.
Viewing Monitoring Metrics on the AOM Console
- Log in to the console and search for AOM to go to the AOM console.
- In the navigation pane on the left, choose Metric Browsing.
- Select the Prometheus_AOM_Default instance from the drop-down list.
Figure 1 Specifying the metric source
- Select one or more metrics from All metrics or Prometheus statement.
Figure 2 Adding a metric
For details about how to view metrics, see Application Operations Management > User Guide (2.0) > Metric Browsing in the Huawei Cloud Help Center.
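The Prometheus statement box accepts standard PromQL, so any metric listed in the tables below can also be queried through a Prometheus-compatible HTTP API (`/api/v1/query` is the standard Prometheus instant-query path). The sketch below builds such a query URL with Python's standard library; the endpoint address and the aggregation label are placeholders, not real AOM values.

```python
from urllib.parse import urlencode

def build_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL.

    /api/v1/query is the standard Prometheus HTTP API path; the base
    URL here is a placeholder, not an actual AOM endpoint.
    """
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

# Hypothetical example: average container CPU usage per pod, using the
# ma_container_cpu_util metric from the container-level table below.
url = build_query_url(
    "https://prometheus.example.com",       # placeholder endpoint
    "avg(ma_container_cpu_util) by (pod)",  # PromQL over a ModelArts metric
)
print(url)
```

The same PromQL string can be pasted directly into the Prometheus statement box on the Metric Browsing page.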
Container-level Metrics
Category |
Name |
Metric |
Description |
Unit |
Value Range |
Alarm Threshold |
Alarm Severity |
Solution |
---|---|---|---|---|---|---|---|---|
CPU |
CPU Usage |
ma_container_cpu_util |
CPU usage of a measured object |
% |
0%–100% |
Raw data > 95% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
Used CPU Cores |
ma_container_cpu_used_core |
Number of CPU cores used by a measured object |
Cores |
≥ 0 |
N/A |
N/A |
N/A |
|
Total CPU Cores |
ma_container_cpu_limit_core |
Total number of CPU cores that have been requested for a measured object |
Cores |
≥ 1 |
N/A |
N/A |
N/A |
|
GPU Memory Usage |
ma_container_gpu_mem_util |
Percentage of the used GPU memory to the total GPU memory |
% |
0%–100% |
Raw data > 95% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
|
Memory |
Total Physical Memory |
ma_container_memory_capacity_megabytes |
Total physical memory that has been requested for a measured object |
MB |
≥ 0 |
N/A |
N/A |
N/A |
Physical Memory Usage |
ma_container_memory_util |
Percentage of the used physical memory to the total physical memory |
% |
0%–100% |
Raw data > 95% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
|
Used Physical Memory |
ma_container_memory_used_megabytes |
Physical memory that has been used by a measured object (container_memory_working_set_bytes, the current working set) (Memory usage in a working set = active anonymous pages and cache, plus file-backed pages; this value is ≤ container_memory_usage_bytes) |
MB |
≥ 0 |
N/A |
N/A |
N/A |
|
Storage |
Disk Read Rate |
ma_container_disk_read_kilobytes |
Volume of data read from a disk per second |
KB/s |
≥ 0 |
N/A |
N/A |
N/A |
Disk Write Rate |
ma_container_disk_write_kilobytes |
Volume of data written into a disk per second |
KB/s |
≥ 0 |
N/A |
N/A |
N/A |
|
GPU memory |
Total GPU Memory |
ma_container_gpu_mem_total_megabytes |
Total GPU memory of a training job |
MB |
> 0 |
N/A |
N/A |
N/A |
GPU Memory Usage |
ma_container_gpu_mem_util |
Percentage of the used GPU memory to the total GPU memory |
% |
0%–100% |
N/A |
N/A |
N/A |
|
Used GPU Memory |
ma_container_gpu_mem_used_megabytes |
GPU memory used by a measured object |
MB |
≥ 0 |
N/A |
N/A |
N/A |
|
Idle GPU Memory |
ma_container_gpu_mem_free_megabytes |
Idle GPU memory of a measured object |
MB |
≥ 0 |
N/A |
N/A |
N/A |
|
GPU |
GPU Usage |
ma_container_gpu_util |
GPU usage of a measured object |
% |
0%–100% |
Raw data > 95% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
GPU Memory Bandwidth Usage |
ma_container_gpu_mem_copy_util |
GPU memory bandwidth usage of a measured object. For example, the maximum memory bandwidth of GPU Vnt1 is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%. |
% |
0%–100% |
N/A |
N/A |
N/A |
|
GPU Encoder Usage |
ma_container_gpu_enc_util |
GPU encoder usage of a measured object |
% |
0%–100% |
N/A |
N/A |
N/A |
|
GPU Decoder Usage |
ma_container_gpu_dec_util |
GPU decoder usage of a measured object |
% |
0%–100% |
N/A |
N/A |
N/A |
|
GPU Temperature |
DCGM_FI_DEV_GPU_TEMP |
GPU temperature |
°C |
Natural number |
N/A |
N/A |
N/A |
|
GPU Power |
DCGM_FI_DEV_POWER_USAGE |
GPU power |
Watt (W) |
> 0 |
N/A |
N/A |
N/A |
|
GPU Memory Temperature |
DCGM_FI_DEV_MEMORY_TEMP |
GPU memory temperature |
°C |
Natural number |
N/A |
N/A |
N/A |
|
Network I/O |
Downlink Rate |
ma_container_network_receive_bytes |
Inbound traffic rate of a measured object |
Bytes/s |
≥ 0 |
N/A |
N/A |
N/A |
Packet RX Rate |
ma_container_network_receive_packets |
Number of data packets received by a NIC per second |
Packets/s |
≥ 0 |
N/A |
N/A |
N/A |
|
Downlink Error Rate |
ma_container_network_receive_error_packets |
Number of error packets received by a NIC per second |
Packets/s |
≥ 0 |
Raw data > 1 for two consecutive periods |
Critical |
Packet loss occurred on the network. Submit a service ticket and contact O&M support to locate the fault. |
|
Uplink Rate |
ma_container_network_transmit_bytes |
Outbound traffic rate of a measured object |
Bytes/s |
≥ 0 |
N/A |
N/A |
N/A |
|
Uplink Error Rate |
ma_container_network_transmit_error_packets |
Number of error packets sent by a NIC per second |
Packets/s |
≥ 0 |
Raw data > 1 for two consecutive periods |
Critical |
Packet loss occurred on the network. Submit a service ticket and contact O&M support to locate the fault. |
|
Packet TX Rate |
ma_container_network_transmit_packets |
Number of data packets sent by a NIC per second |
Packets/s |
≥ 0 |
N/A |
N/A |
N/A |
|
NPU |
NPU Usage |
ma_container_npu_util |
NPU usage of a measured object (To be replaced by ma_container_npu_ai_core_util) |
% |
0%–100% |
Raw data > 95% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
NPU Memory Usage |
ma_container_npu_memory_util |
Percentage of the used NPU memory to the total NPU memory (To be replaced by ma_container_npu_ddr_memory_util for Snt3 series, and ma_container_npu_hbm_util for Snt9 series) |
% |
0%–100% |
Raw data > 98% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
|
Used NPU Memory |
ma_container_npu_memory_used_megabytes |
NPU memory used by a measured object (To be replaced by ma_container_npu_ddr_memory_usage_bytes for Snt3 series, and ma_container_npu_hbm_usage_bytes for Snt9 series) |
MB |
≥ 0 |
N/A |
N/A |
N/A |
|
Total NPU Memory |
ma_container_npu_memory_total_megabytes |
Total NPU memory of a measured object (To be replaced by ma_container_npu_ddr_memory_bytes for Snt3 series, and ma_container_npu_hbm_bytes for Snt9 series) |
MB |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor Error Codes |
ma_container_npu_ai_core_error_code |
Error codes of Ascend AI processors |
- |
- |
Raw data > 0 for three consecutive periods |
Critical |
Abnormal card. Submit a service ticket and contact O&M support. |
|
AI Processor Health Status |
ma_container_npu_ai_core_health_status |
Health status of Ascend AI processors |
- |
|
Raw data > 0 for two consecutive periods |
Critical |
Abnormal card. Submit a service ticket and contact O&M support. |
|
AI Processor Power Consumption |
ma_container_npu_ai_core_power_usage_watts |
Power consumption of Ascend AI processors |
Watt (W) |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor Temperature |
ma_container_npu_ai_core_temperature_celsius |
Temperature of Ascend AI processors |
°C |
Natural number |
N/A |
N/A |
N/A |
|
AI Core Usage |
ma_container_npu_ai_core_util |
AI core usage of Ascend AI processors |
% |
0%–100% |
Raw data > 95% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
|
Overall NPU Usage |
ma_container_npu_general_util |
NPU usage of Ascend AI processors (supported by driver version 24.1.RC2 and later) |
% |
0%–100% |
N/A |
N/A |
N/A |
|
AI Core Clock Frequency |
ma_container_npu_ai_core_frequency_hertz |
AI core clock frequency of Ascend AI processors |
Hertz (Hz) |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor Voltage |
ma_container_npu_ai_core_voltage_volts |
Voltage of Ascend AI processors |
Volt (V) |
Natural number |
N/A |
N/A |
N/A |
|
AI Processor DDR Memory |
ma_container_npu_ddr_memory_bytes |
Total DDR memory capacity of Ascend AI processors |
Byte |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor DDR Usage |
ma_container_npu_ddr_memory_usage_bytes |
DDR memory usage of Ascend AI processors |
Byte |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor DDR Memory Utilization |
ma_container_npu_ddr_memory_util |
DDR memory utilization of Ascend AI processors. This metric is invalid for Snt9C. |
% |
0%–100% |
Raw data > 95% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
|
AI Processor HBM Memory |
ma_container_npu_hbm_bytes |
Total HBM memory of Ascend AI processors (dedicated for Snt9 processors) |
Byte |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor HBM Memory Usage |
ma_container_npu_hbm_usage_bytes |
HBM memory usage of Ascend AI processors (dedicated for Snt9 processors) |
Byte |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor HBM Memory Utilization |
ma_container_npu_hbm_util |
HBM memory utilization of Ascend AI processors (dedicated for Snt9 processors) |
% |
0%–100% |
Raw data > 95% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
|
AI Processor HBM Memory Bandwidth Utilization |
ma_container_npu_hbm_bandwidth_util |
HBM memory bandwidth utilization of Ascend AI processors (dedicated for Snt9 processors) |
% |
0%–100% |
Raw data > 95% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
|
AI Processor HBM Memory Clock Frequency |
ma_container_npu_hbm_frequency_hertz |
HBM memory clock frequency of Ascend AI processors (dedicated for Snt9 processors) |
Hertz (Hz) |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor HBM Memory Temperature |
ma_container_npu_hbm_temperature_celsius |
HBM memory temperature of Ascend AI processors (dedicated for Snt9 processors) |
°C |
Natural number |
N/A |
N/A |
N/A |
|
AI CPU Utilization |
ma_container_npu_ai_cpu_util |
AI CPU utilization of Ascend AI processors |
% |
0%–100% |
N/A |
N/A |
N/A |
|
AI Processor Control CPU Utilization |
ma_container_npu_ctrl_cpu_util |
Control CPU utilization of Ascend AI processors |
% |
0%–100% |
N/A |
N/A |
N/A |
|
AI Processor Control CPU Frequency |
ma_container_npu_ctrl_cpu_frequency_hertz |
Control CPU frequency of Ascend AI processors |
Hertz (Hz) |
> 0 (system mode, available for dedicated resource pool users) |
N/A |
N/A |
N/A |
|
AI Vector Core Usage |
ma_container_npu_vector_core_util |
AI vector core usage of Ascend AI processors |
% |
0%–100% |
Raw data > 95% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
|
NPU RoCE network |
NPU RoCE Network Uplink Rate |
ma_container_npu_roce_tx_rate_bytes_per_second |
Uplink rate of the NPU network module used by the container |
Bytes/s |
≥ 0 |
N/A |
N/A |
N/A |
NPU RoCE Network Downlink Rate |
ma_container_npu_roce_rx_rate_bytes_per_second |
Downlink rate of the NPU network module used by the container |
Bytes/s |
≥ 0 |
N/A |
N/A |
N/A |
|
Notebook service metrics |
Notebook Cache Directory Size |
ma_container_notebook_cache_dir_size_bytes |
A high-speed local disk is attached to the /cache directory for GPU and NPU notebook instances. This metric indicates the total size of the directory. |
Bytes |
≥ 0 |
N/A |
N/A |
N/A |
Notebook Cache Directory Utilization |
ma_container_notebook_cache_dir_util |
A high-speed local disk is attached to the /cache directory for GPU and NPU notebook instances. This metric indicates the utilization of the directory. |
% |
0%–100% |
Raw data > 90% for two consecutive periods |
Major |
If the disk usage is too high, the notebook instance will be restarted. |
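Most alarm thresholds in the table above follow one pattern: the raw value exceeds a limit for N consecutive collection periods. A minimal, hypothetical sketch of that evaluation logic (the function name and sample values are illustrative, not part of ModelArts or AOM):

```python
def threshold_breached(samples, limit, periods=2):
    """Return True if the most recent `periods` samples all exceed `limit`,
    mirroring rules such as "Raw data > 95% for two consecutive periods"."""
    if len(samples) < periods:
        return False  # not enough periods collected yet
    return all(value > limit for value in samples[-periods:])

# Illustrative ma_container_cpu_util readings (%): the last two samples
# both exceed 95%, so the alarm condition holds.
readings = [90.0, 96.5, 97.2]
print(threshold_breached(readings, limit=95))  # True
```

AOM evaluates these rules server-side; the sketch only illustrates the "consecutive periods" semantics used throughout the table.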
Node-level Metrics
Category |
Name |
Metric |
Description |
Unit |
Value Range |
Alarm Threshold |
Alarm Severity |
Solution |
---|---|---|---|---|---|---|---|---|
CPU |
Total CPU Cores |
ma_node_cpu_limit_core |
Total number of CPU cores that have been requested for a measured object |
Cores |
≥ 1 |
N/A |
N/A |
N/A |
Used CPU Cores |
ma_node_cpu_used_core |
Number of CPU cores used by a measured object |
Cores |
≥ 0 |
N/A |
N/A |
N/A |
|
CPU Usage |
ma_node_cpu_util |
CPU usage of a measured object |
% |
0%–100% |
Raw data > 95% for two consecutive periods |
Major |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
|
CPU I/O Wait Time |
ma_node_cpu_iowait_counter |
Disk I/O wait time accumulated since system startup |
jiffies |
≥ 0 |
N/A |
N/A |
N/A |
|
Memory |
Physical Memory Usage |
ma_node_memory_util |
Percentage of the used physical memory to the total physical memory |
% |
0%–100% |
Raw data > 95% for two consecutive periods |
Major |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
Total Physical Memory |
ma_node_memory_total_megabytes |
Total physical memory that has been requested for a measured object |
MB |
≥ 0 |
N/A |
N/A |
N/A |
|
Network I/O |
Downlink Rate (BPS) |
ma_node_network_receive_rate_bytes_seconds |
Inbound traffic rate of a measured object |
Bytes/s |
≥ 0 |
N/A |
N/A |
N/A |
Uplink Rate (BPS) |
ma_node_network_transmit_rate_bytes_seconds |
Outbound traffic rate of a measured object |
Bytes/s |
≥ 0 |
N/A |
N/A |
N/A |
|
Storage |
Disk Read Rate |
ma_node_disk_read_rate_kilobytes_seconds |
Volume of data read from a disk per second (Only data disks used by containers are collected.) |
KB/s |
≥ 0 |
N/A |
N/A |
N/A |
Disk Write Rate |
ma_node_disk_write_rate_kilobytes_seconds |
Volume of data written into a disk per second (Only data disks used by containers are collected.) |
KB/s |
≥ 0 |
N/A |
N/A |
N/A |
|
Total Cache |
ma_node_cache_space_capacity_megabytes |
Total cache of the Kubernetes space |
MB |
≥ 0 |
N/A |
N/A |
N/A |
|
Used Cache |
ma_node_cache_space_used_capacity_megabytes |
Used cache of the Kubernetes space |
MB |
≥ 0 |
N/A |
N/A |
N/A |
|
Cache Usage |
ma_node_cache_space_used_percent |
Cache usage of the Kubernetes space |
% |
≥ 0 |
Raw data > 90% for two consecutive periods |
Critical |
Check the disk in a timely manner to avoid affecting services. Clear invalid data on compute nodes. |
|
Total Container Space |
ma_node_container_space_capacity_megabytes |
Total container space |
MB |
≥ 0 |
N/A |
N/A |
N/A |
|
Used Container Space |
ma_node_container_space_used_capacity_megabytes |
Used container space |
MB |
≥ 0 |
N/A |
N/A |
N/A |
|
Container Space Usage |
ma_node_container_space_used_percent |
Space usage of a container |
% |
≥ 0 |
Raw data > 90% for two consecutive periods |
Critical |
Check the disk in a timely manner to avoid affecting services. Clear invalid data on compute nodes. |
|
Disk Information |
ma_node_disk_info |
Basic disk information |
- |
≥ 0 |
N/A |
N/A |
N/A |
|
Total Reads |
ma_node_disk_reads_completed_total |
Total number of successful reads |
- |
≥ 0 |
N/A |
N/A |
N/A |
|
Merged Reads |
ma_node_disk_reads_merged_total |
Number of merged reads |
- |
≥ 0 |
N/A |
N/A |
N/A |
|
Bytes Read |
ma_node_disk_read_bytes_total |
Total number of bytes that are successfully read |
Bytes |
≥ 0 |
N/A |
N/A |
N/A |
|
Read Time Spent |
ma_node_disk_read_time_seconds_total |
Time spent on all reads |
Seconds |
≥ 0 |
N/A |
N/A |
N/A |
|
Total Writes |
ma_node_disk_writes_completed_total |
Total number of successful writes |
- |
≥ 0 |
N/A |
N/A |
N/A |
|
Merged Writes |
ma_node_disk_writes_merged_total |
Number of merged writes |
- |
≥ 0 |
N/A |
N/A |
N/A |
|
Written Bytes |
ma_node_disk_written_bytes_total |
Total number of bytes that are successfully written |
Bytes |
≥ 0 |
N/A |
N/A |
N/A |
|
Write Time Spent |
ma_node_disk_write_time_seconds_total |
Time spent on all write operations |
Seconds |
≥ 0 |
N/A |
N/A |
N/A |
|
Ongoing I/Os |
ma_node_disk_io_now |
Number of ongoing I/Os |
- |
≥ 0 |
N/A |
N/A |
N/A |
|
I/O Execution Duration |
ma_node_disk_io_time_seconds_total |
Time spent on executing I/Os |
Seconds |
≥ 0 |
N/A |
N/A |
N/A |
|
I/O Execution Weighted Time |
ma_node_disk_io_time_weighted_seconds_total |
Weighted time spent on executing I/Os |
Seconds |
≥ 0 |
N/A |
N/A |
N/A |
|
GPU |
GPU Usage |
ma_node_gpu_util |
GPU usage of a measured object |
% |
0%–100% |
N/A |
N/A |
N/A |
Total GPU Memory |
ma_node_gpu_mem_total_megabytes |
Total GPU memory of a measured object |
MB |
> 0 |
N/A |
N/A |
N/A |
|
GPU Memory Usage |
ma_node_gpu_mem_util |
Percentage of the used GPU memory to the total GPU memory |
% |
0%–100% |
Raw data > 97% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
|
Used GPU Memory |
ma_node_gpu_mem_used_megabytes |
GPU memory used by a measured object |
MB |
≥ 0 |
N/A |
N/A |
N/A |
|
Idle GPU Memory |
ma_node_gpu_mem_free_megabytes |
Idle GPU memory of a measured object |
MB |
> 0 |
N/A |
N/A |
N/A |
|
Tasks on a Shared GPU |
node_gpu_share_job_count |
Number of tasks running on a shared GPU |
Number |
≥ 0 |
N/A |
N/A |
N/A |
|
GPU Temperature |
DCGM_FI_DEV_GPU_TEMP |
GPU temperature |
°C |
Natural number |
N/A |
N/A |
N/A |
|
GPU Power |
DCGM_FI_DEV_POWER_USAGE |
GPU power |
Watt (W) |
> 0 |
N/A |
N/A |
N/A |
|
GPU Memory Temperature |
DCGM_FI_DEV_MEMORY_TEMP |
GPU memory temperature |
°C |
Natural number |
N/A |
N/A |
N/A |
|
NPU |
NPU Usage |
ma_node_npu_util |
NPU usage of a measured object (To be replaced by ma_node_npu_ai_core_util) |
% |
0%–100% |
N/A |
N/A |
N/A |
NPU Memory Usage |
ma_node_npu_memory_util |
Percentage of the used NPU memory to the total NPU memory (To be replaced by ma_node_npu_ddr_memory_util for Snt3 series, and ma_node_npu_hbm_util for Snt9 series) |
% |
0%–100% |
Raw data > 97% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
|
Used NPU Memory |
ma_node_npu_memory_used_megabytes |
NPU memory used by a measured object (To be replaced by ma_node_npu_ddr_memory_usage_bytes for Snt3 series, and ma_node_npu_hbm_usage_bytes for Snt9 series) |
MB |
≥ 0 |
N/A |
N/A |
N/A |
|
Total NPU Memory |
ma_node_npu_memory_total_megabytes |
Total NPU memory of a measured object (To be replaced by ma_node_npu_ddr_memory_bytes for Snt3 series, and ma_node_npu_hbm_bytes for Snt9 series) |
MB |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor Error Codes |
ma_node_npu_ai_core_error_code |
Error codes of Ascend AI processors |
- |
- |
N/A |
N/A |
N/A |
|
AI Processor Health Status |
ma_node_npu_ai_core_health_status |
Health status of Ascend AI processors |
- |
|
The value is 0 for two consecutive periods. |
Critical |
Submit a service ticket. |
|
AI Processor Power Consumption |
ma_node_npu_ai_core_power_usage_watts |
Power consumption of Ascend AI processors |
Watt (W) |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor Temperature |
ma_node_npu_ai_core_temperature_celsius |
Temperature of Ascend AI processors |
°C |
Natural number |
N/A |
N/A |
N/A |
|
AI Core Usage |
ma_node_npu_ai_core_util |
AI core usage of Ascend AI processors |
% |
0%–100% |
N/A |
N/A |
N/A |
|
Overall NPU Usage |
ma_node_npu_general_util |
NPU usage of Ascend AI processors (supported by driver version 24.1.RC2 and later) |
% |
0%–100% |
N/A |
N/A |
N/A |
|
AI Core Clock Frequency |
ma_node_npu_ai_core_frequency_hertz |
AI core clock frequency of Ascend AI processors |
Hertz (Hz) |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor Voltage |
ma_node_npu_ai_core_voltage_volts |
Voltage of Ascend AI processors |
Volt (V) |
Natural number |
N/A |
N/A |
N/A |
|
AI Processor DDR Memory |
ma_node_npu_ddr_memory_bytes |
Total DDR memory capacity of Ascend AI processors. This metric is invalid for Snt9C. |
Byte |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor DDR Usage |
ma_node_npu_ddr_memory_usage_bytes |
DDR memory usage of Ascend AI processors |
Byte |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor DDR Memory Utilization |
ma_node_npu_ddr_memory_util |
DDR memory utilization of Ascend AI processors |
% |
0%–100% |
Raw data > 90% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
|
AI Processor HBM Memory |
ma_node_npu_hbm_bytes |
Total HBM memory of Ascend AI processors (dedicated for Snt9 processors) |
Byte |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor HBM Memory Usage |
ma_node_npu_hbm_usage_bytes |
HBM memory usage of Ascend AI processors (dedicated for Snt9 processors) |
Byte |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor HBM Memory Utilization |
ma_node_npu_hbm_util |
HBM memory utilization of Ascend AI processors (dedicated for Snt9 processors) |
% |
0%–100% |
Raw data > 97% for two consecutive periods |
Suggestion |
Check whether the service resource usage meets the expectation. If the service is normal, no action is required. |
|
AI Processor HBM Memory Bandwidth Utilization |
ma_node_npu_hbm_bandwidth_util |
HBM memory bandwidth utilization of Ascend AI processors (dedicated for Snt9 processors) |
% |
0%–100% |
N/A |
N/A |
N/A |
|
AI Processor HBM Memory Clock Frequency |
ma_node_npu_hbm_frequency_hertz |
HBM memory clock frequency of Ascend AI processors (dedicated for Snt9 processors) |
Hertz (Hz) |
> 0 |
N/A |
N/A |
N/A |
|
AI Processor HBM Memory Temperature |
ma_node_npu_hbm_temperature_celsius |
HBM memory temperature of Ascend AI processors (dedicated for Snt9 processors) |
°C |
Natural number |
N/A |
N/A |
N/A |
|
AI CPU Utilization |
ma_node_npu_ai_cpu_util |
AI CPU utilization of Ascend AI processors |
% |
0%–100% |
N/A |
N/A |
N/A |
|
AI Processor Control CPU Utilization |
ma_node_npu_ctrl_cpu_util |
Control CPU utilization of Ascend AI processors |
% |
0%–100% |
N/A |
N/A |
N/A |
|
AI Processor Control CPU Frequency |
ma_node_npu_ctrl_cpu_frequency_hertz |
Control CPU frequency of Ascend AI processors |
Hertz (Hz) |
> 0 (system mode, available for dedicated resource pool users) |
N/A |
N/A |
N/A |
|
HBM ECC Detection Switch |
ma_node_npu_hbm_ecc_enable |
0 indicates that ECC detection is disabled. 1 indicates that ECC detection is enabled. |
- |
|
N/A |
N/A |
N/A |
|
Current HBM Single-bit Errors |
ma_node_npu_hbm_single_bit_error_total |
Current number of HBM single-bit errors |
Number |
≥ 0 |
N/A |
N/A |
N/A |
|
Current HBM Multi-bit Errors |
ma_node_npu_hbm_double_bit_error_total |
Current number of HBM multi-bit errors |
Number |
≥ 0 |
N/A |
N/A |
N/A |
|
Total Single-bit Errors in the HBM Life Cycle |
ma_node_npu_hbm_total_single_bit_error_total |
Total number of single-bit errors in the HBM life cycle |
Number |
≥ 0 |
N/A |
N/A |
N/A |
|
Total Multi-bit Errors in the HBM Life Cycle |
ma_node_npu_hbm_total_double_bit_error_total |
Total number of multi-bit errors in the HBM life cycle |
Number |
≥ 0 |
N/A |
N/A |
N/A |
|
Isolated NPU Memory Pages with HBM Single-bit Errors |
ma_node_npu_hbm_single_bit_isolated_pages_total |
Number of isolated NPU memory pages with HBM single-bit errors |
Number |
≥ 0 |
N/A |
N/A |
N/A |
|
Isolated NPU Memory Pages with HBM Multi-bit Errors |
ma_node_npu_hbm_double_bit_isolated_pages_total |
Number of isolated NPU memory pages with HBM multi-bit errors. Note: If there are more than 64 pages, change the NPU. |
Number |
≥ 0 |
Raw data ≥ 64 for two consecutive periods |
Critical |
If there are more than 64 pages, submit a service ticket and switch the NPU server. |
|
AI Vector Core Usage |
ma_node_npu_vector_core_util |
AI vector core usage of Ascend AI processors |
% |
0%–100% |
N/A |
N/A |
N/A |
|
NPU RoCE network |
NPU RoCE Network Uplink Rate |
ma_node_npu_roce_tx_rate_bytes_per_second |
NPU RoCE network uplink rate |
Bytes/s |
≥ 0 |
N/A |
N/A |
N/A |
NPU RoCE Network Downlink Rate |
ma_node_npu_roce_rx_rate_bytes_per_second |
NPU RoCE network downlink rate |
Bytes/s |
≥ 0 |
N/A |
N/A |
N/A |
|
MAC Uplink Pause Frames |
ma_node_npu_roce_mac_tx_pause_packets_total |
Total number of pause frame packets sent by NPU RoCE network MAC |
Number |
≥ 0 |
N/A |
N/A |
N/A |
|
MAC Downlink Pause Frames |
ma_node_npu_roce_mac_rx_pause_packets_total |
Total number of pause frame packets received by NPU RoCE network MAC |
Number |
≥ 0 |
N/A |
N/A |
N/A |
|
MAC Uplink PFC Frames |
ma_node_npu_roce_mac_tx_pfc_packets_total |
Total number of PFC frame packets sent by NPU RoCE network MAC |
Number |
≥ 0 |
delta(ma_node_npu_roce_mac_tx_pause_packets_total[1m]) > 0 |
Major |
Submit a service ticket. |
|
MAC Downlink PFC Frames |
ma_node_npu_roce_mac_rx_pfc_packets_total |
Total number of PFC frame packets received by NPU RoCE network MAC |
Number |
≥ 0 |
delta(ma_node_npu_roce_mac_rx_pause_packets_total[1m]) > 0 |
Major |
Submit a service ticket. |
|
MAC Uplink Bad Packets |
ma_node_npu_roce_mac_tx_bad_packets_total |
Total number of bad packets sent by NPU RoCE network MAC |
Number |
≥ 0 |
delta(ma_node_npu_roce_mac_tx_pfc_packets_total[1m]) > 0 |
Major |
Submit a service ticket. |
|
MAC Downlink Bad Packets |
ma_node_npu_roce_mac_rx_bad_packets_total |
Total number of bad packets received by NPU RoCE network MAC |
Number |
≥ 0 |
delta(ma_node_npu_roce_mac_rx_pfc_packets_total[1m]) > 0 |
Major |
Submit a service ticket. |
|
RoCE Uplink Bad Packets |
ma_node_npu_roce_tx_err_packets_total |
Total number of bad packets sent by NPU RoCE |
Number |
≥ 0 |
delta(ma_node_npu_roce_mac_tx_bad_packets_total[1m]) > 0 |
Major |
Submit a service ticket. |
|
RoCE Downlink Bad Packets |
ma_node_npu_roce_rx_err_packets_total |
Total number of bad packets received by NPU RoCE |
Number |
≥ 0 |
delta(ma_node_npu_roce_mac_rx_bad_packets_total[1m]) > 0 |
Major |
Submit a service ticket. |
|
RoCE Uplink Packets |
ma_node_npu_roce_tx_all_packets_total |
Total number of packets sent by NPU RoCE |
Number |
≥ 0 |
delta(ma_node_npu_roce_tx_err_packets_total[1m]) > 0 |
Major |
Submit a service ticket. |
|
RoCE Downlink Packets |
ma_node_npu_roce_rx_all_packets_total |
Total number of packets received by NPU RoCE |
Number |
≥ 0 |
delta(ma_node_npu_roce_rx_err_packets_total[1m]) > 0 |
Major |
Submit a service ticket. |
|
NPU optical module (These metrics are available for Snt9B/C air-cooled networking.) |
Optical Module Temperature |
ma_node_npu_optical_temperature |
Optical module temperature |
°C |
≥ 0 |
N/A |
N/A |
N/A |
Optical Module Power Voltage |
ma_node_npu_optical_vcc |
Power voltage of the optical module |
Millivolt (mV) |
≥ 0 |
N/A |
N/A |
N/A |
|
Optical Module Transmit Power 0 |
ma_node_npu_optical_tx_power0 |
Transmit power 0 of the optical module |
Milliwatt (mW) |
≥ 0 |
N/A |
N/A |
N/A |
|
Optical Module Transmit Power 1 |
ma_node_npu_optical_tx_power1 |
Transmit power 1 of the optical module |
Milliwatt (mW) |
≥ 0 |
N/A |
N/A |
N/A |
|
Optical Module Transmit Power 2 |
ma_node_npu_optical_tx_power2 |
Transmit power 2 of the optical module |
Milliwatt (mW) |
≥ 0 |
N/A |
N/A |
N/A |
|
Optical Module Transmit Power 3 |
ma_node_npu_optical_tx_power3 |
Transmit power 3 of the optical module |
Milliwatt (mW) |
≥ 0 |
N/A |
N/A |
N/A |
|
Optical Module Receive Power 0 |
ma_node_npu_optical_rx_power0 |
Receive power 0 of the optical module |
Milliwatt (mW) |
≥ 0 |
N/A |
N/A |
N/A |
|
Optical Module Receive Power 1 |
ma_node_npu_optical_rx_power1 |
Receive power 1 of the optical module |
Milliwatt (mW) |
≥ 0 |
N/A |
N/A |
N/A |
|
Optical Module Receive Power 2 |
ma_node_npu_optical_rx_power2 |
Receive power 2 of the optical module |
Milliwatt (mW) |
≥ 0 |
N/A |
N/A |
N/A |
|
Optical Module Receive Power 3 |
ma_node_npu_optical_rx_power3 |
Receive power 3 of the optical module |
Milliwatt (mW) |
≥ 0 |
N/A |
N/A |
N/A |
|
InfiniBand or RoCE network |
Total Amount of Data Received by a NIC |
ma_node_infiniband_port_received_data_bytes_total |
The total number of data octets, divided by 4 (counting in double words, 32 bits), received on all VLs from the port. |
Bytes |
≥ 0 |
N/A |
N/A |
N/A |
Total Amount of Data Sent by a NIC |
ma_node_infiniband_port_transmitted_data_bytes_total |
The total number of data octets, divided by 4 (counting in double words, 32 bits), transmitted on all VLs from the port. |
Bytes |
≥ 0 |
N/A |
N/A |
N/A |
|
NFS mounting status |
NFS Getattr Congestion Time |
ma_node_mountstats_getattr_backlog_wait |
Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, owner, etc. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. |
ms |
≥ 0 |
N/A |
N/A |
N/A |
NFS Getattr Round Trip Time |
ma_node_mountstats_getattr_rtt |
Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, owner, etc. RTT stands for Round Trip Time and it is the time from when the kernel RPC client sends the RPC request to the time it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement for NFS latency. A high RTT can indicate network or server issues. |
ms |
≥ 0 |
N/A |
N/A |
N/A |
|
NFS Access Congestion Time |
ma_node_mountstats_access_backlog_wait |
Access is an NFS operation that checks the access permissions of a file or directory for a given user. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. |
ms |
≥ 0 |
N/A |
N/A |
N/A |
|
NFS Access Round Trip Time |
ma_node_mountstats_access_rtt |
Access is an NFS operation that checks the access permissions of a file or directory for a given user. RTT stands for Round Trip Time and it is the time from when the kernel RPC client sends the RPC request to the time it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement for NFS latency. A high RTT can indicate network or server issues. |
ms |
≥ 0 |
N/A |
N/A |
N/A |
|
NFS Lookup Congestion Time |
ma_node_mountstats_lookup_backlog_wait |
Lookup is an NFS operation that resolves a file name in a directory to a file handle. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. |
ms |
≥ 0 |
N/A |
N/A |
N/A |
|
NFS Lookup Round Trip Time |
ma_node_mountstats_lookup_rtt |
Lookup is an NFS operation that resolves a file name in a directory to a file handle. RTT stands for Round Trip Time and it is the time from when the kernel RPC client sends the RPC request to the time it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement for NFS latency. A high RTT can indicate network or server issues. |
ms |
≥ 0 |
N/A |
N/A |
N/A |
|
NFS Read Congestion Time |
ma_node_mountstats_read_backlog_wait |
Read is an NFS operation that reads data from a file. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. |
ms |
≥ 0 |
N/A |
N/A |
N/A |
|
NFS Read Round Trip Time |
ma_node_mountstats_read_rtt |
Read is an NFS operation that reads data from a file. RTT (Round Trip Time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply. RTT includes network transit time and server execution time, making it a good measure of NFS latency. A high RTT can indicate network or server issues. |
ms |
≥ 0 |
N/A |
N/A |
N/A |
|
NFS Write Congestion Time |
ma_node_mountstats_write_backlog_wait |
Write is an NFS operation that writes data to a file. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. |
ms |
≥ 0 |
N/A |
N/A |
N/A |
|
NFS Write Round Trip Time |
ma_node_mountstats_write_rtt |
Write is an NFS operation that writes data to a file. RTT (Round Trip Time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply. RTT includes network transit time and server execution time, making it a good measure of NFS latency. A high RTT can indicate network or server issues. |
ms |
≥ 0 |
N/A |
N/A |
N/A |
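The backlog wait and RTT columns above are averages derived from the kernel's cumulative per-operation RPC statistics, which Linux exposes in /proc/self/mountstats. A minimal sketch of that derivation, assuming the standard per-op line layout (ops, transmissions, timeouts, bytes sent, bytes received, cumulative queue ms, cumulative RTT ms, cumulative execute ms); the sample line and its values below are illustrative, not taken from a live system:

```python
def per_op_latencies(mountstats_line):
    """Compute average backlog wait and RTT (ms) for one NFS operation.

    A per-op line in /proc/self/mountstats has the form:
      OP: ops trans timeouts bytes_sent bytes_recv queue_ms rtt_ms execute_ms
    where queue_ms and rtt_ms are cumulative milliseconds across all calls.
    """
    op, rest = mountstats_line.split(":")
    fields = [int(x) for x in rest.split()]
    ops, queue_ms, rtt_ms = fields[0], fields[5], fields[6]
    if ops == 0:
        return op, 0.0, 0.0
    # Average backlog wait: time a request sat queued on the client per call.
    # Average RTT: network transit plus server execution time per call.
    return op, queue_ms / ops, rtt_ms / ops

# Illustrative sample line (values are made up):
op, backlog, rtt = per_op_latencies("READ: 1000 1000 0 160000 4096000 250 1500 1800")
print(op, backlog, rtt)  # READ 0.25 1.5
```

Because both counters are cumulative, a monitoring agent would normally take the difference between two samples before dividing by the operation-count delta.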
Networking Metrics
Category |
Name |
Metric |
Description |
Unit |
Value Range |
---|---|---|---|---|---|
InfiniBand or RoCE network |
PortXmitData |
infiniband_port_xmit_data_total |
Total number of data octets, divided by 4 (that is, counted in 32-bit double words), transmitted on all VLs from the port. |
Total count |
Natural number |
PortRcvData |
infiniband_port_rcv_data_total |
Total number of data octets, divided by 4 (that is, counted in 32-bit double words), received on all VLs from the port. |
Total count |
Natural number |
|
SymbolErrorCounter |
infiniband_symbol_error_counter_total |
Total number of minor link errors detected on one or more physical lanes. |
Total count |
Natural number |
|
LinkErrorRecoveryCounter |
infiniband_link_error_recovery_counter_total |
Total number of times the Port Training state machine has successfully completed the link error recovery process. |
Total count |
Natural number |
|
PortRcvErrors |
infiniband_port_rcv_errors_total |
Total number of packets containing errors that were received on the port, including: local physical errors (ICRC, VCRC, LPCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine); malformed data packet errors (LVer, length, VL); malformed link packet errors (operand, length, VL); and packets discarded due to buffer overrun (overflow). |
Total count |
Natural number |
|
LocalLinkIntegrityErrors |
infiniband_local_link_integrity_errors_total |
This counter indicates the number of retries initiated by a link transfer layer receiver. |
Total count |
Natural number |
|
PortRcvRemotePhysicalErrors |
infiniband_port_rcv_remote_physical_errors_total |
Total number of packets marked with the EBP delimiter received on the port. |
Total count |
Natural number |
|
PortRcvSwitchRelayErrors |
infiniband_port_rcv_switch_relay_errors_total |
Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay, for the following reasons: DLID mapping, VL mapping, or looping (output port = input port). |
Total count |
Natural number |
|
PortXmitWait |
infiniband_port_transmit_wait_total |
The number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or because of lack of arbitration). |
Total count |
Natural number |
|
PortXmitDiscards |
infiniband_port_xmit_discards_total |
Total number of outbound packets discarded by the port because the port is down or congested. |
Total count |
Natural number |
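On Linux, the port counters above are exposed through sysfs under /sys/class/infiniband/&lt;device&gt;/ports/&lt;port&gt;/counters/. A minimal sketch of reading them directly, assuming that sysfs layout; note that port_xmit_data and port_rcv_data count 32-bit double words, so the value is multiplied by 4 to obtain bytes (the device name "mlx5_0" in the usage comment is a placeholder, not something this document specifies):

```python
from pathlib import Path

def read_ib_counter(device, port, counter, root="/sys/class/infiniband"):
    """Read one InfiniBand port counter from sysfs as an integer."""
    path = Path(root) / device / "ports" / str(port) / "counters" / counter
    return int(path.read_text().strip())

def xmit_bytes(device, port, root="/sys/class/infiniband"):
    """port_xmit_data counts in 32-bit double words; multiply by 4 for bytes."""
    return 4 * read_ib_counter(device, port, "port_xmit_data", root=root)

# Example (requires an InfiniBand device; "mlx5_0" is a placeholder name):
# print(xmit_bytes("mlx5_0", 1))
```

These sysfs counters are cumulative since link initialization, which is why the metric names above carry the _total suffix; rates are obtained by differencing successive samples.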
Label Metrics
Classification |
Label |
Description |
---|---|---|
Container metrics |
modelarts_service |
Service to which a container belongs, which can be notebook, train, or infer |
instance_name |
Name of the pod to which the container belongs |
|
service_id |
Instance or job ID displayed on the page, for example, cf55829e-9bd3-48fa-8071-7ae870dae93a for a development environment or 9f322d5a-b1d2-4370-94df-5a87de27d36e for a training job |
|
node_ip |
IP address of the node to which the container belongs |
|
container_id |
Container ID |
|
cid |
Cluster ID |
|
container_name |
Container name |
|
project_id |
Project ID of the account to which the user belongs |
|
user_id |
User ID of the account to which the user who submits the job belongs |
|
pool_id |
ID of a resource pool corresponding to a physical dedicated resource pool |
|
pool_name |
Name of a resource pool corresponding to a physical dedicated resource pool |
|
logical_pool_id |
ID of a logical subpool |
|
logical_pool_name |
Name of a logical subpool |
|
gpu_uuid |
UUID of the GPU used by the container |
|
gpu_index |
Index of the GPU used by the container |
|
gpu_type |
Type of the GPU used by the container |
|
account_name |
Account name of the creator of a training, inference, or development environment task |
|
user_name |
Username of the creator of a training, inference, or development environment task |
|
task_creation_time |
Time when a training, inference, or development environment task is created |
|
task_name |
Name of a training, inference, or development environment task |
|
task_spec_code |
Specifications of a training, inference, or development environment task |
|
cluster_name |
CCE cluster name |
|
Node metrics |
cid |
ID of the CCE cluster to which the node belongs |
node_ip |
IP address of the node |
|
host_name |
Hostname of a node |
|
pool_id |
ID of a resource pool corresponding to a physical dedicated resource pool |
|
project_id |
Project ID of the user in a physical dedicated resource pool |
|
gpu_uuid |
UUID of a node GPU |
|
gpu_index |
Index of a node GPU |
|
gpu_type |
Type of a node GPU |
|
device_name |
Device name of an InfiniBand or RoCE network NIC |
|
port |
Port number of the InfiniBand NIC |
|
physical_state |
Status of each port on the InfiniBand NIC |
|
firmware_version |
Firmware version of the IB NIC |
|
filesystem |
NFS-mounted file system |
|
mount_point |
NFS mount point |
|
Diagnosis metrics |
cid |
ID of the CCE cluster to which the node where the GPU resides belongs |
node_ip |
IP address of the node where the GPU resides |
|
pool_id |
ID of a resource pool corresponding to a physical dedicated resource pool |
|
project_id |
Project ID of the user in a physical dedicated resource pool |
|
gpu_uuid |
GPU UUID |
|
gpu_index |
Index of a node GPU |
|
gpu_type |
Type of a node GPU |
|
device_name |
Name of a network device or disk device |
|
port |
Port number of the InfiniBand NIC |
|
physical_state |
Status of each port on the InfiniBand NIC |
|
firmware_version |
Firmware version of the InfiniBand NIC |
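These labels can be used on the Metric Browsing page to narrow a Prometheus statement to a single job, node, or pool. A small helper that assembles such a selector; the label values in the example are hypothetical, chosen only to illustrate the label names from the table above:

```python
def promql_selector(metric, **labels):
    """Build a PromQL instant-vector selector from a metric name and labels."""
    if not labels:
        return metric
    # Sort labels so the generated query string is deterministic.
    pairs = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{pairs}}}"

# Hypothetical values: filter container CPU usage to training-job containers
# in one dedicated resource pool.
query = promql_selector("ma_container_cpu_util",
                        modelarts_service="train",
                        pool_id="pool-example")
print(query)
# ma_container_cpu_util{modelarts_service="train", pool_id="pool-example"}
```

The resulting string can be pasted into the Prometheus statement box on the Metric Browsing page described at the start of this section.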