Updated on 2024-12-26 GMT+08:00

Viewing All ModelArts Monitoring Metrics on the AOM Console

ModelArts periodically collects usage data for key resources (such as GPUs, NPUs, CPUs, and memory) on each node in a resource pool, as well as key metrics for development environments, training jobs, and inference services, and reports the data to AOM. You can view the information on the AOM console.

Viewing Monitoring Metrics on the AOM Console

  1. Log in to the console and search for AOM to go to the AOM console.
  2. In the navigation pane on the left, choose Metric Browsing.
  3. Select the Prometheus_AOM_Default instance from the drop-down list.
    Figure 1 Specifying the metric source
  4. Select one or more metrics from All metrics or Prometheus statement.
    Figure 2 Adding a metric

    For details about how to view metrics, see Application Operations Management > User Guide (2.0) > Metric Browsing in the Huawei Cloud Help Center.
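Since AOM exposes a Prometheus-compatible query interface, the metrics in the tables below can also be fetched programmatically. The sketch below is illustrative only: the endpoint URL and the sample response are assumptions (the response follows the standard Prometheus instant-query JSON shape); replace the URL and add your instance's authentication before use.

```python
import json
from urllib.parse import urlencode

# Hypothetical Prometheus-compatible endpoint; replace with your AOM
# instance's actual query URL and attach your own credentials.
AOM_PROM_URL = "https://aom.example.com/api/v1/query"

def build_query_url(metric, selector=""):
    """Build a Prometheus instant-query URL for a ModelArts metric."""
    return AOM_PROM_URL + "?" + urlencode({"query": metric + selector})

# A sample response in the standard Prometheus instant-query shape.
sample_response = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"__name__": "ma_container_cpu_util",
                                 "pod_name": "train-0"},
                      "value": [1735190400, "42.5"]}]}}
""")

def extract_values(response):
    """Return {pod_name: value} pairs from an instant-query result."""
    return {r["metric"].get("pod_name", "?"): float(r["value"][1])
            for r in response["data"]["result"]}

print(build_query_url("ma_container_cpu_util"))
print(extract_values(sample_response))  # {'train-0': 42.5}
```

The same helper works for any metric name from the tables below, for example `ma_container_npu_util` or `ma_node_memory_util`.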

Container-level Metrics

Table 1 Container metrics

Category

Name

Metric

Description

Unit

Value Range

Alarm Threshold

Alarm Severity

Solution

CPU

CPU Usage

ma_container_cpu_util

CPU usage of a measured object

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Used CPU Cores

ma_container_cpu_used_core

Number of CPU cores used by a measured object

Cores

≥ 0

N/A

N/A

N/A

Total CPU Cores

ma_container_cpu_limit_core

Total number of CPU cores that have been requested for a measured object

Cores

≥ 1

N/A

N/A

N/A

Memory

Total Physical Memory

ma_container_memory_capacity_megabytes

Total physical memory that has been requested for a measured object

MB

≥ 0

N/A

N/A

N/A

Physical Memory Usage

ma_container_memory_util

Percentage of the used physical memory to the total physical memory

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Used Physical Memory

ma_container_memory_used_megabytes

Physical memory that has been used by a measured object (container_memory_working_set_bytes in the current working set)

(Memory usage in a working set = active anonymous pages and cache, plus file-backed pages; ≤ container_memory_usage_bytes)

MB

≥ 0

N/A

N/A

N/A

Storage

Disk Read Rate

ma_container_disk_read_kilobytes

Volume of data read from a disk per second

KB/s

≥ 0

N/A

N/A

N/A

Disk Write Rate

ma_container_disk_write_kilobytes

Volume of data written into a disk per second

KB/s

≥ 0

N/A

N/A

N/A

GPU memory

Total GPU Memory

ma_container_gpu_mem_total_megabytes

Total GPU memory of a training job

MB

> 0

N/A

N/A

N/A

GPU Memory Usage

ma_container_gpu_mem_util

Percentage of the used GPU memory to the total GPU memory

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Used GPU Memory

ma_container_gpu_mem_used_megabytes

GPU memory used by a measured object

MB

≥ 0

N/A

N/A

N/A

Idle GPU Memory

ma_container_gpu_mem_free_megabytes

Idle GPU memory of a measured object

MB

≥ 0

N/A

N/A

N/A

GPU

GPU usage

ma_container_gpu_util

GPU usage of a measured object

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

GPU Memory Bandwidth Usage

ma_container_gpu_mem_copy_util

GPU memory bandwidth usage of a measured object. For example, the maximum memory bandwidth of a Vnt1 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%.

%

0%–100%

N/A

N/A

N/A

GPU Encoder Usage

ma_container_gpu_enc_util

GPU encoder usage of a measured object

%

0%–100%

N/A

N/A

N/A

GPU Decoder Usage

ma_container_gpu_dec_util

GPU decoder usage of a measured object

%

0%–100%

N/A

N/A

N/A

GPU Temperature

DCGM_FI_DEV_GPU_TEMP

GPU temperature

°C

Natural number

N/A

N/A

N/A

GPU Power

DCGM_FI_DEV_POWER_USAGE

GPU power

Watt (W)

> 0

N/A

N/A

N/A

GPU Memory Temperature

DCGM_FI_DEV_MEMORY_TEMP

GPU memory temperature

°C

Natural number

N/A

N/A

N/A

Network I/O

Downlink Rate

ma_container_network_receive_bytes

Inbound traffic rate of a measured object

Bytes/s

≥ 0

N/A

N/A

N/A

Packet RX Rate

ma_container_network_receive_packets

Number of data packets received by an NIC per second

Packets/s

≥ 0

N/A

N/A

N/A

Downlink Error Rate

ma_container_network_receive_error_packets

Number of error packets received by a NIC per second

Packets/s

≥ 0

Raw data > 1 for two consecutive periods

Critical

Packet loss occurred on the network. Submit a service ticket and contact O&M support to locate the fault.

Uplink Rate

ma_container_network_transmit_bytes

Outbound traffic rate of a measured object

Bytes/s

≥ 0

N/A

N/A

N/A

Uplink Error Rate

ma_container_network_transmit_error_packets

Number of error packets sent by a NIC per second

Packets/s

≥ 0

Raw data > 1 for two consecutive periods

Critical

Packet loss occurred on the network. Submit a service ticket and contact O&M support to locate the fault.

Packet TX Rate

ma_container_network_transmit_packets

Number of data packets sent by a NIC per second

Packets/s

≥ 0

N/A

N/A

N/A

NPU

NPU Usage

ma_container_npu_util

NPU usage of a measured object (To be replaced by ma_container_npu_ai_core_util)

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

NPU Memory Usage

ma_container_npu_memory_util

Percentage of the used NPU memory to the total NPU memory (To be replaced by ma_container_npu_ddr_memory_util for Snt3 series, and ma_container_npu_hbm_util for Snt9 series)

%

0%–100%

Raw data > 98% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Used NPU Memory

ma_container_npu_memory_used_megabytes

NPU memory used by a measured object (To be replaced by ma_container_npu_ddr_memory_usage_bytes for Snt3 series, and ma_container_npu_hbm_usage_bytes for Snt9 series)

MB

≥ 0

N/A

N/A

N/A

Total NPU Memory

ma_container_npu_memory_total_megabytes

Total NPU memory of a measured object (To be replaced by ma_container_npu_ddr_memory_bytes for Snt3 series, and ma_container_npu_hbm_bytes for Snt9 series)

MB

> 0

N/A

N/A

N/A

AI Processor Error Codes

ma_container_npu_ai_core_error_code

Error codes of Ascend AI processors

-

-

Raw data > 0 for three consecutive periods

Critical

The card is abnormal. Submit a service ticket and contact O&M support.

AI Processor Health Status

ma_container_npu_ai_core_health_status

Health status of Ascend AI processors

-

  • 1: healthy
  • 0: unhealthy

The value is 0 for two consecutive periods.

Critical

The card is abnormal. Submit a service ticket and contact O&M support.

AI Processor Power Consumption

ma_container_npu_ai_core_power_usage_watts

Power consumption of Ascend AI processors

Watt (W)

> 0

N/A

N/A

N/A

AI Processor Temperature

ma_container_npu_ai_core_temperature_celsius

Temperature of Ascend AI processors

°C

Natural number

N/A

N/A

N/A

AI Core Usage

ma_container_npu_ai_core_util

AI core usage of Ascend AI processors

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Overall NPU Usage

ma_container_npu_general_util

NPU usage of Ascend AI processors (supported by driver version 24.1.RC2 and later)

%

0%–100%

N/A

N/A

N/A

AI Core Clock Frequency

ma_container_npu_ai_core_frequency_hertz

AI core clock frequency of Ascend AI processors

Hertz (Hz)

> 0

N/A

N/A

N/A

AI Processor Voltage

ma_container_npu_ai_core_voltage_volts

Voltage of Ascend AI processors

Volt (V)

Natural number

N/A

N/A

N/A

AI Processor DDR Memory

ma_container_npu_ddr_memory_bytes

Total DDR memory capacity of Ascend AI processors

Byte

> 0

N/A

N/A

N/A

AI Processor DDR Usage

ma_container_npu_ddr_memory_usage_bytes

DDR memory usage of Ascend AI processors

Byte

> 0

N/A

N/A

N/A

AI Processor DDR Memory Utilization

ma_container_npu_ddr_memory_util

DDR memory utilization of Ascend AI processors

Invalid metric for Snt9C.

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

AI Processor HBM Memory

ma_container_npu_hbm_bytes

Total HBM memory of Ascend AI processors (dedicated for Snt9 processors)

Byte

> 0

N/A

N/A

N/A

AI Processor HBM Memory Usage

ma_container_npu_hbm_usage_bytes

HBM memory usage of Ascend AI processors (dedicated for Snt9 processors)

Byte

> 0

N/A

N/A

N/A

AI Processor HBM Memory Utilization

ma_container_npu_hbm_util

HBM memory utilization of Ascend AI processors (dedicated for Snt9 processors)

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

AI Processor HBM Memory Bandwidth Utilization

ma_container_npu_hbm_bandwidth_util

HBM memory bandwidth utilization of Ascend AI processors (dedicated for Snt9 processors)

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

AI Processor HBM Memory Clock Frequency

ma_container_npu_hbm_frequency_hertz

HBM memory clock frequency of Ascend AI processors (dedicated for Snt9 processors)

Hertz (Hz)

> 0

N/A

N/A

N/A

AI Processor HBM Memory Temperature

ma_container_npu_hbm_temperature_celsius

HBM memory temperature of Ascend AI processors (dedicated for Snt9 processors)

°C

Natural number

N/A

N/A

N/A

AI CPU Utilization

ma_container_npu_ai_cpu_util

AI CPU utilization of Ascend AI processors

%

0%–100%

N/A

N/A

N/A

AI Processor Control CPU Utilization

ma_container_npu_ctrl_cpu_util

Control CPU utilization of Ascend AI processors

%

0%–100%

N/A

N/A

N/A

AI Processor Control CPU Frequency

ma_node_npu_ctrl_cpu_frequency_hertz

Control CPU frequency of Ascend AI processors (collected in system mode; dedicated resource pools only)

Hertz (Hz)

> 0

N/A

N/A

N/A

AI Vector Core Usage

ma_container_npu_vector_core_util

AI vector core usage of Ascend AI processors

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

NPU RoCE network

NPU RoCE Network Uplink Rate

ma_container_npu_roce_tx_rate_bytes_per_second

Uplink rate of the NPU network module used by the container

Bytes/s

≥ 0

N/A

N/A

N/A

NPU RoCE Network Downlink Rate

ma_container_npu_roce_rx_rate_bytes_per_second

Downlink rate of the NPU network module used by the container

Bytes/s

≥ 0

N/A

N/A

N/A

Notebook service metrics

Notebook Cache Directory Size

ma_container_notebook_cache_dir_size_bytes

A high-speed local disk is attached to the /cache directory for GPU and NPU notebook instances. This metric indicates the total size of the directory.

Bytes

≥ 0

N/A

N/A

N/A

Notebook Cache Directory Utilization

ma_container_notebook_cache_dir_util

A high-speed local disk is attached to the /cache directory for GPU and NPU notebook instances. This metric indicates the utilization of the directory.

%

0%–100%

Raw data > 90% for two consecutive periods

Major

If the disk usage is too high, the notebook instance will be restarted.
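Several alarm thresholds in the table above follow the pattern "raw data exceeds a limit for N consecutive collection periods". A minimal sketch of that evaluation logic (the function name and sample values are illustrative, not part of the ModelArts service):

```python
from collections import deque

def make_threshold_alarm(limit, periods=2):
    """Return a checker that fires when `periods` consecutive raw
    samples all exceed `limit` (e.g. CPU usage > 95% twice in a row)."""
    window = deque(maxlen=periods)
    def check(sample):
        window.append(sample)
        return len(window) == periods and all(v > limit for v in window)
    return check

# Mimic the ma_container_cpu_util rule: raw data > 95% for two periods.
cpu_alarm = make_threshold_alarm(95.0, periods=2)
samples = [90.0, 96.2, 97.8, 80.0]
fired = [cpu_alarm(s) for s in samples]
print(fired)  # [False, False, True, False]
```

Only the third sample fires, because it is the first time two consecutive samples both exceed the limit; a single spike is not enough.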

Node-level Metrics

Table 2 Node metrics (collected only in dedicated resource pools)

Category

Name

Metric

Description

Unit

Value Range

Alarm Threshold

Alarm Severity

Solution

CPU

Total CPU Cores

ma_node_cpu_limit_core

Total number of CPU cores that have been requested for a measured object

Cores

≥ 1

N/A

N/A

N/A

Used CPU Cores

ma_node_cpu_used_core

Number of CPU cores used by a measured object

Cores

≥ 0

N/A

N/A

N/A

CPU Usage

ma_node_cpu_util

CPU usage of a measured object

%

0%–100%

Raw data > 95% for two consecutive periods

Major

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

CPU I/O Wait Time

ma_node_cpu_iowait_counter

Disk I/O wait time accumulated since system startup

jiffies

≥ 0

N/A

N/A

N/A

Memory

Physical Memory Usage

ma_node_memory_util

Percentage of the used physical memory to the total physical memory

%

0%–100%

Raw data > 95% for two consecutive periods

Major

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Total Physical Memory

ma_node_memory_total_megabytes

Total physical memory that has been requested for a measured object

MB

≥ 0

N/A

N/A

N/A

Network I/O

Downlink Rate (BPS)

ma_node_network_receive_rate_bytes_seconds

Inbound traffic rate of a measured object

Bytes/s

≥ 0

N/A

N/A

N/A

Uplink Rate (BPS)

ma_node_network_transmit_rate_bytes_seconds

Outbound traffic rate of a measured object

Bytes/s

≥ 0

N/A

N/A

N/A

Storage

Disk Read Rate

ma_node_disk_read_rate_kilobytes_seconds

Volume of data read from a disk per second (Only data disks used by containers are collected.)

KB/s

≥ 0

N/A

N/A

N/A

Disk Write Rate

ma_node_disk_write_rate_kilobytes_seconds

Volume of data written into a disk per second (Only data disks used by containers are collected.)

KB/s

≥ 0

N/A

N/A

N/A

Total Cache

ma_node_cache_space_capacity_megabytes

Total cache of the Kubernetes space

MB

≥ 0

N/A

N/A

N/A

Used Cache

ma_node_cache_space_used_capacity_megabytes

Used cache of the Kubernetes space

MB

≥ 0

N/A

N/A

N/A

Cache Usage

ma_node_cache_space_used_percent

Cache usage of the Kubernetes space

%

≥ 0

Raw data > 90% for two consecutive periods

Critical

Check the disk in a timely manner to avoid affecting services. Clear invalid data on compute nodes.

Total Container Space

ma_node_container_space_capacity_megabytes

Total container space

MB

≥ 0

N/A

N/A

N/A

Used Container Space

ma_node_container_space_used_capacity_megabytes

Used container space

MB

≥ 0

N/A

N/A

N/A

Container Space Usage

ma_node_container_space_used_percent

Space usage of a container

%

≥ 0

Raw data > 90% for two consecutive periods

Critical

Check the disk in a timely manner to avoid affecting services. Clear invalid data on compute nodes.

Disk Information

ma_node_disk_info

Basic disk information

-

≥ 0

N/A

N/A

N/A

Total Reads

ma_node_disk_reads_completed_total

Total number of successful reads

-

≥ 0

N/A

N/A

N/A

Merged Reads

ma_node_disk_reads_merged_total

Number of merged reads

-

≥ 0

N/A

N/A

N/A

Bytes Read

ma_node_disk_read_bytes_total

Total number of bytes that are successfully read

Bytes

≥ 0

N/A

N/A

N/A

Read Time Spent

ma_node_disk_read_time_seconds_total

Time spent on all reads

Seconds

≥ 0

N/A

N/A

N/A

Total Writes

ma_node_disk_writes_completed_total

Total number of successful writes

-

≥ 0

N/A

N/A

N/A

Merged Writes

ma_node_disk_writes_merged_total

Number of merged writes

-

≥ 0

N/A

N/A

N/A

Written Bytes

ma_node_disk_written_bytes_total

Total number of bytes that are successfully written

Bytes

≥ 0

N/A

N/A

N/A

Write Time Spent

ma_node_disk_write_time_seconds_total

Time spent on all write operations

Seconds

≥ 0

N/A

N/A

N/A

Ongoing I/Os

ma_node_disk_io_now

Number of ongoing I/Os

-

≥ 0

N/A

N/A

N/A

I/O Execution Duration

ma_node_disk_io_time_seconds_total

Time spent on executing I/Os

Seconds

≥ 0

N/A

N/A

N/A

I/O Execution Weighted Time

ma_node_disk_io_time_weighted_seconds_total

Weighted time spent on executing I/Os

Seconds

≥ 0

N/A

N/A

N/A

GPU

GPU Usage

ma_node_gpu_util

GPU usage of a measured object

%

0%–100%

N/A

N/A

N/A

Total GPU Memory

ma_node_gpu_mem_total_megabytes

Total GPU memory of a measured object

MB

> 0

N/A

N/A

N/A

GPU Memory Usage

ma_node_gpu_mem_util

Percentage of the used GPU memory to the total GPU memory

%

0%–100%

Raw data > 97% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Used GPU Memory

ma_node_gpu_mem_used_megabytes

GPU memory used by a measured object

MB

≥ 0

N/A

N/A

N/A

Idle GPU Memory

ma_node_gpu_mem_free_megabytes

Idle GPU memory of a measured object

MB

> 0

N/A

N/A

N/A

Tasks on a Shared GPU

node_gpu_share_job_count

Number of tasks running on a shared GPU

Number

≥ 0

N/A

N/A

N/A

GPU Temperature

DCGM_FI_DEV_GPU_TEMP

GPU temperature

°C

Natural number

N/A

N/A

N/A

GPU Power

DCGM_FI_DEV_POWER_USAGE

GPU power

Watt (W)

> 0

N/A

N/A

N/A

GPU Memory Temperature

DCGM_FI_DEV_MEMORY_TEMP

GPU memory temperature

°C

Natural number

N/A

N/A

N/A

NPU

NPU Usage

ma_node_npu_util

NPU usage of a measured object (To be replaced by ma_node_npu_ai_core_util)

%

0%–100%

N/A

N/A

N/A

NPU Memory Usage

ma_node_npu_memory_util

Percentage of the used NPU memory to the total NPU memory (To be replaced by ma_node_npu_ddr_memory_util for Snt3 series, and ma_node_npu_hbm_util for Snt9 series)

%

0%–100%

Raw data > 97% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Used NPU Memory

ma_node_npu_memory_used_megabytes

NPU memory used by a measured object (To be replaced by ma_node_npu_ddr_memory_usage_bytes for Snt3 series, and ma_node_npu_hbm_usage_bytes for Snt9 series)

MB

≥ 0

N/A

N/A

N/A

Total NPU Memory

ma_node_npu_memory_total_megabytes

Total NPU memory of a measured object (To be replaced by ma_node_npu_ddr_memory_bytes for Snt3 series, and ma_node_npu_hbm_bytes for Snt9 series)

MB

> 0

N/A

N/A

N/A

AI Processor Error Codes

ma_node_npu_ai_core_error_code

Error codes of Ascend AI processors

-

-

N/A

N/A

N/A

AI Processor Health Status

ma_node_npu_ai_core_health_status

Health status of Ascend AI processors

-

  • 1: healthy
  • 0: unhealthy

The value is 0 for two consecutive periods.

Critical

Submit a service ticket.

AI Processor Power Consumption

ma_node_npu_ai_core_power_usage_watts

Power consumption of Ascend AI processors

Watt (W)

> 0

N/A

N/A

N/A

AI Processor Temperature

ma_node_npu_ai_core_temperature_celsius

Temperature of Ascend AI processors

°C

Natural number

N/A

N/A

N/A

AI Core Usage

ma_node_npu_ai_core_util

AI core usage of Ascend AI processors

%

0%–100%

N/A

N/A

N/A

Overall NPU Usage

ma_node_npu_general_util

NPU usage of Ascend AI processors (supported by driver version 24.1.RC2 and later)

%

0%–100%

N/A

N/A

N/A

AI Core Clock Frequency

ma_node_npu_ai_core_frequency_hertz

AI core clock frequency of Ascend AI processors

Hertz (Hz)

> 0

N/A

N/A

N/A

AI Processor Voltage

ma_node_npu_ai_core_voltage_volts

Voltage of Ascend AI processors

Volt (V)

Natural number

N/A

N/A

N/A

AI Processor DDR Memory

ma_node_npu_ddr_memory_bytes

Total DDR memory capacity of Ascend AI processors

Invalid metric for Snt9C.

Byte

> 0

N/A

N/A

N/A

AI Processor DDR Usage

ma_node_npu_ddr_memory_usage_bytes

DDR memory usage of Ascend AI processors

Byte

> 0

N/A

N/A

N/A

AI Processor DDR Memory Utilization

ma_node_npu_ddr_memory_util

DDR memory utilization of Ascend AI processors

%

0%–100%

Raw data > 90% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

AI Processor HBM Memory

ma_node_npu_hbm_bytes

Total HBM memory of Ascend AI processors (dedicated for Snt9 processors)

Byte

> 0

N/A

N/A

N/A

AI Processor HBM Memory Usage

ma_node_npu_hbm_usage_bytes

HBM memory usage of Ascend AI processors (dedicated for Snt9 processors)

Byte

> 0

N/A

N/A

N/A

AI Processor HBM Memory Utilization

ma_node_npu_hbm_util

HBM memory utilization of Ascend AI processors (dedicated for Snt9 processors)

%

0%–100%

Raw data > 97% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

AI Processor HBM Memory Bandwidth Utilization

ma_node_npu_hbm_bandwidth_util

HBM memory bandwidth utilization of Ascend AI processors (dedicated for Snt9 processors)

%

0%–100%

N/A

N/A

N/A

AI Processor HBM Memory Clock Frequency

ma_node_npu_hbm_frequency_hertz

HBM memory clock frequency of Ascend AI processors (dedicated for Snt9 processors)

Hertz (Hz)

> 0

N/A

N/A

N/A

AI Processor HBM Memory Temperature

ma_node_npu_hbm_temperature_celsius

HBM memory temperature of Ascend AI processors (dedicated for Snt9 processors)

°C

Natural number

N/A

N/A

N/A

AI CPU Utilization

ma_node_npu_ai_cpu_util

AI CPU utilization of Ascend AI processors

%

0%–100%

N/A

N/A

N/A

AI Processor Control CPU Utilization

ma_node_npu_ctrl_cpu_util

Control CPU utilization of Ascend AI processors

%

0%–100%

N/A

N/A

N/A

AI Processor Control CPU Frequency

ma_node_npu_ctrl_cpu_frequency_hertz

Control CPU frequency of Ascend AI processors (collected in system mode; available for dedicated resource pool users)

Hertz (Hz)

> 0

N/A

N/A

N/A

HBM ECC Detection Switch

ma_node_npu_hbm_ecc_enable

0 indicates that ECC detection is disabled. 1 indicates that ECC detection is enabled.

-

  • 1: enabled
  • 0: disabled

N/A

N/A

N/A

Current HBM Single-bit Errors

ma_node_npu_hbm_single_bit_error_total

Current number of HBM single-bit errors

Number

≥ 0

N/A

N/A

N/A

Current HBM Multi-bit Errors

ma_node_npu_hbm_double_bit_error_total

Current number of HBM multi-bit errors

Number

≥ 0

N/A

N/A

N/A

Total Single-bit Errors in the HBM Life Cycle

ma_node_npu_hbm_total_single_bit_error_total

Total number of single-bit errors in the HBM life cycle

Number

≥ 0

N/A

N/A

N/A

Total Multi-bit Errors in the HBM Life Cycle

ma_node_npu_hbm_total_double_bit_error_total

Total number of multi-bit errors in the HBM life cycle

Number

≥ 0

N/A

N/A

N/A

Isolated NPU Memory Pages with HBM Single-bit Errors

ma_node_npu_hbm_single_bit_isolated_pages_total

Number of isolated NPU memory pages with HBM single-bit errors

Number

≥ 0

N/A

N/A

N/A

Isolated NPU Memory Pages with HBM Multi-bit Errors

ma_node_npu_hbm_double_bit_isolated_pages_total

Number of isolated NPU memory pages with HBM multi-bit errors

Note: If there are more than 64 pages, replace the NPU.

Number

≥ 0

Raw data ≥ 64 for two consecutive periods

Critical

If there are more than 64 pages, submit a service ticket and replace the NPU server.

AI Vector Core Usage

ma_node_npu_vector_core_util

AI vector core usage of Ascend AI processors

%

0%–100%

N/A

N/A

N/A

NPU RoCE network

NPU RoCE Network Uplink Rate

ma_node_npu_roce_tx_rate_bytes_per_second

NPU RoCE network uplink rate

Bytes/s

≥ 0

N/A

N/A

N/A

NPU RoCE Network Downlink Rate

ma_node_npu_roce_rx_rate_bytes_per_second

NPU RoCE network downlink rate

Bytes/s

≥ 0

N/A

N/A

N/A

MAC Uplink Pause Frames

ma_node_npu_roce_mac_tx_pause_packets_total

Total number of pause frame packets sent by NPU RoCE network MAC

Number

≥ 0

N/A

N/A

N/A

MAC Downlink Pause Frames

ma_node_npu_roce_mac_rx_pause_packets_total

Total number of pause frame packets received by NPU RoCE network MAC

Number

≥ 0

N/A

N/A

N/A

MAC Uplink PFC Frames

ma_node_npu_roce_mac_tx_pfc_packets_total

Total number of PFC frame packets sent by NPU RoCE network MAC

Number

≥ 0

delta(ma_node_npu_roce_mac_tx_pause_packets_total[1m]) > 0

Major

Submit a service ticket.

MAC Downlink PFC Frames

ma_node_npu_roce_mac_rx_pfc_packets_total

Total number of PFC frame packets received by NPU RoCE network MAC

Number

≥ 0

delta(ma_node_npu_roce_mac_rx_pause_packets_total[1m]) > 0

Major

Submit a service ticket.

MAC Uplink Bad Packets

ma_node_npu_roce_mac_tx_bad_packets_total

Total number of bad packets sent by NPU RoCE network MAC

Number

≥ 0

delta(ma_node_npu_roce_mac_tx_pfc_packets_total[1m]) > 0

Major

Submit a service ticket.

MAC Downlink Bad Packets

ma_node_npu_roce_mac_rx_bad_packets_total

Total number of bad packets received by NPU RoCE network MAC

Number

≥ 0

delta(ma_node_npu_roce_mac_rx_pfc_packets_total[1m]) > 0

Major

Submit a service ticket.

RoCE Uplink Bad Packets

ma_node_npu_roce_tx_err_packets_total

Total number of bad packets sent by NPU RoCE

Number

≥ 0

delta(ma_node_npu_roce_mac_tx_bad_packets_total[1m]) > 0

Major

Submit a service ticket.

RoCE Downlink Bad Packets

ma_node_npu_roce_rx_err_packets_total

Total number of bad packets received by NPU RoCE

Number

≥ 0

delta(ma_node_npu_roce_mac_rx_bad_packets_total[1m]) > 0

Major

Submit a service ticket.

RoCE Uplink Packets

ma_node_npu_roce_tx_all_packets_total

Total number of packets sent by NPU RoCE

Number

≥ 0

delta(ma_node_npu_roce_tx_err_packets_total[1m]) > 0

Major

Submit a service ticket.

RoCE Downlink Packets

ma_node_npu_roce_rx_all_packets_total

Total number of packets received by NPU RoCE

Number

≥ 0

delta(ma_node_npu_roce_rx_err_packets_total[1m]) > 0

Major

Submit a service ticket.

NPU optical module (This metric is available for Snt9B/C air-cooled networking.)

Optical Module Temperature

ma_node_npu_optical_temperature

Optical module temperature

°C

≥ 0

N/A

N/A

N/A

Optical Module Power Voltage

ma_node_npu_optical_vcc

Power voltage of the optical module

Millivolt (mV)

≥ 0

N/A

N/A

N/A

Optical Module Transmit Power 0

ma_node_npu_optical_tx_power0

Transmit power 0 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Transmit Power 1

ma_node_npu_optical_tx_power1

Transmit power 1 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Transmit Power 2

ma_node_npu_optical_tx_power2

Transmit power 2 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Transmit Power 3

ma_node_npu_optical_tx_power3

Transmit power 3 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Receive Power 0

ma_node_npu_optical_rx_power0

Receive power 0 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Receive Power 1

ma_node_npu_optical_rx_power1

Receive power 1 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Receive Power 2

ma_node_npu_optical_rx_power2

Receive power 2 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Receive Power 3

ma_node_npu_optical_rx_power3

Receive power 3 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

InfiniBand or RoCE network

Total Amount of Data Received by a NIC

ma_node_infiniband_port_received_data_bytes_total

The total number of data octets, divided by 4 (counting in double words, 32 bits), received on all VLs from the port

Double words (32 bits)

≥ 0

N/A

N/A

N/A

Total Amount of Data Sent by a NIC

ma_node_infiniband_port_transmitted_data_bytes_total

The total number of data octets, divided by 4 (counting in double words, 32 bits), transmitted on all VLs from the port

Double words (32 bits)

≥ 0

N/A

N/A

N/A

NFS mounting status

NFS Getattr Congestion Time

ma_node_mountstats_getattr_backlog_wait

Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner. Backlog wait is the time that NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

ms

≥ 0

N/A

N/A

N/A

NFS Getattr Round Trip Time

ma_node_mountstats_getattr_rtt

Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner.

RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement of NFS latency. A high RTT can indicate network or server issues.

ms

≥ 0

N/A

N/A

N/A

NFS Access Congestion Time

ma_node_mountstats_access_backlog_wait

Access is an NFS operation that checks the access permissions of a file or directory for a given user. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

ms

≥ 0

N/A

N/A

N/A

NFS Access Round Trip Time

ma_node_mountstats_access_rtt

Access is an NFS operation that checks the access permissions of a file or directory for a given user. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement of NFS latency. A high RTT can indicate network or server issues.

ms

≥ 0

N/A

N/A

N/A

NFS Lookup Congestion Time

ma_node_mountstats_lookup_backlog_wait

Lookup is an NFS operation that resolves a file name in a directory to a file handle. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

ms

≥ 0

N/A

N/A

N/A

NFS Lookup Round Trip Time

ma_node_mountstats_lookup_rtt

Lookup is an NFS operation that resolves a file name in a directory to a file handle. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement of NFS latency. A high RTT can indicate network or server issues.

ms

≥ 0

N/A

N/A

N/A

NFS Read Congestion Time

ma_node_mountstats_read_backlog_wait

Read is an NFS operation that reads data from a file. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

ms

≥ 0

N/A

N/A

N/A

NFS Read Round Trip Time

ma_node_mountstats_read_rtt

Read is an NFS operation that reads data from a file. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement of NFS latency. A high RTT can indicate network or server issues.

ms

≥ 0

N/A

N/A

N/A

NFS Write Congestion Time

ma_node_mountstats_write_backlog_wait

Write is an NFS operation that writes data to a file. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

ms

≥ 0

N/A

N/A

N/A

NFS Write Round Trip Time

ma_node_mountstats_write_rtt

Write is an NFS operation that writes data to a file. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement of NFS latency. A high RTT can indicate network or server issues.

ms

≥ 0

N/A

N/A

N/A
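These backlog-wait and RTT metrics correspond to the per-operation statistics the Linux kernel exposes in `/proc/self/mountstats` for NFS mounts, where each operation line reports cumulative counts and cumulative times. As a rough sketch of how per-operation averages like the ones above can be derived, assuming the standard Linux per-op field layout (the function name and sample numbers are illustrative, not part of ModelArts):

```python
def parse_nfs_op_stats(line):
    """Parse one per-operation line from /proc/self/mountstats.

    Linux reports, per NFS operation: ops, transmissions, major timeouts,
    bytes sent, bytes received, cumulative queue (backlog) time, cumulative
    RTT, and cumulative execute time (the last three in milliseconds).
    """
    name, rest = line.strip().split(":")
    ops, trans, timeouts, sent, recv, queue_ms, rtt_ms, exec_ms = \
        [int(x) for x in rest.split()[:8]]
    avg_rtt = rtt_ms / ops if ops else 0.0        # per-op round trip time
    avg_backlog = queue_ms / ops if ops else 0.0  # per-op backlog wait
    return name, avg_rtt, avg_backlog

# Hypothetical counters: 5000 reads, 12500 ms total RTT, 1500 ms total queue time
name, rtt, backlog = parse_nfs_op_stats(
    "READ: 5000 5003 0 640000 81920000 1500 12500 14500")
print(name, rtt, backlog)  # READ 2.5 0.3
```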

Networking Metrics

Table 3 Diagnosis (InfiniBand, collected only in dedicated resource pools)

| Category | Name | Metric | Description | Unit | Value Range |
| --- | --- | --- | --- | --- | --- |
| InfiniBand or RoCE network | PortXmitData | infiniband_port_xmit_data_total | Total number of data octets, divided by 4 (counted in 32-bit double words), transmitted on all VLs from the port. | Total count | Natural number |
| | PortRcvData | infiniband_port_rcv_data_total | Total number of data octets, divided by 4 (counted in 32-bit double words), received on all VLs from the port. | Total count | Natural number |
| | SymbolErrorCounter | infiniband_symbol_error_counter_total | Total number of minor link errors detected on one or more physical lanes. | Total count | Natural number |
| | LinkErrorRecoveryCounter | infiniband_link_error_recovery_counter_total | Total number of times the Port Training state machine has successfully completed the link error recovery process. | Total count | Natural number |
| | PortRcvErrors | infiniband_port_rcv_errors_total | Total number of packets containing errors received on the port, including: local physical errors (ICRC, VCRC, LPCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine); malformed data packet errors (LVer, length, VL); malformed link packet errors (operand, length, VL); and packets discarded due to buffer overrun (overflow). | Total count | Natural number |
| | LocalLinkIntegrityErrors | infiniband_local_link_integrity_errors_total | Number of retries initiated by a link transfer layer receiver. | Total count | Natural number |
| | PortRcvRemotePhysicalErrors | infiniband_port_rcv_remote_physical_errors_total | Total number of packets marked with the EBP delimiter received on the port. | Total count | Natural number |
| | PortRcvSwitchRelayErrors | infiniband_port_rcv_switch_relay_errors_total | Total number of packets received on the port that were discarded because the switch relay could not forward them, due to: DLID mapping; VL mapping; or looping (output port = input port). | Total count | Natural number |
| | PortXmitWait | infiniband_port_transmit_wait_total | Number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or because of lack of arbitration). | Total count | Natural number |
| | PortXmitDiscards | infiniband_port_xmit_discards_total | Total number of outbound packets discarded by the port because the port is down or congested. | Total count | Natural number |
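Because PortXmitData and PortRcvData count 32-bit double words rather than bytes, a raw counter value must be multiplied by 4 to obtain octets; a throughput rate then follows from the difference between two samples. A minimal sketch (function names and figures are illustrative assumptions, not part of the metric API):

```python
def ib_counter_to_bytes(port_data_words):
    # PortXmitData/PortRcvData count 32-bit double words, so multiply by 4
    # to convert the counter value to octets (bytes).
    return port_data_words * 4

def ib_throughput_bytes_per_sec(sample_a, sample_b, interval_s):
    # Rate between two consecutive scrapes of a monotonically increasing
    # counter such as infiniband_port_xmit_data_total.
    return ib_counter_to_bytes(sample_b - sample_a) / interval_s

# Two hypothetical scrapes 10 s apart:
print(ib_throughput_bytes_per_sec(1_000_000, 26_000_000, 10))  # 10000000.0 B/s
```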

Label Metrics

Table 4 Metric labels

| Classification | Label | Description |
| --- | --- | --- |
| Container metrics | modelarts_service | Service to which a container belongs, which can be notebook, train, or infer |
| | instance_name | Name of the pod to which the container belongs |
| | service_id | Instance or job ID displayed on the page, for example, cf55829e-9bd3-48fa-8071-7ae870dae93a for a development environment or 9f322d5a-b1d2-4370-94df-5a87de27d36e for a training job |
| | node_ip | IP address of the node to which the container belongs |
| | container_id | Container ID |
| | cid | Cluster ID |
| | container_name | Container name |
| | project_id | Project ID of the account to which the user belongs |
| | user_id | User ID of the account that submitted the job |
| | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool |
| | pool_name | Name of the resource pool corresponding to a physical dedicated resource pool |
| | logical_pool_id | ID of a logical subpool |
| | logical_pool_name | Name of a logical subpool |
| | gpu_uuid | UUID of the GPU used by the container |
| | gpu_index | Index of the GPU used by the container |
| | gpu_type | Type of the GPU used by the container |
| | account_name | Account name of the creator of a training, inference, or development environment task |
| | user_name | Username of the creator of a training, inference, or development environment task |
| | task_creation_time | Time when a training, inference, or development environment task was created |
| | task_name | Name of a training, inference, or development environment task |
| | task_spec_code | Specifications of a training, inference, or development environment task |
| | cluster_name | CCE cluster name |
| Node metrics | cid | ID of the CCE cluster to which the node belongs |
| | node_ip | IP address of the node |
| | host_name | Hostname of the node |
| | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool |
| | project_id | Project ID of the user in a physical dedicated resource pool |
| | gpu_uuid | UUID of a node GPU |
| | gpu_index | Index of a node GPU |
| | gpu_type | Type of a node GPU |
| | device_name | Device name of an InfiniBand or RoCE network NIC |
| | port | Port number of the InfiniBand NIC |
| | physical_state | Status of each port on the InfiniBand NIC |
| | firmware_version | Firmware version of the InfiniBand NIC |
| | filesystem | NFS-mounted file system |
| | mount_point | NFS mount point |
| Diagnosis | cid | ID of the CCE cluster to which the node where the GPU resides belongs |
| | node_ip | IP address of the node where the GPU resides |
| | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool |
| | project_id | Project ID of the user in a physical dedicated resource pool |
| | gpu_uuid | GPU UUID |
| | gpu_index | Index of a node GPU |
| | gpu_type | Type of a node GPU |
| | device_name | Name of a network device or disk device |
| | port | Port number of the InfiniBand NIC |
| | physical_state | Status of each port on the InfiniBand NIC |
| | firmware_version | Firmware version of the InfiniBand NIC |
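When these metrics are pulled out of AOM through a Prometheus-compatible client, each sample arrives as a set of label/value pairs like the ones above, so labels such as pool_id can be used to slice or aggregate results. A minimal sketch of grouping container CPU usage by resource pool, assuming samples are already fetched as (labels, value) pairs (the helper name and figures are illustrative):

```python
from collections import defaultdict

def max_util_by_pool(samples):
    """Return the highest ma_container_cpu_util-style value per pool_id.

    `samples` is a list of (labels_dict, value) pairs, the shape most
    Prometheus client libraries expose after a query.
    """
    worst = defaultdict(float)
    for labels, value in samples:
        pool = labels.get("pool_id", "unknown")
        worst[pool] = max(worst[pool], value)
    return dict(worst)

samples = [
    ({"pool_id": "pool-a", "modelarts_service": "train"}, 72.0),
    ({"pool_id": "pool-a", "modelarts_service": "infer"}, 96.5),
    ({"pool_id": "pool-b", "modelarts_service": "notebook"}, 41.0),
]
print(max_util_by_pool(samples))  # {'pool-a': 96.5, 'pool-b': 41.0}
```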