Updated on 2024-12-26 GMT+08:00

Viewing All ModelArts Monitoring Metrics on the AOM Console

ModelArts periodically collects usage data for key resources (such as GPUs, NPUs, CPUs, and memory) on each node in a resource pool, as well as key metrics for development environments, training jobs, and inference services, and reports the data to AOM. You can view the information on the AOM console.

Viewing Monitoring Metrics on the AOM Console

  1. Log in to the console and search for AOM to go to the AOM console.
  2. In the navigation pane on the left, choose Metric Browsing.
  3. Select the Prometheus_AOM_Default instance from the drop-down list.
    Figure 1 Specifying the metric source
  4. Select one or more metrics from All metrics or Prometheus statement.
    Figure 2 Adding a metric

    For details about how to view metrics, see Application Operations Management > User Guide (2.0) > Metric Browsing in the Huawei Cloud Help Center.
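Since AOM exposes a Prometheus-compatible query interface, the metrics in the tables below can also be fetched programmatically. The sketch below is illustrative only: the endpoint URL and the sample response are assumptions (the response follows the standard Prometheus instant-query JSON shape); replace the URL and add your instance's authentication before use.

```python
import json
from urllib.parse import urlencode

# Hypothetical Prometheus-compatible endpoint; replace with your AOM
# instance's actual query URL and attach your own credentials.
AOM_PROM_URL = "https://aom.example.com/api/v1/query"

def build_query_url(metric, selector=""):
    """Build a Prometheus instant-query URL for a ModelArts metric."""
    return AOM_PROM_URL + "?" + urlencode({"query": metric + selector})

# A sample response in the standard Prometheus instant-query shape.
sample_response = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"__name__": "ma_container_cpu_util",
                                 "pod_name": "train-0"},
                      "value": [1735190400, "42.5"]}]}}
""")

def extract_values(response):
    """Return {pod_name: value} pairs from an instant-query result."""
    return {r["metric"].get("pod_name", "?"): float(r["value"][1])
            for r in response["data"]["result"]}

print(build_query_url("ma_container_cpu_util"))
print(extract_values(sample_response))  # {'train-0': 42.5}
```

The same helper works for any metric name from the tables below, for example `ma_container_npu_util` or `ma_node_memory_util`.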

Container-level Metrics

Table 1 Container metrics

Category

Name

Metric

Description

Unit

Value Range

Alarm Threshold

Alarm Severity

Solution

CPU

CPU Usage

ma_container_cpu_util

CPU usage of a measured object

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Used CPU Cores

ma_container_cpu_used_core

Number of CPU cores used by a measured object

Cores

≥ 0

N/A

N/A

N/A

Total CPU Cores

ma_container_cpu_limit_core

Total number of CPU cores that have been requested for a measured object

Cores

≥ 1

N/A

N/A

N/A

Memory

Total Physical Memory

ma_container_memory_capacity_megabytes

Total physical memory that has been requested for a measured object

MB

≥ 0

N/A

N/A

N/A

Physical Memory Usage

ma_container_memory_util

Percentage of the used physical memory to the total physical memory

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Used Physical Memory

ma_container_memory_used_megabytes

Physical memory that has been used by a measured object (container_memory_working_set_bytes in the current working set)

(Memory usage in a working set = active anonymous pages and cache, plus file-backed pages; ≤ container_memory_usage_bytes)

MB

≥ 0

N/A

N/A

N/A

Storage

Disk Read Rate

ma_container_disk_read_kilobytes

Volume of data read from a disk per second

KB/s

≥ 0

N/A

N/A

N/A

Disk Write Rate

ma_container_disk_write_kilobytes

Volume of data written into a disk per second

KB/s

≥ 0

N/A

N/A

N/A

GPU memory

Total GPU Memory

ma_container_gpu_mem_total_megabytes

Total GPU memory of a training job

MB

> 0

N/A

N/A

N/A

GPU Memory Usage

ma_container_gpu_mem_util

Percentage of the used GPU memory to the total GPU memory

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Used GPU Memory

ma_container_gpu_mem_used_megabytes

GPU memory used by a measured object

MB

≥ 0

N/A

N/A

N/A

Idle GPU Memory

ma_container_gpu_mem_free_megabytes

Idle GPU memory of a measured object

MB

≥ 0

N/A

N/A

N/A

GPU

GPU usage

ma_container_gpu_util

GPU usage of a measured object

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

GPU Memory Bandwidth Usage

ma_container_gpu_mem_copy_util

GPU memory bandwidth usage of a measured object. For example, the maximum memory bandwidth of a Vnt1 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%.

%

0%–100%

N/A

N/A

N/A

GPU Encoder Usage

ma_container_gpu_enc_util

GPU encoder usage of a measured object

%

0%–100%

N/A

N/A

N/A

GPU Decoder Usage

ma_container_gpu_dec_util

GPU decoder usage of a measured object

%

0%–100%

N/A

N/A

N/A

GPU Temperature

DCGM_FI_DEV_GPU_TEMP

GPU temperature

°C

Natural number

N/A

N/A

N/A

GPU Power

DCGM_FI_DEV_POWER_USAGE

GPU power

Watt (W)

> 0

N/A

N/A

N/A

GPU Memory Temperature

DCGM_FI_DEV_MEMORY_TEMP

GPU memory temperature

°C

Natural number

N/A

N/A

N/A

Network I/O

Downlink Rate

ma_container_network_receive_bytes

Inbound traffic rate of a measured object

Bytes/s

≥ 0

N/A

N/A

N/A

Packet RX Rate

ma_container_network_receive_packets

Number of data packets received by an NIC per second

Packets/s

≥ 0

N/A

N/A

N/A

Downlink Error Rate

ma_container_network_receive_error_packets

Number of error packets received by a NIC per second

Packets/s

≥ 0

Raw data > 1 for two consecutive periods

Critical

Packet loss occurred on the network. Submit a service ticket and contact O&M support to locate the fault.

Uplink Rate

ma_container_network_transmit_bytes

Outbound traffic rate of a measured object

Bytes/s

≥ 0

N/A

N/A

N/A

Uplink Error Rate

ma_container_network_transmit_error_packets

Number of error packets sent by a NIC per second

Packets/s

≥ 0

Raw data > 1 for two consecutive periods

Critical

Packet loss occurred on the network. Submit a service ticket and contact O&M support to locate the fault.

Packet TX Rate

ma_container_network_transmit_packets

Number of data packets sent by a NIC per second

Packets/s

≥ 0

N/A

N/A

N/A

NPU

NPU Usage

ma_container_npu_util

NPU usage of a measured object (To be replaced by ma_container_npu_ai_core_util)

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

NPU Memory Usage

ma_container_npu_memory_util

Percentage of the used NPU memory to the total NPU memory (To be replaced by ma_container_npu_ddr_memory_util for Snt3 series, and ma_container_npu_hbm_util for Snt9 series)

%

0%–100%

Raw data > 98% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Used NPU Memory

ma_container_npu_memory_used_megabytes

NPU memory used by a measured object (To be replaced by ma_container_npu_ddr_memory_usage_bytes for Snt3 series, and ma_container_npu_hbm_usage_bytes for Snt9 series)

MB

≥ 0

N/A

N/A

N/A

Total NPU Memory

ma_container_npu_memory_total_megabytes

Total NPU memory of a measured object (To be replaced by ma_container_npu_ddr_memory_bytes for Snt3 series, and ma_container_npu_hbm_bytes for Snt9 series)

MB

> 0

N/A

N/A

N/A

AI Processor Error Codes

ma_container_npu_ai_core_error_code

Error codes of Ascend AI processors

-

-

Raw data > 0 for three consecutive periods

Critical

The card is abnormal. Submit a service ticket and contact O&M support.

AI Processor Health Status

ma_container_npu_ai_core_health_status

Health status of Ascend AI processors

-

  • 1: healthy
  • 0: unhealthy

The value is 0 for two consecutive periods.

Critical

The card is abnormal. Submit a service ticket and contact O&M support.

AI Processor Power Consumption

ma_container_npu_ai_core_power_usage_watts

Power consumption of Ascend AI processors

Watt (W)

> 0

N/A

N/A

N/A

AI Processor Temperature

ma_container_npu_ai_core_temperature_celsius

Temperature of Ascend AI processors

°C

Natural number

N/A

N/A

N/A

AI Core Usage

ma_container_npu_ai_core_util

AI core usage of Ascend AI processors

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Overall NPU Usage

ma_container_npu_general_util

NPU usage of Ascend AI processors (supported by driver version 24.1.RC2 and later)

%

0%–100%

N/A

N/A

N/A

AI Core Clock Frequency

ma_container_npu_ai_core_frequency_hertz

AI core clock frequency of Ascend AI processors

Hertz (Hz)

> 0

N/A

N/A

N/A

AI Processor Voltage

ma_container_npu_ai_core_voltage_volts

Voltage of Ascend AI processors

Volt (V)

Natural number

N/A

N/A

N/A

AI Processor DDR Memory

ma_container_npu_ddr_memory_bytes

Total DDR memory capacity of Ascend AI processors

Byte

> 0

N/A

N/A

N/A

AI Processor DDR Usage

ma_container_npu_ddr_memory_usage_bytes

DDR memory usage of Ascend AI processors

Byte

> 0

N/A

N/A

N/A

AI Processor DDR Memory Utilization

ma_container_npu_ddr_memory_util

DDR memory utilization of Ascend AI processors

Invalid metric for Snt9C.

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

AI Processor HBM Memory

ma_container_npu_hbm_bytes

Total HBM memory of Ascend AI processors (dedicated for Snt9 processors)

Byte

> 0

N/A

N/A

N/A

AI Processor HBM Memory Usage

ma_container_npu_hbm_usage_bytes

HBM memory usage of Ascend AI processors (dedicated for Snt9 processors)

Byte

> 0

N/A

N/A

N/A

AI Processor HBM Memory Utilization

ma_container_npu_hbm_util

HBM memory utilization of Ascend AI processors (dedicated for Snt9 processors)

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

AI Processor HBM Memory Bandwidth Utilization

ma_container_npu_hbm_bandwidth_util

HBM memory bandwidth utilization of Ascend AI processors (dedicated for Snt9 processors)

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

AI Processor HBM Memory Clock Frequency

ma_container_npu_hbm_frequency_hertz

HBM memory clock frequency of Ascend AI processors (dedicated for Snt9 processors)

Hertz (Hz)

> 0

N/A

N/A

N/A

AI Processor HBM Memory Temperature

ma_container_npu_hbm_temperature_celsius

HBM memory temperature of Ascend AI processors (dedicated for Snt9 processors)

°C

Natural number

N/A

N/A

N/A

AI CPU Utilization

ma_container_npu_ai_cpu_util

AI CPU utilization of Ascend AI processors

%

0%–100%

N/A

N/A

N/A

AI Processor Control CPU Utilization

ma_container_npu_ctrl_cpu_util

Control CPU utilization of Ascend AI processors

%

0%–100%

N/A

N/A

N/A

AI Processor Control CPU Frequency

ma_node_npu_ctrl_cpu_frequency_hertz

Control CPU frequency of Ascend AI processors (collected in system mode; dedicated resource pools only)

Hertz (Hz)

> 0

N/A

N/A

N/A

AI Vector Core Usage

ma_container_npu_vector_core_util

AI vector core usage of Ascend AI processors

%

0%–100%

Raw data > 95% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

NPU RoCE network

NPU RoCE Network Uplink Rate

ma_container_npu_roce_tx_rate_bytes_per_second

Uplink rate of the NPU network module used by the container

Bytes/s

≥ 0

N/A

N/A

N/A

NPU RoCE Network Downlink Rate

ma_container_npu_roce_rx_rate_bytes_per_second

Downlink rate of the NPU network module used by the container

Bytes/s

≥ 0

N/A

N/A

N/A

Notebook service metrics

Notebook Cache Directory Size

ma_container_notebook_cache_dir_size_bytes

A high-speed local disk is attached to the /cache directory for GPU and NPU notebook instances. This metric indicates the total size of the directory.

Bytes

≥ 0

N/A

N/A

N/A

Notebook Cache Directory Utilization

ma_container_notebook_cache_dir_util

A high-speed local disk is attached to the /cache directory for GPU and NPU notebook instances. This metric indicates the utilization of the directory.

%

0%–100%

Raw data > 90% for two consecutive periods

Major

If the disk usage is too high, the notebook instance will be restarted.
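Several alarm thresholds in the table above follow the pattern "raw data exceeds a limit for N consecutive collection periods". A minimal sketch of that evaluation logic (the function name and sample values are illustrative, not part of the ModelArts service):

```python
from collections import deque

def make_threshold_alarm(limit, periods=2):
    """Return a checker that fires when `periods` consecutive raw
    samples all exceed `limit` (e.g. CPU usage > 95% twice in a row)."""
    window = deque(maxlen=periods)
    def check(sample):
        window.append(sample)
        return len(window) == periods and all(v > limit for v in window)
    return check

# Mimic the ma_container_cpu_util rule: raw data > 95% for two periods.
cpu_alarm = make_threshold_alarm(95.0, periods=2)
samples = [90.0, 96.2, 97.8, 80.0]
fired = [cpu_alarm(s) for s in samples]
print(fired)  # [False, False, True, False]
```

Only the third sample fires, because it is the first time two consecutive samples both exceed the limit; a single spike is not enough.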

Node-level Metrics

Table 2 Node metrics (collected only in dedicated resource pools)

Category

Name

Metric

Description

Unit

Value Range

Alarm Threshold

Alarm Severity

Solution

CPU

Total CPU Cores

ma_node_cpu_limit_core

Total number of CPU cores that have been requested for a measured object

Cores

≥ 1

N/A

N/A

N/A

Used CPU Cores

ma_node_cpu_used_core

Number of CPU cores used by a measured object

Cores

≥ 0

N/A

N/A

N/A

CPU Usage

ma_node_cpu_util

CPU usage of a measured object

%

0%–100%

Raw data > 95% for two consecutive periods

Major

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

CPU I/O Wait Time

ma_node_cpu_iowait_counter

Disk I/O wait time accumulated since system startup

jiffies

≥ 0

N/A

N/A

N/A

Memory

Physical Memory Usage

ma_node_memory_util

Percentage of the used physical memory to the total physical memory

%

0%–100%

Raw data > 95% for two consecutive periods

Major

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Total Physical Memory

ma_node_memory_total_megabytes

Total physical memory that has been requested for a measured object

MB

≥ 0

N/A

N/A

N/A

Network I/O

Downlink Rate (BPS)

ma_node_network_receive_rate_bytes_seconds

Inbound traffic rate of a measured object

Bytes/s

≥ 0

N/A

N/A

N/A

Uplink Rate (BPS)

ma_node_network_transmit_rate_bytes_seconds

Outbound traffic rate of a measured object

Bytes/s

≥ 0

N/A

N/A

N/A

Storage

Disk Read Rate

ma_node_disk_read_rate_kilobytes_seconds

Volume of data read from a disk per second (Only data disks used by containers are collected.)

KB/s

≥ 0

N/A

N/A

N/A

Disk Write Rate

ma_node_disk_write_rate_kilobytes_seconds

Volume of data written into a disk per second (Only data disks used by containers are collected.)

KB/s

≥ 0

N/A

N/A

N/A

Total Cache

ma_node_cache_space_capacity_megabytes

Total cache of the Kubernetes space

MB

≥ 0

N/A

N/A

N/A

Used Cache

ma_node_cache_space_used_capacity_megabytes

Used cache of the Kubernetes space

MB

≥ 0

N/A

N/A

N/A

Cache Usage

ma_node_cache_space_used_percent

Cache usage of the Kubernetes space

%

≥ 0

Raw data > 90% for two consecutive periods

Critical

Check the disk in a timely manner to avoid affecting services. Clear invalid data on compute nodes.

Total Container Space

ma_node_container_space_capacity_megabytes

Total container space

MB

≥ 0

N/A

N/A

N/A

Used Container Space

ma_node_container_space_used_capacity_megabytes

Used container space

MB

≥ 0

N/A

N/A

N/A

Container Space Usage

ma_node_container_space_used_percent

Space usage of a container

%

≥ 0

Raw data > 90% for two consecutive periods

Critical

Check the disk in a timely manner to avoid affecting services. Clear invalid data on compute nodes.

Disk Information

ma_node_disk_info

Basic disk information

-

≥ 0

N/A

N/A

N/A

Total Reads

ma_node_disk_reads_completed_total

Total number of successful reads

-

≥ 0

N/A

N/A

N/A

Merged Reads

ma_node_disk_reads_merged_total

Number of merged reads

-

≥ 0

N/A

N/A

N/A

Bytes Read

ma_node_disk_read_bytes_total

Total number of bytes that are successfully read

Bytes

≥ 0

N/A

N/A

N/A

Read Time Spent

ma_node_disk_read_time_seconds_total

Time spent on all reads

Seconds

≥ 0

N/A

N/A

N/A

Total Writes

ma_node_disk_writes_completed_total

Total number of successful writes

-

≥ 0

N/A

N/A

N/A

Merged Writes

ma_node_disk_writes_merged_total

Number of merged writes

-

≥ 0

N/A

N/A

N/A

Written Bytes

ma_node_disk_written_bytes_total

Total number of bytes that are successfully written

Bytes

≥ 0

N/A

N/A

N/A

Write Time Spent

ma_node_disk_write_time_seconds_total

Time spent on all write operations

Seconds

≥ 0

N/A

N/A

N/A

Ongoing I/Os

ma_node_disk_io_now

Number of ongoing I/Os

-

≥ 0

N/A

N/A

N/A

I/O Execution Duration

ma_node_disk_io_time_seconds_total

Time spent on executing I/Os

Seconds

≥ 0

N/A

N/A

N/A

I/O Execution Weighted Time

ma_node_disk_io_time_weighted_seconds_total

Weighted time spent on executing I/Os

Seconds

≥ 0

N/A

N/A

N/A

GPU

GPU Usage

ma_node_gpu_util

GPU usage of a measured object

%

0%–100%

N/A

N/A

N/A

Total GPU Memory

ma_node_gpu_mem_total_megabytes

Total GPU memory of a measured object

MB

> 0

N/A

N/A

N/A

GPU Memory Usage

ma_node_gpu_mem_util

Percentage of the used GPU memory to the total GPU memory

%

0%–100%

Raw data > 97% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Used GPU Memory

ma_node_gpu_mem_used_megabytes

GPU memory used by a measured object

MB

≥ 0

N/A

N/A

N/A

Idle GPU Memory

ma_node_gpu_mem_free_megabytes

Idle GPU memory of a measured object

MB

> 0

N/A

N/A

N/A

Tasks on a Shared GPU

node_gpu_share_job_count

Number of tasks running on a shared GPU

Number

≥ 0

N/A

N/A

N/A

GPU Temperature

DCGM_FI_DEV_GPU_TEMP

GPU temperature

°C

Natural number

N/A

N/A

N/A

GPU Power

DCGM_FI_DEV_POWER_USAGE

GPU power

Watt (W)

> 0

N/A

N/A

N/A

GPU Memory Temperature

DCGM_FI_DEV_MEMORY_TEMP

GPU memory temperature

°C

Natural number

N/A

N/A

N/A

NPU

NPU Usage

ma_node_npu_util

NPU usage of a measured object (To be replaced by ma_node_npu_ai_core_util)

%

0%–100%

N/A

N/A

N/A

NPU Memory Usage

ma_node_npu_memory_util

Percentage of the used NPU memory to the total NPU memory (To be replaced by ma_node_npu_ddr_memory_util for Snt3 series, and ma_node_npu_hbm_util for Snt9 series)

%

0%–100%

Raw data > 97% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

Used NPU Memory

ma_node_npu_memory_used_megabytes

NPU memory used by a measured object (To be replaced by ma_node_npu_ddr_memory_usage_bytes for Snt3 series, and ma_node_npu_hbm_usage_bytes for Snt9 series)

MB

≥ 0

N/A

N/A

N/A

Total NPU Memory

ma_node_npu_memory_total_megabytes

Total NPU memory of a measured object (To be replaced by ma_node_npu_ddr_memory_bytes for Snt3 series, and ma_node_npu_hbm_bytes for Snt9 series)

MB

> 0

N/A

N/A

N/A

AI Processor Error Codes

ma_node_npu_ai_core_error_code

Error codes of Ascend AI processors

-

-

N/A

N/A

N/A

AI Processor Health Status

ma_node_npu_ai_core_health_status

Health status of Ascend AI processors

-

  • 1: healthy
  • 0: unhealthy

The value is 0 for two consecutive periods.

Critical

Submit a service ticket.

AI Processor Power Consumption

ma_node_npu_ai_core_power_usage_watts

Power consumption of Ascend AI processors

Watt (W)

> 0

N/A

N/A

N/A

AI Processor Temperature

ma_node_npu_ai_core_temperature_celsius

Temperature of Ascend AI processors

°C

Natural number

N/A

N/A

N/A

AI Core Usage

ma_node_npu_ai_core_util

AI core usage of Ascend AI processors

%

0%–100%

N/A

N/A

N/A

Overall NPU Usage

ma_node_npu_general_util

NPU usage of Ascend AI processors (supported by driver version 24.1.RC2 and later)

%

0%–100%

N/A

N/A

N/A

AI Core Clock Frequency

ma_node_npu_ai_core_frequency_hertz

AI core clock frequency of Ascend AI processors

Hertz (Hz)

> 0

N/A

N/A

N/A

AI Processor Voltage

ma_node_npu_ai_core_voltage_volts

Voltage of Ascend AI processors

Volt (V)

Natural number

N/A

N/A

N/A

AI Processor DDR Memory

ma_node_npu_ddr_memory_bytes

Total DDR memory capacity of Ascend AI processors

Invalid metric for Snt9C.

Byte

> 0

N/A

N/A

N/A

AI Processor DDR Usage

ma_node_npu_ddr_memory_usage_bytes

DDR memory usage of Ascend AI processors

Byte

> 0

N/A

N/A

N/A

AI Processor DDR Memory Utilization

ma_node_npu_ddr_memory_util

DDR memory utilization of Ascend AI processors

%

0%–100%

Raw data > 90% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

AI Processor HBM Memory

ma_node_npu_hbm_bytes

Total HBM memory of Ascend AI processors (dedicated for Snt9 processors)

Byte

> 0

N/A

N/A

N/A

AI Processor HBM Memory Usage

ma_node_npu_hbm_usage_bytes

HBM memory usage of Ascend AI processors (dedicated for Snt9 processors)

Byte

> 0

N/A

N/A

N/A

AI Processor HBM Memory Utilization

ma_node_npu_hbm_util

HBM memory utilization of Ascend AI processors (dedicated for Snt9 processors)

%

0%–100%

Raw data > 97% for two consecutive periods

Suggestion

Check whether the service resource usage meets the expectation. If the service is normal, no action is required.

AI Processor HBM Memory Bandwidth Utilization

ma_node_npu_hbm_bandwidth_util

HBM memory bandwidth utilization of Ascend AI processors (dedicated for Snt9 processors)

%

0%–100%

N/A

N/A

N/A

AI Processor HBM Memory Clock Frequency

ma_node_npu_hbm_frequency_hertz

HBM memory clock frequency of Ascend AI processors (dedicated for Snt9 processors)

Hertz (Hz)

> 0

N/A

N/A

N/A

AI Processor HBM Memory Temperature

ma_node_npu_hbm_temperature_celsius

HBM memory temperature of Ascend AI processors (dedicated for Snt9 processors)

°C

Natural number

N/A

N/A

N/A

AI CPU Utilization

ma_node_npu_ai_cpu_util

AI CPU utilization of Ascend AI processors

%

0%–100%

N/A

N/A

N/A

AI Processor Control CPU Utilization

ma_node_npu_ctrl_cpu_util

Control CPU utilization of Ascend AI processors

%

0%–100%

N/A

N/A

N/A

AI Processor Control CPU Frequency

ma_node_npu_ctrl_cpu_frequency_hertz

Control CPU frequency of Ascend AI processors (collected in system mode; available for dedicated resource pool users)

Hertz (Hz)

> 0

N/A

N/A

N/A

HBM ECC Detection Switch

ma_node_npu_hbm_ecc_enable

0 indicates that ECC detection is disabled. 1 indicates that ECC detection is enabled.

-

  • 1: enabled
  • 0: disabled

N/A

N/A

N/A

Current HBM Single-bit Errors

ma_node_npu_hbm_single_bit_error_total

Current number of HBM single-bit errors

Number

≥ 0

N/A

N/A

N/A

Current HBM Multi-bit Errors

ma_node_npu_hbm_double_bit_error_total

Current number of HBM multi-bit errors

Number

≥ 0

N/A

N/A

N/A

Total Single-bit Errors in the HBM Life Cycle

ma_node_npu_hbm_total_single_bit_error_total

Total number of single-bit errors in the HBM life cycle

Number

≥ 0

N/A

N/A

N/A

Total Multi-bit Errors in the HBM Life Cycle

ma_node_npu_hbm_total_double_bit_error_total

Total number of multi-bit errors in the HBM life cycle

Number

≥ 0

N/A

N/A

N/A

Isolated NPU Memory Pages with HBM Single-bit Errors

ma_node_npu_hbm_single_bit_isolated_pages_total

Number of isolated NPU memory pages with HBM single-bit errors

Number

≥ 0

N/A

N/A

N/A

Isolated NPU Memory Pages with HBM Multi-bit Errors

ma_node_npu_hbm_double_bit_isolated_pages_total

Number of isolated NPU memory pages with HBM multi-bit errors

Note: If there are more than 64 pages, replace the NPU.

Number

≥ 0

Raw data ≥ 64 for two consecutive periods

Critical

If there are more than 64 pages, submit a service ticket and replace the NPU server.

AI Vector Core Usage

ma_node_npu_vector_core_util

AI vector core usage of Ascend AI processors

%

0%–100%

N/A

N/A

N/A

NPU RoCE network

NPU RoCE Network Uplink Rate

ma_node_npu_roce_tx_rate_bytes_per_second

NPU RoCE network uplink rate

Bytes/s

≥ 0

N/A

N/A

N/A

NPU RoCE Network Downlink Rate

ma_node_npu_roce_rx_rate_bytes_per_second

NPU RoCE network downlink rate

Bytes/s

≥ 0

N/A

N/A

N/A

MAC Uplink Pause Frames

ma_node_npu_roce_mac_tx_pause_packets_total

Total number of pause frame packets sent by NPU RoCE network MAC

Number

≥ 0

N/A

N/A

N/A

MAC Downlink Pause Frames

ma_node_npu_roce_mac_rx_pause_packets_total

Total number of pause frame packets received by NPU RoCE network MAC

Number

≥ 0

N/A

N/A

N/A

MAC Uplink PFC Frames

ma_node_npu_roce_mac_tx_pfc_packets_total

Total number of PFC frame packets sent by NPU RoCE network MAC

Number

≥ 0

delta(ma_node_npu_roce_mac_tx_pause_packets_total[1m]) > 0

Major

Submit a service ticket.

MAC Downlink PFC Frames

ma_node_npu_roce_mac_rx_pfc_packets_total

Total number of PFC frame packets received by NPU RoCE network MAC

Number

≥ 0

delta(ma_node_npu_roce_mac_rx_pause_packets_total[1m]) > 0

Major

Submit a service ticket.

MAC Uplink Bad Packets

ma_node_npu_roce_mac_tx_bad_packets_total

Total number of bad packets sent by NPU RoCE network MAC

Number

≥ 0

delta(ma_node_npu_roce_mac_tx_pfc_packets_total[1m]) > 0

Major

Submit a service ticket.

MAC Downlink Bad Packets

ma_node_npu_roce_mac_rx_bad_packets_total

Total number of bad packets received by NPU RoCE network MAC

Number

≥ 0

delta(ma_node_npu_roce_mac_rx_pfc_packets_total[1m]) > 0

Major

Submit a service ticket.

RoCE Uplink Bad Packets

ma_node_npu_roce_tx_err_packets_total

Total number of bad packets sent by NPU RoCE

Number

≥ 0

delta(ma_node_npu_roce_mac_tx_bad_packets_total[1m]) > 0

Major

Submit a service ticket.

RoCE Downlink Bad Packets

ma_node_npu_roce_rx_err_packets_total

Total number of bad packets received by NPU RoCE

Number

≥ 0

delta(ma_node_npu_roce_mac_rx_bad_packets_total[1m]) > 0

Major

Submit a service ticket.

RoCE Uplink Packets

ma_node_npu_roce_tx_all_packets_total

Total number of packets sent by NPU RoCE

Number

≥ 0

delta(ma_node_npu_roce_tx_err_packets_total[1m]) > 0

Major

Submit a service ticket.

RoCE Downlink Packets

ma_node_npu_roce_rx_all_packets_total

Total number of packets received by NPU RoCE

Number

≥ 0

delta(ma_node_npu_roce_rx_err_packets_total[1m]) > 0

Major

Submit a service ticket.

NPU optical module (This metric is available for Snt9B/C air-cooled networking.)

Optical Module Temperature

ma_node_npu_optical_temperature

Optical module temperature

°C

≥ 0

N/A

N/A

N/A

Optical Module Power Voltage

ma_node_npu_optical_vcc

Power voltage of the optical module

Millivolt (mV)

≥ 0

N/A

N/A

N/A

Optical Module Transmit Power 0

ma_node_npu_optical_tx_power0

Transmit power 0 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Transmit Power 1

ma_node_npu_optical_tx_power1

Transmit power 1 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Transmit Power 2

ma_node_npu_optical_tx_power2

Transmit power 2 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Transmit Power 3

ma_node_npu_optical_tx_power3

Transmit power 3 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Receive Power 0

ma_node_npu_optical_rx_power0

Receive power 0 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Receive Power 1

ma_node_npu_optical_rx_power1

Receive power 1 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Receive Power 2

ma_node_npu_optical_rx_power2

Receive power 2 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

Optical Module Receive Power 3

ma_node_npu_optical_rx_power3

Receive power 3 of the optical module

Milliwatt (mW)

≥ 0

N/A

N/A

N/A

InfiniBand or RoCE network

Total Amount of Data Received by a NIC

ma_node_infiniband_port_received_data_bytes_total

The total number of data octets, divided by 4 (counting in double words, 32 bits), received on all VLs from the port

Double words (32 bits)

≥ 0

N/A

N/A

N/A

Total Amount of Data Sent by a NIC

ma_node_infiniband_port_transmitted_data_bytes_total

The total number of data octets, divided by 4 (counting in double words, 32 bits), transmitted on all VLs from the port

Double words (32 bits)

≥ 0

N/A

N/A

N/A

NFS mounting status

NFS Getattr Congestion Time

ma_node_mountstats_getattr_backlog_wait

Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner. Backlog wait is the time that NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

ms

≥ 0

N/A

N/A

N/A

NFS Getattr Round Trip Time

ma_node_mountstats_getattr_rtt

Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner.

RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement of NFS latency. A high RTT can indicate network or server issues.

ms

≥ 0

N/A

N/A

N/A

NFS Access Congestion Time

ma_node_mountstats_access_backlog_wait

Access is an NFS operation that checks the access permissions of a file or directory for a given user. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

ms

≥ 0

N/A

N/A

N/A

NFS Access Round Trip Time

ma_node_mountstats_access_rtt

Access is an NFS operation that checks the access permissions of a file or directory for a given user. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement of NFS latency. A high RTT can indicate network or server issues.

ms

≥ 0

N/A

N/A

N/A

NFS Lookup Congestion Time

ma_node_mountstats_lookup_backlog_wait

Lookup is an NFS operation that resolves a file name in a directory to a file handle. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

ms

≥ 0

N/A

N/A

N/A

NFS Lookup Round Trip Time

ma_node_mountstats_lookup_rtt

Lookup is an NFS operation that resolves a file name in a directory to a file handle. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement of NFS latency. A high RTT can indicate network or server issues.

ms

≥ 0

N/A

N/A

N/A

NFS Read Congestion Time

ma_node_mountstats_read_backlog_wait

Read is an NFS operation that reads data from a file. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

ms

≥ 0

N/A

N/A

N/A

NFS Read Round Trip Time

ma_node_mountstats_read_rtt

Read is an NFS operation that reads data from a file. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement of NFS latency. A high RTT can indicate network or server issues.

ms

≥ 0

N/A

N/A

N/A

NFS Write Congestion Time

ma_node_mountstats_write_backlog_wait

Write is an NFS operation that writes data to a file. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

ms

≥ 0

N/A

N/A

N/A

NFS Write Round Trip Time

ma_node_mountstats_write_rtt

Write is an NFS operation that writes data to a file. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement of NFS latency. A high RTT can indicate network or server issues.

ms

≥ 0

N/A

N/A

N/A
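These backlog-wait and RTT metrics correspond to the per-operation statistics the Linux kernel exposes in `/proc/self/mountstats` for NFS mounts, where each operation line reports cumulative counts and cumulative times. As a rough sketch of how per-operation averages like the ones above can be derived, assuming the standard Linux per-op field layout (the function name and sample numbers are illustrative, not part of ModelArts):

```python
def parse_nfs_op_stats(line):
    """Parse one per-operation line from /proc/self/mountstats.

    Linux reports, per NFS operation: ops, transmissions, major timeouts,
    bytes sent, bytes received, cumulative queue (backlog) time, cumulative
    RTT, and cumulative execute time (the last three in milliseconds).
    """
    name, rest = line.strip().split(":")
    ops, trans, timeouts, sent, recv, queue_ms, rtt_ms, exec_ms = \
        [int(x) for x in rest.split()[:8]]
    avg_rtt = rtt_ms / ops if ops else 0.0        # per-op round trip time
    avg_backlog = queue_ms / ops if ops else 0.0  # per-op backlog wait
    return name, avg_rtt, avg_backlog

# Hypothetical counters: 5000 reads, 12500 ms total RTT, 1500 ms total queue time
name, rtt, backlog = parse_nfs_op_stats(
    "READ: 5000 5003 0 640000 81920000 1500 12500 14500")
print(name, rtt, backlog)  # READ 2.5 0.3
```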

Networking Metrics

Table 3 Diagnosis (InfiniBand, collected only in dedicated resource pools)

| Category | Name | Metric | Description | Unit | Value Range |
| --- | --- | --- | --- | --- | --- |
| InfiniBand or RoCE network | PortXmitData | infiniband_port_xmit_data_total | Total number of data octets, divided by 4 (counted in 32-bit double words), transmitted on all VLs from the port. | Total count | Natural number |
| | PortRcvData | infiniband_port_rcv_data_total | Total number of data octets, divided by 4 (counted in 32-bit double words), received on all VLs from the port. | Total count | Natural number |
| | SymbolErrorCounter | infiniband_symbol_error_counter_total | Total number of minor link errors detected on one or more physical lanes. | Total count | Natural number |
| | LinkErrorRecoveryCounter | infiniband_link_error_recovery_counter_total | Total number of times the Port Training state machine has successfully completed the link error recovery process. | Total count | Natural number |
| | PortRcvErrors | infiniband_port_rcv_errors_total | Total number of packets containing errors received on the port, including: local physical errors (ICRC, VCRC, LPCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine); malformed data packet errors (LVer, length, VL); malformed link packet errors (operand, length, VL); and packets discarded due to buffer overrun (overflow). | Total count | Natural number |
| | LocalLinkIntegrityErrors | infiniband_local_link_integrity_errors_total | Number of retries initiated by a link transfer layer receiver. | Total count | Natural number |
| | PortRcvRemotePhysicalErrors | infiniband_port_rcv_remote_physical_errors_total | Total number of packets marked with the EBP delimiter received on the port. | Total count | Natural number |
| | PortRcvSwitchRelayErrors | infiniband_port_rcv_switch_relay_errors_total | Total number of packets received on the port that were discarded because the switch relay could not forward them, due to: DLID mapping; VL mapping; or looping (output port = input port). | Total count | Natural number |
| | PortXmitWait | infiniband_port_transmit_wait_total | Number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or because of lack of arbitration). | Total count | Natural number |
| | PortXmitDiscards | infiniband_port_xmit_discards_total | Total number of outbound packets discarded by the port because the port is down or congested. | Total count | Natural number |
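Because PortXmitData and PortRcvData count 32-bit double words rather than bytes, a raw counter value must be multiplied by 4 to obtain octets; a throughput rate then follows from the difference between two samples. A minimal sketch (function names and figures are illustrative assumptions, not part of the metric API):

```python
def ib_counter_to_bytes(port_data_words):
    # PortXmitData/PortRcvData count 32-bit double words, so multiply by 4
    # to convert the counter value to octets (bytes).
    return port_data_words * 4

def ib_throughput_bytes_per_sec(sample_a, sample_b, interval_s):
    # Rate between two consecutive scrapes of a monotonically increasing
    # counter such as infiniband_port_xmit_data_total.
    return ib_counter_to_bytes(sample_b - sample_a) / interval_s

# Two hypothetical scrapes 10 s apart:
print(ib_throughput_bytes_per_sec(1_000_000, 26_000_000, 10))  # 10000000.0 B/s
```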

Label Metrics

Table 4 Metric labels

| Classification | Label | Description |
| --- | --- | --- |
| Container metrics | modelarts_service | Service to which a container belongs, which can be notebook, train, or infer |
| | instance_name | Name of the pod to which the container belongs |
| | service_id | Instance or job ID displayed on the page, for example, cf55829e-9bd3-48fa-8071-7ae870dae93a for a development environment or 9f322d5a-b1d2-4370-94df-5a87de27d36e for a training job |
| | node_ip | IP address of the node to which the container belongs |
| | container_id | Container ID |
| | cid | Cluster ID |
| | container_name | Container name |
| | project_id | Project ID of the account to which the user belongs |
| | user_id | User ID of the account that submitted the job |
| | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool |
| | pool_name | Name of the resource pool corresponding to a physical dedicated resource pool |
| | logical_pool_id | ID of a logical subpool |
| | logical_pool_name | Name of a logical subpool |
| | gpu_uuid | UUID of the GPU used by the container |
| | gpu_index | Index of the GPU used by the container |
| | gpu_type | Type of the GPU used by the container |
| | account_name | Account name of the creator of a training, inference, or development environment task |
| | user_name | Username of the creator of a training, inference, or development environment task |
| | task_creation_time | Time when a training, inference, or development environment task was created |
| | task_name | Name of a training, inference, or development environment task |
| | task_spec_code | Specifications of a training, inference, or development environment task |
| | cluster_name | CCE cluster name |
| Node metrics | cid | ID of the CCE cluster to which the node belongs |
| | node_ip | IP address of the node |
| | host_name | Hostname of the node |
| | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool |
| | project_id | Project ID of the user in a physical dedicated resource pool |
| | gpu_uuid | UUID of a node GPU |
| | gpu_index | Index of a node GPU |
| | gpu_type | Type of a node GPU |
| | device_name | Device name of an InfiniBand or RoCE network NIC |
| | port | Port number of the InfiniBand NIC |
| | physical_state | Status of each port on the InfiniBand NIC |
| | firmware_version | Firmware version of the InfiniBand NIC |
| | filesystem | NFS-mounted file system |
| | mount_point | NFS mount point |
| Diagnosis | cid | ID of the CCE cluster to which the node where the GPU resides belongs |
| | node_ip | IP address of the node where the GPU resides |
| | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool |
| | project_id | Project ID of the user in a physical dedicated resource pool |
| | gpu_uuid | GPU UUID |
| | gpu_index | Index of a node GPU |
| | gpu_type | Type of a node GPU |
| | device_name | Name of a network device or disk device |
| | port | Port number of the InfiniBand NIC |
| | physical_state | Status of each port on the InfiniBand NIC |
| | firmware_version | Firmware version of the InfiniBand NIC |
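When these metrics are pulled out of AOM through a Prometheus-compatible client, each sample arrives as a set of label/value pairs like the ones above, so labels such as pool_id can be used to slice or aggregate results. A minimal sketch of grouping container CPU usage by resource pool, assuming samples are already fetched as (labels, value) pairs (the helper name and figures are illustrative):

```python
from collections import defaultdict

def max_util_by_pool(samples):
    """Return the highest ma_container_cpu_util-style value per pool_id.

    `samples` is a list of (labels_dict, value) pairs, the shape most
    Prometheus client libraries expose after a query.
    """
    worst = defaultdict(float)
    for labels, value in samples:
        pool = labels.get("pool_id", "unknown")
        worst[pool] = max(worst[pool], value)
    return dict(worst)

samples = [
    ({"pool_id": "pool-a", "modelarts_service": "train"}, 72.0),
    ({"pool_id": "pool-a", "modelarts_service": "infer"}, 96.5),
    ({"pool_id": "pool-b", "modelarts_service": "notebook"}, 41.0),
]
print(max_util_by_pool(samples))  # {'pool-a': 96.5, 'pool-b': 41.0}
```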