Updated on 2024-11-11 GMT+08:00

Viewing Lite Cluster Monitoring Metrics on AOM

Monitoring Existing Metrics

ModelArts periodically collects the usage data of key resources (such as GPUs, NPUs, CPUs, and memory) for each node in the resource pool and reports this data to AOM. You can view the default basic metrics on AOM. The procedure is as follows:

  1. Log in to the console and search for AOM to go to the AOM console.
  2. Choose Monitoring > Metric Monitoring. On the Metric Monitoring page that is displayed, click Add Metric.
    Figure 1 Example
  3. Add a metric for query.
    • Add By: Select Dimension.
    • Metric Name: Click Custom Metrics and select the desired ones for query. For details, see Table 1 and Table 2.
    • Dimension: Enter the tag of the metric.
  4. Click Confirm. The metric information is displayed.
Table 1 Container metrics

| Classification | Name | Metric | Description | Unit | Value Range |
|---|---|---|---|---|---|
| CPU | CPU Usage | ma_container_cpu_util | CPU usage of a measured object | % | 0%–100% |
| CPU | Used CPU Cores | ma_container_cpu_used_core | Number of CPU cores used by a measured object | Core | ≥0 |
| CPU | Total CPU Cores | ma_container_cpu_limit_core | Total number of CPU cores that have been applied for a measured object | Core | ≥1 |
| Memory | Total Physical Memory | ma_container_memory_capacity_megabytes | Total physical memory that has been applied for a measured object | MB | ≥0 |
| Memory | Physical Memory Usage | ma_container_memory_util | Percentage of the used physical memory to the total physical memory | % | 0%–100% |
| Memory | Used Physical Memory | ma_container_memory_used_megabytes | Physical memory used by a measured object, that is, container_memory_working_set_bytes of the current working set (working-set memory covers active anonymous pages, cache, and file-backed pages, and is ≤ container_memory_usage_bytes) | MB | ≥0 |
| Storage | Disk Read Rate | ma_container_disk_read_kilobytes | Volume of data read from a disk per second | KB/s | ≥0 |
| Storage | Disk Write Rate | ma_container_disk_write_kilobytes | Volume of data written into a disk per second | KB/s | ≥0 |
| GPU memory | Total GPU Memory | ma_container_gpu_mem_total_megabytes | Total GPU memory of a training job | MB | >0 |
| GPU memory | GPU Memory Usage | ma_container_gpu_mem_util | Percentage of the used GPU memory to the total GPU memory | % | 0%–100% |
| GPU memory | Used GPU Memory | ma_container_gpu_mem_used_megabytes | GPU memory used by a measured object | MB | ≥0 |
| GPU | GPU Usage | ma_container_gpu_util | GPU usage of a measured object | % | 0%–100% |
| GPU | GPU Memory Bandwidth Usage | ma_container_gpu_mem_copy_util | GPU memory bandwidth usage of a measured object. For example, the maximum memory bandwidth of an NVIDIA Vnt1 GPU is 900 GB/s; if the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%. | % | 0%–100% |
| GPU | GPU Encoder Usage | ma_container_gpu_enc_util | GPU encoder usage of a measured object | % | 0%–100% |
| GPU | GPU Decoder Usage | ma_container_gpu_dec_util | GPU decoder usage of a measured object | % | 0%–100% |
| GPU | GPU Temperature | DCGM_FI_DEV_GPU_TEMP | GPU temperature | °C | Natural number |
| GPU | GPU Power | DCGM_FI_DEV_POWER_USAGE | GPU power | Watt (W) | >0 |
| GPU | GPU Memory Temperature | DCGM_FI_DEV_MEMORY_TEMP | GPU memory temperature | °C | Natural number |
| Network I/O | Downlink Rate | ma_container_network_receive_bytes | Inbound traffic rate of a measured object | Bytes/s | ≥0 |
| Network I/O | Packet Receive Rate | ma_container_network_receive_packets | Number of data packets received by a NIC per second | Packets/s | ≥0 |
| Network I/O | Downlink Error Rate | ma_container_network_receive_error_packets | Number of error packets received by a NIC per second | Packets/s | ≥0 |
| Network I/O | Uplink Rate | ma_container_network_transmit_bytes | Outbound traffic rate of a measured object | Bytes/s | ≥0 |
| Network I/O | Uplink Error Rate | ma_container_network_transmit_error_packets | Number of error packets sent by a NIC per second | Packets/s | ≥0 |
| Network I/O | Packet Send Rate | ma_container_network_transmit_packets | Number of data packets sent by a NIC per second | Packets/s | ≥0 |
| NPU | NPU Usage | ma_container_npu_util | NPU usage of a measured object (to be replaced by ma_container_npu_ai_core_util) | % | 0%–100% |
| NPU | NPU Memory Usage | ma_container_npu_memory_util | Percentage of the used NPU memory to the total NPU memory (to be replaced by ma_container_npu_ddr_memory_util for the snt3 series and ma_container_npu_hbm_util for the snt9 series) | % | 0%–100% |
| NPU | Used NPU Memory | ma_container_npu_memory_used_megabytes | NPU memory used by a measured object (to be replaced by ma_container_npu_ddr_memory_usage_bytes for the snt3 series and ma_container_npu_hbm_usage_bytes for the snt9 series) | MB | ≥0 |
| NPU | Total NPU Memory | ma_container_npu_memory_total_megabytes | Total NPU memory of a measured object (to be replaced by ma_container_npu_ddr_memory_bytes for the snt3 series and ma_container_npu_hbm_bytes for the snt9 series) | MB | >0 |
| NPU | AI Processor Error Codes | ma_container_npu_ai_core_error_code | Error codes of Ascend AI processors | N/A | N/A |
| NPU | AI Processor Health Status | ma_container_npu_ai_core_health_status | Health status of Ascend AI processors | N/A | 1: healthy; 0: unhealthy |
| NPU | AI Processor Power Consumption | ma_container_npu_ai_core_power_usage_watts | Power consumption of Ascend AI processors (processor power consumption for snt9 and snt3, and card power consumption for snt3P) | Watt (W) | >0 |
| NPU | AI Processor Temperature | ma_container_npu_ai_core_temperature_celsius | Temperature of Ascend AI processors | °C | Natural number |
| NPU | AI Core Usage | ma_container_npu_ai_core_util | AI core usage of Ascend AI processors | % | 0%–100% |
| NPU | AI Core Clock Frequency | ma_container_npu_ai_core_frequency_hertz | AI core clock frequency of Ascend AI processors | Hertz (Hz) | >0 |
| NPU | AI Processor Voltage | ma_container_npu_ai_core_voltage_volts | Voltage of Ascend AI processors | Volt (V) | Natural number |
| NPU | AI Processor DDR Memory | ma_container_npu_ddr_memory_bytes | Total DDR memory capacity of Ascend AI processors | Byte | >0 |
| NPU | AI Processor DDR Usage | ma_container_npu_ddr_memory_usage_bytes | DDR memory usage of Ascend AI processors | Byte | >0 |
| NPU | AI Processor DDR Memory Utilization | ma_container_npu_ddr_memory_util | DDR memory utilization of Ascend AI processors | % | 0%–100% |
| NPU | AI Processor HBM Memory | ma_container_npu_hbm_bytes | Total HBM memory of Ascend AI processors (dedicated for Ascend snt9 processors) | Byte | >0 |
| NPU | AI Processor HBM Memory Usage | ma_container_npu_hbm_usage_bytes | HBM memory usage of Ascend AI processors (dedicated for Ascend snt9 processors) | Byte | >0 |
| NPU | AI Processor HBM Memory Utilization | ma_container_npu_hbm_util | HBM memory utilization of Ascend AI processors (dedicated for Ascend snt9 processors) | % | 0%–100% |
| NPU | AI Processor HBM Memory Bandwidth Utilization | ma_container_npu_hbm_bandwidth_util | HBM memory bandwidth utilization of Ascend AI processors (dedicated for Ascend snt9 processors) | % | 0%–100% |
| NPU | AI Processor HBM Memory Clock Frequency | ma_container_npu_hbm_frequency_hertz | HBM memory clock frequency of Ascend AI processors (dedicated for Ascend snt9 processors) | Hertz (Hz) | >0 |
| NPU | AI Processor HBM Memory Temperature | ma_container_npu_hbm_temperature_celsius | HBM memory temperature of Ascend AI processors (dedicated for Ascend snt9 processors) | °C | Natural number |
| NPU | AI CPU Utilization | ma_container_npu_ai_cpu_util | AI CPU utilization of Ascend AI processors | % | 0%–100% |
| NPU | AI Processor Control CPU Utilization | ma_container_npu_ctrl_cpu_util | Control CPU utilization of Ascend AI processors | % | 0%–100% |
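The percentage arithmetic behind bandwidth and utilization metrics such as ma_container_gpu_mem_copy_util is a simple ratio; a minimal sketch of the Vnt1 example from the table (the function name is illustrative, not part of any ModelArts API):

```python
def bandwidth_util_percent(current_gb_per_s: float, peak_gb_per_s: float) -> float:
    """Memory bandwidth usage: current bandwidth as a percentage of the peak."""
    return current_gb_per_s / peak_gb_per_s * 100

# Vnt1 example from the table: 450 GB/s of a 900 GB/s peak is 50% usage.
print(bandwidth_util_percent(450, 900))  # 50.0
```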

Table 2 Node metrics

| Classification | Name | Metric | Description | Unit | Value Range |
|---|---|---|---|---|---|
| CPU | Total CPU Cores | ma_node_cpu_limit_core | Total number of CPU cores that have been applied for a measured object | Core | ≥1 |
| CPU | Used CPU Cores | ma_node_cpu_used_core | Number of CPU cores used by a measured object | Core | ≥0 |
| CPU | CPU Usage | ma_node_cpu_util | CPU usage of a measured object | % | 0%–100% |
| CPU | CPU I/O Wait Time | ma_node_cpu_iowait_counter | Disk I/O wait time accumulated since system startup | jiffies | ≥0 |
| Memory | Physical Memory Usage | ma_node_memory_util | Percentage of the used physical memory to the total physical memory | % | 0%–100% |
| Memory | Total Physical Memory | ma_node_memory_total_megabytes | Total physical memory that has been applied for a measured object | MB | ≥0 |
| Network I/O | Downlink Rate (BPS) | ma_node_network_receive_rate_bytes_seconds | Inbound traffic rate of a measured object | Bytes/s | ≥0 |
| Network I/O | Uplink Rate (BPS) | ma_node_network_transmit_rate_bytes_seconds | Outbound traffic rate of a measured object | Bytes/s | ≥0 |
| Storage | Disk Read Rate | ma_node_disk_read_rate_kilobytes_seconds | Volume of data read from a disk per second (only data disks used by containers are collected) | KB/s | ≥0 |
| Storage | Disk Write Rate | ma_node_disk_write_rate_kilobytes_seconds | Volume of data written into a disk per second (only data disks used by containers are collected) | KB/s | ≥0 |
| Storage | Total Cache | ma_node_cache_space_capacity_megabytes | Total cache of the Kubernetes space | MB | ≥0 |
| Storage | Used Cache | ma_node_cache_space_used_capacity_megabytes | Used cache of the Kubernetes space | MB | ≥0 |
| Storage | Total Container Space | ma_node_container_space_capacity_megabytes | Total container space | MB | ≥0 |
| Storage | Used Container Space | ma_node_container_space_used_capacity_megabytes | Used container space | MB | ≥0 |
| GPU | GPU Usage | ma_node_gpu_util | GPU usage of a measured object | % | 0%–100% |
| GPU | Total GPU Memory | ma_node_gpu_mem_total_megabytes | Total GPU memory of a measured object | MB | >0 |
| GPU | GPU Memory Usage | ma_node_gpu_mem_util | Percentage of the used GPU memory to the total GPU memory | % | 0%–100% |
| GPU | Used GPU Memory | ma_node_gpu_mem_used_megabytes | GPU memory used by a measured object | MB | ≥0 |
| GPU | Tasks on a Shared GPU | node_gpu_share_job_count | Number of tasks running on a shared GPU | Number | ≥0 |
| GPU | GPU Temperature | DCGM_FI_DEV_GPU_TEMP | GPU temperature | °C | Natural number |
| GPU | GPU Power | DCGM_FI_DEV_POWER_USAGE | GPU power | Watt (W) | >0 |
| GPU | GPU Memory Temperature | DCGM_FI_DEV_MEMORY_TEMP | GPU memory temperature | °C | Natural number |
| NPU | NPU Usage | ma_node_npu_util | NPU usage of a measured object (to be replaced by ma_node_npu_ai_core_util) | % | 0%–100% |
| NPU | NPU Memory Usage | ma_node_npu_memory_util | Percentage of the used NPU memory to the total NPU memory (to be replaced by ma_node_npu_ddr_memory_util for the snt3 series and ma_node_npu_hbm_util for the snt9 series) | % | 0%–100% |
| NPU | Used NPU Memory | ma_node_npu_memory_used_megabytes | NPU memory used by a measured object (to be replaced by ma_node_npu_ddr_memory_usage_bytes for the snt3 series and ma_node_npu_hbm_usage_bytes for the snt9 series) | MB | ≥0 |
| NPU | Total NPU Memory | ma_node_npu_memory_total_megabytes | Total NPU memory of a measured object (to be replaced by ma_node_npu_ddr_memory_bytes for the snt3 series and ma_node_npu_hbm_bytes for the snt9 series) | MB | >0 |
| NPU | AI Processor Error Codes | ma_node_npu_ai_core_error_code | Error codes of Ascend AI processors | N/A | N/A |
| NPU | AI Processor Health Status | ma_node_npu_ai_core_health_status | Health status of Ascend AI processors | N/A | 1: healthy; 0: unhealthy |
| NPU | AI Processor Power Consumption | ma_node_npu_ai_core_power_usage_watts | Power consumption of Ascend AI processors (processor power consumption for snt9 and snt3, and card power consumption for snt3P) | Watt (W) | >0 |
| NPU | AI Processor Temperature | ma_node_npu_ai_core_temperature_celsius | Temperature of Ascend AI processors | °C | Natural number |
| NPU | AI Processor Fan Speed | ma_node_npu_fan_speed_rpm | Fan speed of Ascend AI processors | RPM | Natural number |
| NPU | AI Core Usage | ma_node_npu_ai_core_util | AI core usage of Ascend AI processors | % | 0%–100% |
| NPU | AI Core Clock Frequency | ma_node_npu_ai_core_frequency_hertz | AI core clock frequency of Ascend AI processors | Hertz (Hz) | >0 |
| NPU | AI Processor Voltage | ma_node_npu_ai_core_voltage_volts | Voltage of Ascend AI processors | Volt (V) | Natural number |
| NPU | AI Processor DDR Memory | ma_node_npu_ddr_memory_bytes | Total DDR memory capacity of Ascend AI processors | Byte | >0 |
| NPU | AI Processor DDR Usage | ma_node_npu_ddr_memory_usage_bytes | DDR memory usage of Ascend AI processors | Byte | >0 |
| NPU | AI Processor DDR Memory Utilization | ma_node_npu_ddr_memory_util | DDR memory utilization of Ascend AI processors | % | 0%–100% |
| NPU | AI Processor HBM Memory | ma_node_npu_hbm_bytes | Total HBM memory of Ascend AI processors (dedicated for Ascend snt9 processors) | Byte | >0 |
| NPU | AI Processor HBM Memory Usage | ma_node_npu_hbm_usage_bytes | HBM memory usage of Ascend AI processors (dedicated for Ascend snt9 processors) | Byte | >0 |
| NPU | AI Processor HBM Memory Utilization | ma_node_npu_hbm_util | HBM memory utilization of Ascend AI processors (dedicated for Ascend snt9 processors) | % | 0%–100% |
| NPU | AI Processor HBM Memory Bandwidth Utilization | ma_node_npu_hbm_bandwidth_util | HBM memory bandwidth utilization of Ascend AI processors (dedicated for Ascend snt9 processors) | % | 0%–100% |
| NPU | AI Processor HBM Memory Clock Frequency | ma_node_npu_hbm_frequency_hertz | HBM memory clock frequency of Ascend AI processors (dedicated for Ascend snt9 processors) | Hertz (Hz) | >0 |
| NPU | AI Processor HBM Memory Temperature | ma_node_npu_hbm_temperature_celsius | HBM memory temperature of Ascend AI processors (dedicated for Ascend snt9 processors) | °C | Natural number |
| NPU | AI CPU Utilization | ma_node_npu_ai_cpu_util | AI CPU utilization of Ascend AI processors | % | 0%–100% |
| NPU | AI Processor Control CPU Utilization | ma_node_npu_ctrl_cpu_util | Control CPU utilization of Ascend AI processors | % | 0%–100% |
| InfiniBand or RoCE network | Total Amount of Data Received by a NIC | ma_node_infiniband_port_received_data_bytes_total | Total number of data octets received on all VLs of the port, divided by 4 (counted in 32-bit doublewords) | Doubleword (32 bits) | ≥0 |
| InfiniBand or RoCE network | Total Amount of Data Sent by a NIC | ma_node_infiniband_port_transmitted_data_bytes_total | Total number of data octets transmitted on all VLs of the port, divided by 4 (counted in 32-bit doublewords) | Doubleword (32 bits) | ≥0 |
| NFS mounting status | NFS Getattr Congestion Time | ma_node_mountstats_getattr_backlog_wait | Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner. Backlog wait is the time NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥0 |
| NFS mounting status | NFS Getattr Round Trip Time | ma_node_mountstats_getattr_rtt | Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply; it includes network transit time and server execution time and is a good measure of NFS latency. A high RTT can indicate network or server issues. | ms | ≥0 |
| NFS mounting status | NFS Access Congestion Time | ma_node_mountstats_access_backlog_wait | Access is an NFS operation that checks the access permissions of a file or directory for a given user. Backlog wait is the time NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥0 |
| NFS mounting status | NFS Access Round Trip Time | ma_node_mountstats_access_rtt | Access is an NFS operation that checks the access permissions of a file or directory for a given user. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply; it includes network transit time and server execution time and is a good measure of NFS latency. A high RTT can indicate network or server issues. | ms | ≥0 |
| NFS mounting status | NFS Lookup Congestion Time | ma_node_mountstats_lookup_backlog_wait | Lookup is an NFS operation that resolves a file name in a directory to a file handle. Backlog wait is the time NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥0 |
| NFS mounting status | NFS Lookup Round Trip Time | ma_node_mountstats_lookup_rtt | Lookup is an NFS operation that resolves a file name in a directory to a file handle. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply; it includes network transit time and server execution time and is a good measure of NFS latency. A high RTT can indicate network or server issues. | ms | ≥0 |
| NFS mounting status | NFS Read Congestion Time | ma_node_mountstats_read_backlog_wait | Read is an NFS operation that reads data from a file. Backlog wait is the time NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥0 |
| NFS mounting status | NFS Read Round Trip Time | ma_node_mountstats_read_rtt | Read is an NFS operation that reads data from a file. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply; it includes network transit time and server execution time and is a good measure of NFS latency. A high RTT can indicate network or server issues. | ms | ≥0 |
| NFS mounting status | NFS Write Congestion Time | ma_node_mountstats_write_backlog_wait | Write is an NFS operation that writes data to a file. Backlog wait is the time NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥0 |
| NFS mounting status | NFS Write Round Trip Time | ma_node_mountstats_write_rtt | Write is an NFS operation that writes data to a file. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request until it receives the reply; it includes network transit time and server execution time and is a good measure of NFS latency. A high RTT can indicate network or server issues. | ms | ≥0 |
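The InfiniBand port counters above are reported in 32-bit doublewords (the raw octet count divided by 4), so converting a sample back to bytes is a single multiplication; a minimal sketch (the helper name is illustrative):

```python
def doublewords_to_bytes(doublewords: int) -> int:
    """IB port data counters count 32-bit doublewords; 1 doubleword = 4 bytes."""
    return doublewords * 4

# A counter value of 1,000,000 doublewords corresponds to 4,000,000 bytes of traffic.
print(doublewords_to_bytes(1_000_000))  # 4000000
```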

Table 3 Metric dimensions

| Classification | Dimension | Description |
|---|---|---|
| Container metrics | pod_name | Name of the pod to which the container belongs |
| Container metrics | pod_id | ID of the pod to which the container belongs |
| Container metrics | node_ip | IP address of the node to which the container belongs |
| Container metrics | container_id | Container ID |
| Container metrics | cluster_id | Cluster ID |
| Container metrics | cluster_name | Cluster name |
| Container metrics | container_name | Name of the container |
| Container metrics | namespace | Namespace of the pod created by the user |
| Container metrics | app_kind | Value of the kind field in the first ownerReferences entry |
| Container metrics | app_id | Value of the uid field in the first ownerReferences entry |
| Container metrics | app_name | Value of the name field in the first ownerReferences entry |
| Container metrics | npu_id | Ascend card ID, for example, davinci0 (to be deprecated) |
| Container metrics | device_id | Physical ID of Ascend AI processors |
| Container metrics | device_type | Type of Ascend AI processors |
| Container metrics | pool_id | ID of a resource pool corresponding to a physical dedicated resource pool |
| Container metrics | pool_name | Name of a resource pool corresponding to a physical dedicated resource pool |
| Container metrics | gpu_uuid | UUID of the GPU used by the container |
| Container metrics | gpu_index | Index of the GPU used by the container |
| Container metrics | gpu_type | Type of the GPU used by the container |
| Node metrics | cluster_id | ID of the CCE cluster to which the node belongs |
| Node metrics | node_ip | IP address of the node |
| Node metrics | host_name | Hostname of the node |
| Node metrics | pool_id | ID of a resource pool corresponding to a physical dedicated resource pool |
| Node metrics | project_id | Project ID of the user in a physical dedicated resource pool |
| Node metrics | npu_id | Ascend card ID, for example, davinci0 (to be deprecated) |
| Node metrics | device_id | Physical ID of Ascend AI processors |
| Node metrics | device_type | Type of Ascend AI processors |
| Node metrics | gpu_uuid | UUID of a node GPU |
| Node metrics | gpu_index | Index of a node GPU |
| Node metrics | gpu_type | Type of a node GPU |
| Node metrics | device_name | Device name of an InfiniBand or RoCE network NIC |
| Node metrics | port | Port number of the IB NIC |
| Node metrics | physical_state | Status of each port on the IB NIC |
| Node metrics | firmware_version | Firmware version of the InfiniBand NIC |
| Node metrics | filesystem | NFS-mounted file system |
| Node metrics | mount_point | NFS mount point |
| Diagnosis | cluster_id | ID of the CCE cluster to which the node with the GPU equipped belongs |
| Diagnosis | node_ip | IP address of the node where the GPU resides |
| Diagnosis | pool_id | ID of a resource pool corresponding to a physical dedicated resource pool |
| Diagnosis | project_id | Project ID of the user in a physical dedicated resource pool |
| Diagnosis | gpu_uuid | GPU UUID |
| Diagnosis | gpu_index | Index of a node GPU |
| Diagnosis | gpu_type | Type of a node GPU |
| Diagnosis | device_name | Device name of an InfiniBand or RoCE network NIC |
| Diagnosis | port | Port number of the IB NIC |
| Diagnosis | physical_state | Status of each port on the IB NIC |
| Diagnosis | firmware_version | Firmware version of the InfiniBand NIC |

Monitoring Custom Metrics

ModelArts allows you to run commands to save custom metrics to AOM.

Constraints

  • ModelArts invokes the commands or HTTP APIs specified in the custom configuration every 10 seconds to retrieve metric data.
  • The size of the metric data text returned by these commands or HTTP APIs must not exceed 8 KB.

Collecting Custom Metric Data Using Commands

The following is an example of the YAML file for creating a pod for collecting custom metrics:

apiVersion: v1
kind: Pod
metadata:
  name: my-task
  annotations:
    # Replace the containerName and command parameters based on the container from
    # which metric data is obtained and the command used to obtain metric data.
    ei.huaweicloud.com/metrics: '{"customMetrics":[{"containerName":"my-task","exec":{"command":["cat","/metrics/task.prom"]}}]}'
spec:
  containers:
  - name: my-task
    image: my-task-image:latest   # Replace it with the actual image.

Note: The service workload and custom metric collection can share the same container. Alternatively, collect metric data in a sidecar container and designate it as the custom metric collection container, so that the resources of the service workload container remain unaffected.
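With the annotation above, the collection command simply reads /metrics/task.prom, so the workload (or sidecar) only needs to keep that file current. A minimal sketch, assuming Python is available in the container; the metric names, values, and output path are illustrative:

```python
import os
import time

def write_metrics(path: str, samples: dict) -> None:
    """Rewrite the metrics file atomically so the collection command never reads a partial file."""
    ts = int(time.time() * 1000)  # optional millisecond timestamp
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        for name, value in samples.items():
            f.write(f"{name} {value} {ts}\n")
    os.replace(tmp, path)  # atomic rename on POSIX

# Called periodically by the workload; ModelArts runs the configured command every 10 seconds.
write_metrics("/tmp/task.prom", {"task_queue_depth": 3, "task_batch_latency_ms": 12.5})
```

Writing to a temporary file and renaming avoids handing the collector a half-written file, and keeping the output small respects the 8 KB limit noted above.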

Data Format of Custom Metrics

Custom metric data must comply with the OpenMetrics specification. That is, each metric must use the following format:

<Metric name>{<Tag name>="<Tag value>", ...} <Sampled value> [Millisecond timestamp]

The following is an example (lines starting with # are optional comments):

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total gauge
http_requests_total{method="post",code="200"} 1656 1686660980680
http_requests_total{method="post",code="400"} 2 1686660980681
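A metric line can be sanity-checked against this format with a regular expression; the pattern below is a simplified approximation for illustration, not the full OpenMetrics grammar:

```python
import re

# <Metric name>{<Tag name>="<Tag value>", ...} <Sampled value> [Millisecond timestamp]
METRIC_LINE = re.compile(
    r'^[a-zA-Z_:][a-zA-Z0-9_:]*'                # metric name
    r'(\{[a-zA-Z_][a-zA-Z0-9_]*="[^"]*"'        # first tag (optional group)
    r'(,[a-zA-Z_][a-zA-Z0-9_]*="[^"]*")*\})?'   # further tags
    r' -?[0-9]+(\.[0-9]+)?'                     # sampled value
    r'( [0-9]+)?$'                              # optional millisecond timestamp
)

line = 'http_requests_total{method="post",code="200"} 1656 1686660980680'
print(bool(METRIC_LINE.match(line)))  # True
```

A line that fails this check (for example, one missing the sampled value) would be rejected by a strict parser, so validating before writing the metrics file can save debugging time.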