Updated on 2024-06-11 GMT+08:00

Viewing All ModelArts Monitoring Metrics on the AOM Console

ModelArts periodically collects key resource metrics (such as GPU, NPU, CPU, and memory usage) for each node in a resource pool, as well as for development environments, training jobs, and inference services, and reports the data to AOM. You can view these metrics on the AOM console.

  1. Log in to the console and search for AOM to go to the AOM console.
  2. Choose Metric Monitoring. On the Metric Monitoring page that is displayed, click Add Metric.

  3. Add metrics and click Add to Metric List.

    • Add By: Select All Metrics.
    • Metric Name: Select the desired ones for query. For details, see Table 1, Table 2, and Table 3.
    • Scope: Enter the tags used to filter the metric. For details, see Table 4.

  4. View the metrics.
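
To illustrate how a metric name combines with the scope tags in Table 4, the following minimal sketch formats them as a Prometheus-style selector string. It is an assumption that the query entry point you use accepts this selector syntax, and build_selector is a hypothetical helper, not part of ModelArts or AOM. The metric name, tag names, and sample job ID come from the tables on this page.

```python
def build_selector(metric: str, tags: dict[str, str]) -> str:
    """Return a Prometheus-style selector such as metric{tag="value"}.

    Hypothetical helper for illustration only; it is not an AOM API.
    """
    if not tags:
        return metric

    def escape(value: str) -> str:
        # Escape backslashes and double quotes inside tag values.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    # Sort the tags so that the generated string is deterministic.
    parts = [f'{name}="{escape(value)}"' for name, value in sorted(tags.items())]
    return f"{metric}{{{','.join(parts)}}}"


# CPU usage of the containers that belong to one training job.
query = build_selector(
    "ma_container_cpu_util",
    {
        "modelarts_service": "train",
        "service_id": "9f322d5a-b1d2-4370-94df-5a87de27d36e",
    },
)
print(query)
```

Adding more tags, for example node_ip or pool_id, narrows the scope further.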

Table 1 Container metrics

| Classification | Name | Metric | Description | Unit | Value Range |
| --- | --- | --- | --- | --- | --- |
| CPU | CPU Usage | ma_container_cpu_util | CPU usage of a measured object | % | 0%–100% |
| CPU | Used CPU Cores | ma_container_cpu_used_core | Number of CPU cores used by a measured object | Cores | ≥ 0 |
| CPU | Total CPU Cores | ma_container_cpu_limit_core | Total number of CPU cores requested for a measured object | Cores | ≥ 1 |
| Memory | Total Physical Memory | ma_container_memory_capacity_megabytes | Total physical memory requested for a measured object | MB | ≥ 0 |
| Memory | Physical Memory Usage | ma_container_memory_util | Percentage of the used physical memory to the total physical memory | % | 0%–100% |
| Memory | Used Physical Memory | ma_container_memory_used_megabytes | Physical memory used by a measured object, reported as the current working set (container_memory_working_set_bytes). Working set memory = active anonymous pages and cache, plus file-backed pages; it is ≤ container_memory_usage_bytes. | MB | ≥ 0 |
| Storage | Disk Read Rate | ma_container_disk_read_kilobytes | Volume of data read from a disk per second | KB/s | ≥ 0 |
| Storage | Disk Write Rate | ma_container_disk_write_kilobytes | Volume of data written to a disk per second | KB/s | ≥ 0 |
| GPU memory | Total GPU Memory | ma_container_gpu_mem_total_megabytes | Total GPU memory of a training job | MB | > 0 |
| GPU memory | GPU Memory Usage | ma_container_gpu_mem_util | Percentage of the used GPU memory to the total GPU memory | % | 0%–100% |
| GPU memory | Used GPU Memory | ma_container_gpu_mem_used_megabytes | GPU memory used by a measured object | MB | ≥ 0 |
| GPU | GPU Usage | ma_container_gpu_util | GPU usage of a measured object | % | 0%–100% |
| GPU | GPU Memory Bandwidth Usage | ma_container_gpu_mem_copy_util | GPU memory bandwidth usage of a measured object. For example, the maximum memory bandwidth of NVIDIA GPU V100 is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%. | % | 0%–100% |
| GPU | GPU Encoder Usage | ma_container_gpu_enc_util | GPU encoder usage of a measured object | % | 0%–100% |
| GPU | GPU Decoder Usage | ma_container_gpu_dec_util | GPU decoder usage of a measured object | % | 0%–100% |
| GPU | GPU Temperature | DCGM_FI_DEV_GPU_TEMP | GPU temperature | °C | Natural number |
| GPU | GPU Power | DCGM_FI_DEV_POWER_USAGE | GPU power | Watt (W) | > 0 |
| GPU | GPU Memory Temperature | DCGM_FI_DEV_MEMORY_TEMP | GPU memory temperature | °C | Natural number |
| Network I/O | Downlink Rate (BPS) | ma_container_network_receive_bytes | Inbound traffic rate of a measured object | Bytes/s | ≥ 0 |
| Network I/O | Downlink Rate (PPS) | ma_container_network_receive_packets | Number of data packets received by an NIC per second | Packets/s | ≥ 0 |
| Network I/O | Downlink Error Rate | ma_container_network_receive_error_packets | Number of error packets received by an NIC per second | Packets/s | ≥ 0 |
| Network I/O | Uplink Rate (BPS) | ma_container_network_transmit_bytes | Outbound traffic rate of a measured object | Bytes/s | ≥ 0 |
| Network I/O | Uplink Error Rate | ma_container_network_transmit_error_packets | Number of error packets sent by an NIC per second | Packets/s | ≥ 0 |
| Network I/O | Uplink Rate (PPS) | ma_container_network_transmit_packets | Number of data packets sent by an NIC per second | Packets/s | ≥ 0 |
| Notebook service metrics | Notebook Cache Directory Size | ma_container_notebook_cache_dir_size_bytes | A high-speed local disk is attached to the /cache directory for GPU notebook instances. This metric indicates the total size of the directory. | Bytes | ≥ 0 |
| Notebook service metrics | Notebook Cache Directory Utilization | ma_container_notebook_cache_dir_util | A high-speed local disk is attached to the /cache directory for GPU notebook instances. This metric indicates the utilization of the directory. | % | 0%–100% |
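
The percentage metrics in Table 1 are ratios of a used amount to a total. The sketch below reproduces the V100 bandwidth example from the table (450 GB/s of 900 GB/s) and applies the same ratio to GPU memory. The function name and the GPU memory readings are hypothetical, and clamping the result to 0–100 is an assumption made here to match the documented value range.

```python
def utilization_percent(used: float, total: float) -> float:
    """Return used/total as a percentage, clamped to the range 0-100."""
    if total <= 0:
        raise ValueError("total must be positive")
    return min(100.0, max(0.0, used / total * 100.0))


# Bandwidth example from Table 1: 450 GB/s in use out of a 900 GB/s maximum.
bandwidth_util = utilization_percent(450, 900)

# Hypothetical readings of ma_container_gpu_mem_used_megabytes and
# ma_container_gpu_mem_total_megabytes for one container.
gpu_mem_util = utilization_percent(4096, 16384)

print(bandwidth_util)  # 50.0
print(gpu_mem_util)    # 25.0
```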

Table 2 Node metrics (collected only in dedicated resource pools)

| Classification | Name | Metric | Description | Unit | Value Range |
| --- | --- | --- | --- | --- | --- |
| CPU | Total CPU Cores | ma_node_cpu_limit_core | Total number of CPU cores requested for a measured object | Cores | ≥ 1 |
| CPU | Used CPU Cores | ma_node_cpu_used_core | Number of CPU cores used by a measured object | Cores | ≥ 0 |
| CPU | CPU Usage | ma_node_cpu_util | CPU usage of a measured object | % | 0%–100% |
| CPU | CPU I/O Wait Time | ma_node_cpu_iowait_counter | Disk I/O wait time accumulated since system startup | jiffies | ≥ 0 |
| Memory | Physical Memory Usage | ma_node_memory_util | Percentage of the used physical memory to the total physical memory | % | 0%–100% |
| Memory | Total Physical Memory | ma_node_memory_total_megabytes | Total physical memory requested for a measured object | MB | ≥ 0 |
| Network I/O | Downlink Rate (BPS) | ma_node_network_receive_rate_bytes_seconds | Inbound traffic rate of a measured object | Bytes/s | ≥ 0 |
| Network I/O | Uplink Rate (BPS) | ma_node_network_transmit_rate_bytes_seconds | Outbound traffic rate of a measured object | Bytes/s | ≥ 0 |
| Storage | Disk Read Rate | ma_node_disk_read_rate_kilobytes_seconds | Volume of data read from a disk per second (Only data disks used by containers are collected.) | KB/s | ≥ 0 |
| Storage | Disk Write Rate | ma_node_disk_write_rate_kilobytes_seconds | Volume of data written to a disk per second (Only data disks used by containers are collected.) | KB/s | ≥ 0 |
| Storage | Total Cache | ma_node_cache_space_capacity_megabytes | Total cache of the Kubernetes space | MB | ≥ 0 |
| Storage | Used Cache | ma_node_cache_space_used_capacity_megabytes | Used cache of the Kubernetes space | MB | ≥ 0 |
| Storage | Total Container Space | ma_node_container_space_capacity_megabytes | Total container space | MB | ≥ 0 |
| Storage | Used Container Space | ma_node_container_space_used_capacity_megabytes | Used container space | MB | ≥ 0 |
| Storage | Disk Information | ma_node_disk_info | Basic disk information | N/A | ≥ 0 |
| Storage | Total Reads | ma_node_disk_reads_completed_total | Total number of successful reads | N/A | ≥ 0 |
| Storage | Merged Reads | ma_node_disk_reads_merged_total | Number of merged reads | N/A | ≥ 0 |
| Storage | Bytes Read | ma_node_disk_read_bytes_total | Total number of bytes that are successfully read | Bytes | ≥ 0 |
| Storage | Read Time Spent | ma_node_disk_read_time_seconds_total | Time spent on all reads | Seconds | ≥ 0 |
| Storage | Total Writes | ma_node_disk_writes_completed_total | Total number of successful writes | N/A | ≥ 0 |
| Storage | Merged Writes | ma_node_disk_writes_merged_total | Number of merged writes | N/A | ≥ 0 |
| Storage | Written Bytes | ma_node_disk_written_bytes_total | Total number of bytes that are successfully written | Bytes | ≥ 0 |
| Storage | Write Time Spent | ma_node_disk_write_time_seconds_total | Time spent on all writes | Seconds | ≥ 0 |
| Storage | Ongoing I/Os | ma_node_disk_io_now | Number of ongoing I/Os | N/A | ≥ 0 |
| Storage | I/O Execution Duration | ma_node_disk_io_time_seconds_total | Time spent on executing I/Os | Seconds | ≥ 0 |
| Storage | I/O Execution Weighted Time | ma_node_disk_io_time_weighted_seconds_total | Weighted time spent on executing I/Os | Seconds | ≥ 0 |
| GPU | GPU Usage | ma_node_gpu_util | GPU usage of a measured object | % | 0%–100% |
| GPU | Total GPU Memory | ma_node_gpu_mem_total_megabytes | Total GPU memory of a measured object | MB | > 0 |
| GPU | GPU Memory Usage | ma_node_gpu_mem_util | Percentage of the used GPU memory to the total GPU memory | % | 0%–100% |
| GPU | Used GPU Memory | ma_node_gpu_mem_used_megabytes | GPU memory used by a measured object | MB | ≥ 0 |
| GPU | Tasks on a Shared GPU | node_gpu_share_job_count | Number of tasks running on a shared GPU | Number | ≥ 0 |
| GPU | GPU Temperature | DCGM_FI_DEV_GPU_TEMP | GPU temperature | °C | Natural number |
| GPU | GPU Power | DCGM_FI_DEV_POWER_USAGE | GPU power | Watt (W) | > 0 |
| GPU | GPU Memory Temperature | DCGM_FI_DEV_MEMORY_TEMP | GPU memory temperature | °C | Natural number |
| InfiniBand or RoCE network | Total Amount of Data Received by an NIC | ma_node_infiniband_port_received_data_bytes_total | Total number of data octets, divided by 4 (counted in 32-bit double words), received on all VLs from the port. | Double words (32 bits) | ≥ 0 |
| InfiniBand or RoCE network | Total Amount of Data Sent by an NIC | ma_node_infiniband_port_transmitted_data_bytes_total | Total number of data octets, divided by 4 (counted in 32-bit double words), transmitted on all VLs from the port. | Double words (32 bits) | ≥ 0 |
| NFS mounting status | NFS Getattr Congestion Time | ma_node_mountstats_getattr_backlog_wait | Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server. It indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0 |
| NFS mounting status | NFS Getattr Round Trip Time | ma_node_mountstats_getattr_rtt | Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner. RTT (round trip time) is the time from when the kernel RPC client sends an RPC request to when it receives the reply. RTT includes network transit time and server execution time, which makes it a good measure of NFS latency. A high RTT can indicate network or server issues. | ms | ≥ 0 |
| NFS mounting status | NFS Access Congestion Time | ma_node_mountstats_access_backlog_wait | Access is an NFS operation that checks the access permissions of a file or directory for a given user. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server. It indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0 |
| NFS mounting status | NFS Access Round Trip Time | ma_node_mountstats_access_rtt | Access is an NFS operation that checks the access permissions of a file or directory for a given user. RTT (round trip time) is the time from when the kernel RPC client sends an RPC request to when it receives the reply. RTT includes network transit time and server execution time, which makes it a good measure of NFS latency. A high RTT can indicate network or server issues. | ms | ≥ 0 |
| NFS mounting status | NFS Lookup Congestion Time | ma_node_mountstats_lookup_backlog_wait | Lookup is an NFS operation that resolves a file name in a directory to a file handle. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server. It indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0 |
| NFS mounting status | NFS Lookup Round Trip Time | ma_node_mountstats_lookup_rtt | Lookup is an NFS operation that resolves a file name in a directory to a file handle. RTT (round trip time) is the time from when the kernel RPC client sends an RPC request to when it receives the reply. RTT includes network transit time and server execution time, which makes it a good measure of NFS latency. A high RTT can indicate network or server issues. | ms | ≥ 0 |
| NFS mounting status | NFS Read Congestion Time | ma_node_mountstats_read_backlog_wait | Read is an NFS operation that reads data from a file. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server. It indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0 |
| NFS mounting status | NFS Read Round Trip Time | ma_node_mountstats_read_rtt | Read is an NFS operation that reads data from a file. RTT (round trip time) is the time from when the kernel RPC client sends an RPC request to when it receives the reply. RTT includes network transit time and server execution time, which makes it a good measure of NFS latency. A high RTT can indicate network or server issues. | ms | ≥ 0 |
| NFS mounting status | NFS Write Congestion Time | ma_node_mountstats_write_backlog_wait | Write is an NFS operation that writes data to a file. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server. It indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0 |
| NFS mounting status | NFS Write Round Trip Time | ma_node_mountstats_write_rtt | Write is an NFS operation that writes data to a file. RTT (round trip time) is the time from when the kernel RPC client sends an RPC request to when it receives the reply. RTT includes network transit time and server execution time, which makes it a good measure of NFS latency. A high RTT can indicate network or server issues. | ms | ≥ 0 |
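
Several node metrics in Table 2 are cumulative counters (names ending in _total or _counter) rather than instantaneous rates. The sketch below shows one way to turn two samples of such a counter into an average rate, and to convert a jiffies value to seconds. The helper names and sample values are hypothetical. The default of 100 jiffies per second is an assumption (a common USER_HZ value on Linux), so check the value on your nodes.

```python
from typing import Optional


def jiffies_to_seconds(jiffies: int, ticks_per_second: int = 100) -> float:
    """Convert jiffies to seconds. The default tick rate is an assumption."""
    return jiffies / ticks_per_second


def average_rate(prev_value: float, curr_value: float,
                 interval_seconds: float) -> Optional[float]:
    """Average per-second increase of a cumulative counter between two samples.

    Returns None if the counter went backwards, which usually means it was
    reset (for example, after a node restart).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if curr_value < prev_value:
        return None
    return (curr_value - prev_value) / interval_seconds


# Two hypothetical samples of ma_node_disk_read_bytes_total taken 60 s apart.
read_rate = average_rate(1_000_000, 4_000_000, 60)
print(read_rate)  # 50000.0 bytes per second

# Hypothetical increase of ma_node_cpu_iowait_counter between two samples.
print(jiffies_to_seconds(250))  # 2.5 seconds
```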

Table 3 Diagnosis (IB, collected only in dedicated resource pools)

| Classification | Name | Metric | Description | Unit | Value Range |
| --- | --- | --- | --- | --- | --- |
| InfiniBand or RoCE network | PortXmitData | infiniband_port_xmit_data_total | Total number of data octets, divided by 4 (counted in 32-bit double words), transmitted on all VLs from the port. | Total count | Natural number |
| InfiniBand or RoCE network | PortRcvData | infiniband_port_rcv_data_total | Total number of data octets, divided by 4 (counted in 32-bit double words), received on all VLs from the port. | Total count | Natural number |
| InfiniBand or RoCE network | SymbolErrorCounter | infiniband_symbol_error_counter_total | Total number of minor link errors detected on one or more physical lanes. | Total count | Natural number |
| InfiniBand or RoCE network | LinkErrorRecoveryCounter | infiniband_link_error_recovery_counter_total | Total number of times the Port Training state machine has successfully completed the link error recovery process. | Total count | Natural number |
| InfiniBand or RoCE network | PortRcvErrors | infiniband_port_rcv_errors_total | Total number of packets containing errors that were received on the port, including: local physical errors (ICRC, VCRC, LPCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine); malformed data packet errors (LVer, length, VL); malformed link packet errors (operand, length, VL); and packets discarded due to buffer overrun (overflow). | Total count | Natural number |
| InfiniBand or RoCE network | LocalLinkIntegrityErrors | infiniband_local_link_integrity_errors_total | Number of retries initiated by a link transfer layer receiver. | Total count | Natural number |
| InfiniBand or RoCE network | PortRcvRemotePhysicalErrors | infiniband_port_rcv_remote_physical_errors_total | Total number of packets marked with the EBP delimiter received on the port. | Total count | Natural number |
| InfiniBand or RoCE network | PortRcvSwitchRelayErrors | infiniband_port_rcv_switch_relay_errors_total | Total number of packets received on the port that were discarded because the switch relay could not forward them, for one of the following reasons: DLID mapping, VL mapping, or looping (output port = input port). | Total count | Natural number |
| InfiniBand or RoCE network | PortXmitWait | infiniband_port_transmit_wait_total | Number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or because of lack of arbitration). | Total count | Natural number |
| InfiniBand or RoCE network | PortXmitDiscards | infiniband_port_xmit_discards_total | Total number of outbound packets discarded by the port because the port is down or congested. | Total count | Natural number |
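
The counters in Table 3 are cumulative, and the data counters are expressed in 32-bit double words rather than bytes. As a rough sketch, the code below compares two snapshots of these counters, reports which ones increased, and converts the transmitted double words to bytes (1 double word = 4 bytes). The snapshot values and helper names are hypothetical. Only the metric names come from the table.

```python
DWORD_BYTES = 4  # One double word is 32 bits, that is, 4 bytes.


def dwords_to_bytes(dwords: int) -> int:
    """Convert a count of 32-bit double words to bytes."""
    return dwords * DWORD_BYTES


def increased_counters(prev: dict[str, int], curr: dict[str, int]) -> dict[str, int]:
    """Return the increase of every counter that grew between two snapshots."""
    return {
        name: curr[name] - prev[name]
        for name in curr
        if name in prev and curr[name] > prev[name]
    }


# Two hypothetical snapshots of the same port, taken some time apart.
prev = {
    "infiniband_port_xmit_data_total": 1_000_000,
    "infiniband_symbol_error_counter_total": 3,
    "infiniband_port_xmit_discards_total": 0,
}
curr = {
    "infiniband_port_xmit_data_total": 251_000_000,
    "infiniband_symbol_error_counter_total": 5,
    "infiniband_port_xmit_discards_total": 0,
}

deltas = increased_counters(prev, curr)
sent_bytes = dwords_to_bytes(deltas["infiniband_port_xmit_data_total"])

print(deltas["infiniband_symbol_error_counter_total"])  # 2
print(sent_bytes)  # 1000000000
```

A growing error counter between snapshots, such as the symbol error counter here, is usually more informative than its absolute value, because the totals accumulate over the lifetime of the port.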

Table 4 Metric names

| Classification | Tag | Description |
| --- | --- | --- |
| Container metrics | modelarts_service | Service to which a container belongs, which can be notebook, train, or infer |
| Container metrics | instance_name | Name of the pod to which the container belongs |
| Container metrics | service_id | Instance or job ID displayed on the page, for example, cf55829e-9bd3-48fa-8071-7ae870dae93a for a development environment or 9f322d5a-b1d2-4370-94df-5a87de27d36e for a training job |
| Container metrics | node_ip | IP address of the node to which the container belongs |
| Container metrics | container_id | Container ID |
| Container metrics | cid | Cluster ID |
| Container metrics | container_name | Name of the container |
| Container metrics | project_id | Project ID of the account to which the user belongs |
| Container metrics | user_id | User ID of the account to which the user who submits the job belongs |
| Container metrics | npu_id | Ascend card ID, for example, davinci0 (this tag will be deprecated) |
| Container metrics | device_id | Physical ID of Ascend AI processors |
| Container metrics | device_type | Type of Ascend AI processors |
| Container metrics | pool_id | ID of a resource pool corresponding to a physical dedicated resource pool |
| Container metrics | pool_name | Name of a resource pool corresponding to a physical dedicated resource pool |
| Container metrics | logical_pool_id | ID of a logical subpool |
| Container metrics | logical_pool_name | Name of a logical subpool |
| Container metrics | gpu_uuid | UUID of the GPU used by the container |
| Container metrics | gpu_index | Index of the GPU used by the container |
| Container metrics | gpu_type | Type of the GPU used by the container |
| Container metrics | account_name | Account name of the creator of a training, inference, or development environment task |
| Container metrics | user_name | Username of the creator of a training, inference, or development environment task |
| Container metrics | task_creation_time | Time when a training, inference, or development environment task is created |
| Container metrics | task_name | Name of a training, inference, or development environment task |
| Container metrics | task_spec_code | Specifications of a training, inference, or development environment task |
| Container metrics | cluster_name | CCE cluster name |
| Node metrics | cid | ID of the CCE cluster to which the node belongs |
| Node metrics | node_ip | IP address of the node |
| Node metrics | host_name | Hostname of a node |
| Node metrics | pool_id | ID of a resource pool corresponding to a physical dedicated resource pool |
| Node metrics | project_id | Project ID of the user in a physical dedicated resource pool |
| Node metrics | npu_id | Ascend card ID, for example, davinci0 (this tag will be deprecated) |
| Node metrics | device_id | Physical ID of Ascend AI processors |
| Node metrics | device_type | Type of Ascend AI processors |
| Node metrics | gpu_uuid | UUID of a node GPU |
| Node metrics | gpu_index | Index of a node GPU |
| Node metrics | gpu_type | Type of a node GPU |
| Node metrics | device_name | Device name of an InfiniBand or RoCE network NIC |
| Node metrics | port | Port number of the IB NIC |
| Node metrics | physical_state | Status of each port on the IB NIC |
| Node metrics | firmware_version | Firmware version of the IB NIC |
| Node metrics | filesystem | NFS-mounted file system |
| Node metrics | mount_point | NFS mount point |
| Diagnosis | cid | ID of the CCE cluster to which the node where the GPU resides belongs |
| Diagnosis | node_ip | IP address of the node where the GPU resides |
| Diagnosis | pool_id | ID of a resource pool corresponding to a physical dedicated resource pool |
| Diagnosis | project_id | Project ID of the user in a physical dedicated resource pool |
| Diagnosis | gpu_uuid | GPU UUID |
| Diagnosis | gpu_index | Index of a node GPU |
| Diagnosis | gpu_type | Type of a node GPU |
| Diagnosis | device_name | Name of a network device or disk device |
| Diagnosis | port | Port number of the IB NIC |
| Diagnosis | physical_state | Status of each port on the IB NIC |
| Diagnosis | firmware_version | Firmware version of the IB NIC |