Viewing All ModelArts Monitoring Metrics on the AOM Console
ModelArts periodically collects key metrics (such as GPU, NPU, CPU, and memory usage) of each node in a resource pool, as well as key metrics of development environments, training jobs, and inference services, and reports the data to AOM. You can view this information on the AOM console.
1. Log in to the console and search for AOM to go to the AOM console.
2. Choose Metric Monitoring. On the Metric Monitoring page that is displayed, click Add Metric.
3. Add metrics and click Add to Metric List.
4. View the metrics.
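Metrics reported to AOM can also be queried programmatically. The following is a minimal sketch, assuming your monitoring backend exposes a Prometheus-compatible HTTP query API; the endpoint URL, token handling, and the `pool-123` value are placeholders for illustration, not a definitive ModelArts or AOM API.

```python
# Minimal sketch: query a ModelArts metric through a Prometheus-compatible
# HTTP API. The endpoint URL and authentication are placeholders -- replace
# them with the query URL and token of your own monitoring instance.
import requests

PROM_URL = "https://prometheus.example.com/api/v1/query"  # placeholder endpoint
TOKEN = "REPLACE_ME"  # placeholder auth token

def query_metric(promql: str) -> list:
    """Run an instant PromQL query and return the result series."""
    resp = requests.get(
        PROM_URL,
        params={"query": promql},
        headers={"X-Auth-Token": TOKEN},
        timeout=10,
    )
    resp.raise_for_status()
    body = resp.json()
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]

# Example: CPU usage of every container in one dedicated resource pool
# (the pool_id value is hypothetical).
for series in query_metric('ma_container_cpu_util{pool_id="pool-123"}'):
    print(series["metric"].get("container_name"), series["value"][1])
```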
The following table lists the container metrics.

Classification | Name | Metric | Description | Unit | Value Range
---|---|---|---|---|---
CPU | CPU Usage | ma_container_cpu_util | CPU usage of a measured object | % | 0%–100%
CPU | Used CPU Cores | ma_container_cpu_used_core | Number of CPU cores used by a measured object | Cores | ≥ 0
CPU | Total CPU Cores | ma_container_cpu_limit_core | Total number of CPU cores that have been applied for a measured object | Cores | ≥ 1
Memory | Total Physical Memory | ma_container_memory_capacity_megabytes | Total physical memory that has been applied for a measured object | MB | ≥ 0
Memory | Physical Memory Usage | ma_container_memory_util | Percentage of the used physical memory to the total physical memory | % | 0%–100%
Memory | Used Physical Memory | ma_container_memory_used_megabytes | Physical memory used by a measured object, that is, the current working set (container_memory_working_set_bytes). Working-set memory = active anonymous pages plus cache (file-backed pages), which is ≤ container_memory_usage_bytes. | MB | ≥ 0
Storage | Disk Read Rate | ma_container_disk_read_kilobytes | Volume of data read from a disk per second | KB/s | ≥ 0
Storage | Disk Write Rate | ma_container_disk_write_kilobytes | Volume of data written to a disk per second | KB/s | ≥ 0
GPU memory | Total GPU Memory | ma_container_gpu_mem_total_megabytes | Total GPU memory of a training job | MB | > 0
GPU memory | GPU Memory Usage | ma_container_gpu_mem_util | Percentage of the used GPU memory to the total GPU memory | % | 0%–100%
GPU memory | Used GPU Memory | ma_container_gpu_mem_used_megabytes | GPU memory used by a measured object | MB | ≥ 0
GPU | GPU Usage | ma_container_gpu_util | GPU usage of a measured object | % | 0%–100%
GPU | GPU Memory Bandwidth Usage | ma_container_gpu_mem_copy_util | GPU memory bandwidth usage of a measured object. For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s; if the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%. | % | 0%–100%
GPU | GPU Encoder Usage | ma_container_gpu_enc_util | GPU encoder usage of a measured object | % | 0%–100%
GPU | GPU Decoder Usage | ma_container_gpu_dec_util | GPU decoder usage of a measured object | % | 0%–100%
GPU | GPU Temperature | DCGM_FI_DEV_GPU_TEMP | GPU temperature | °C | Natural number
GPU | GPU Power | DCGM_FI_DEV_POWER_USAGE | GPU power | Watt (W) | > 0
GPU | GPU Memory Temperature | DCGM_FI_DEV_MEMORY_TEMP | GPU memory temperature | °C | Natural number
Network I/O | Downlink Rate (BPS) | ma_container_network_receive_bytes | Inbound traffic rate of a measured object | Bytes/s | ≥ 0
Network I/O | Downlink Rate (PPS) | ma_container_network_receive_packets | Number of data packets received by an NIC per second | Packets/s | ≥ 0
Network I/O | Downlink Error Rate | ma_container_network_receive_error_packets | Number of error packets received by an NIC per second | Packets/s | ≥ 0
Network I/O | Uplink Rate (BPS) | ma_container_network_transmit_bytes | Outbound traffic rate of a measured object | Bytes/s | ≥ 0
Network I/O | Uplink Error Rate | ma_container_network_transmit_error_packets | Number of error packets sent by an NIC per second | Packets/s | ≥ 0
Network I/O | Uplink Rate (PPS) | ma_container_network_transmit_packets | Number of data packets sent by an NIC per second | Packets/s | ≥ 0
Notebook service metrics | Notebook Cache Directory Size | ma_container_notebook_cache_dir_size_bytes | A high-speed local disk is attached to the /cache directory of GPU notebook instances. This metric indicates the total size of the directory. | Bytes | ≥ 0
Notebook service metrics | Notebook Cache Directory Utilization | ma_container_notebook_cache_dir_util | A high-speed local disk is attached to the /cache directory of GPU notebook instances. This metric indicates the utilization of the directory. | % | 0%–100%
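Several usage metrics above are simple ratios of a used metric to its total. A minimal sketch that derives GPU memory usage from the raw metrics, as a cross-check against the reported ma_container_gpu_mem_util gauge; the values are illustrative samples, not live data.

```python
# Minimal sketch: derive GPU memory usage from the used and total metrics.
# The sample values are illustrative, not live measurements.
used_mb = 12_288.0    # sample of ma_container_gpu_mem_used_megabytes
total_mb = 16_384.0   # sample of ma_container_gpu_mem_total_megabytes

derived_util = used_mb / total_mb * 100  # percentage, 0-100
print(f"derived GPU memory usage: {derived_util:.1f}%")  # -> 75.0%
```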
The following table lists the node metrics.

Classification | Name | Metric | Description | Unit | Value Range
---|---|---|---|---|---
CPU | Total CPU Cores | ma_node_cpu_limit_core | Total number of CPU cores that have been applied for a measured object | Cores | ≥ 1
CPU | Used CPU Cores | ma_node_cpu_used_core | Number of CPU cores used by a measured object | Cores | ≥ 0
CPU | CPU Usage | ma_node_cpu_util | CPU usage of a measured object | % | 0%–100%
CPU | CPU I/O Wait Time | ma_node_cpu_iowait_counter | Disk I/O wait time accumulated since system startup | jiffies | ≥ 0
Memory | Physical Memory Usage | ma_node_memory_util | Percentage of the used physical memory to the total physical memory | % | 0%–100%
Memory | Total Physical Memory | ma_node_memory_total_megabytes | Total physical memory that has been applied for a measured object | MB | ≥ 0
Network I/O | Downlink Rate (BPS) | ma_node_network_receive_rate_bytes_seconds | Inbound traffic rate of a measured object | Bytes/s | ≥ 0
Network I/O | Uplink Rate (BPS) | ma_node_network_transmit_rate_bytes_seconds | Outbound traffic rate of a measured object | Bytes/s | ≥ 0
Storage | Disk Read Rate | ma_node_disk_read_rate_kilobytes_seconds | Volume of data read from a disk per second (only data disks used by containers are collected) | KB/s | ≥ 0
Storage | Disk Write Rate | ma_node_disk_write_rate_kilobytes_seconds | Volume of data written to a disk per second (only data disks used by containers are collected) | KB/s | ≥ 0
Storage | Total Cache | ma_node_cache_space_capacity_megabytes | Total cache of the Kubernetes space | MB | ≥ 0
Storage | Used Cache | ma_node_cache_space_used_capacity_megabytes | Used cache of the Kubernetes space | MB | ≥ 0
Storage | Total Container Space | ma_node_container_space_capacity_megabytes | Total container space | MB | ≥ 0
Storage | Used Container Space | ma_node_container_space_used_capacity_megabytes | Used container space | MB | ≥ 0
Storage | Disk Information | ma_node_disk_info | Basic disk information | N/A | ≥ 0
Storage | Total Reads | ma_node_disk_reads_completed_total | Total number of successful reads | N/A | ≥ 0
Storage | Merged Reads | ma_node_disk_reads_merged_total | Number of merged reads | N/A | ≥ 0
Storage | Bytes Read | ma_node_disk_read_bytes_total | Total number of bytes successfully read | Bytes | ≥ 0
Storage | Read Time Spent | ma_node_disk_read_time_seconds_total | Time spent on all reads | Seconds | ≥ 0
Storage | Total Writes | ma_node_disk_writes_completed_total | Total number of successful writes | N/A | ≥ 0
Storage | Merged Writes | ma_node_disk_writes_merged_total | Number of merged writes | N/A | ≥ 0
Storage | Written Bytes | ma_node_disk_written_bytes_total | Total number of bytes successfully written | Bytes | ≥ 0
Storage | Write Time Spent | ma_node_disk_write_time_seconds_total | Time spent on all writes | Seconds | ≥ 0
Storage | Ongoing I/Os | ma_node_disk_io_now | Number of ongoing I/Os | N/A | ≥ 0
Storage | I/O Execution Duration | ma_node_disk_io_time_seconds_total | Time spent executing I/Os | Seconds | ≥ 0
Storage | I/O Execution Weighted Time | ma_node_disk_io_time_weighted_seconds_total | Weighted time spent executing I/Os | Seconds | ≥ 0
GPU | GPU Usage | ma_node_gpu_util | GPU usage of a measured object | % | 0%–100%
GPU | Total GPU Memory | ma_node_gpu_mem_total_megabytes | Total GPU memory of a measured object | MB | > 0
GPU | GPU Memory Usage | ma_node_gpu_mem_util | Percentage of the used GPU memory to the total GPU memory | % | 0%–100%
GPU | Used GPU Memory | ma_node_gpu_mem_used_megabytes | GPU memory used by a measured object | MB | ≥ 0
GPU | Tasks on a Shared GPU | node_gpu_share_job_count | Number of tasks running on a shared GPU | Number | ≥ 0
GPU | GPU Temperature | DCGM_FI_DEV_GPU_TEMP | GPU temperature | °C | Natural number
GPU | GPU Power | DCGM_FI_DEV_POWER_USAGE | GPU power | Watt (W) | > 0
GPU | GPU Memory Temperature | DCGM_FI_DEV_MEMORY_TEMP | GPU memory temperature | °C | Natural number
InfiniBand or RoCE network | Total Amount of Data Received by an NIC | ma_node_infiniband_port_received_data_bytes_total | Total number of data octets, divided by 4 (counted in 32-bit double words), received on all VLs from the port | Double words (32 bits) | ≥ 0
InfiniBand or RoCE network | Total Amount of Data Sent by an NIC | ma_node_infiniband_port_transmitted_data_bytes_total | Total number of data octets, divided by 4 (counted in 32-bit double words), transmitted on all VLs from the port | Double words (32 bits) | ≥ 0
NFS mounting status | NFS Getattr Congestion Time | ma_node_mountstats_getattr_backlog_wait | Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0
NFS mounting status | NFS Getattr Round Trip Time | ma_node_mountstats_getattr_rtt | Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥ 0
NFS mounting status | NFS Access Congestion Time | ma_node_mountstats_access_backlog_wait | Access is an NFS operation that checks the access permissions of a file or directory for a given user. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0
NFS mounting status | NFS Access Round Trip Time | ma_node_mountstats_access_rtt | Access is an NFS operation that checks the access permissions of a file or directory for a given user. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥ 0
NFS mounting status | NFS Lookup Congestion Time | ma_node_mountstats_lookup_backlog_wait | Lookup is an NFS operation that resolves a file name in a directory to a file handle. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0
NFS mounting status | NFS Lookup Round Trip Time | ma_node_mountstats_lookup_rtt | Lookup is an NFS operation that resolves a file name in a directory to a file handle. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥ 0
NFS mounting status | NFS Read Congestion Time | ma_node_mountstats_read_backlog_wait | Read is an NFS operation that reads data from a file. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0
NFS mounting status | NFS Read Round Trip Time | ma_node_mountstats_read_rtt | Read is an NFS operation that reads data from a file. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥ 0
NFS mounting status | NFS Write Congestion Time | ma_node_mountstats_write_backlog_wait | Write is an NFS operation that writes data to a file. Backlog wait is the time that NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0
NFS mounting status | NFS Write Round Trip Time | ma_node_mountstats_write_rtt | Write is an NFS operation that writes data to a file. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥ 0
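Note that the disk metrics whose names end in _total (for example, ma_node_disk_read_bytes_total) are cumulative counters, not rates; a usable rate is the delta between two samples divided by the sampling interval (in PromQL, typically an expression like rate(ma_node_disk_read_bytes_total[5m])). A minimal sketch with illustrative sample values:

```python
# Minimal sketch: cumulative counters such as ma_node_disk_read_bytes_total
# only grow, so a read rate comes from the delta between two samples
# divided by the sampling interval. Sample values are illustrative.
t0_bytes, t1_bytes = 1_250_000_000, 1_265_728_640  # two samples, 60 s apart
interval_s = 60.0

read_rate_bps = (t1_bytes - t0_bytes) / interval_s  # bytes per second
print(f"disk read rate: {read_rate_bps / 1024:.1f} KB/s")  # -> 256.0 KB/s
```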
The following table lists the diagnosis metrics for InfiniBand or RoCE networks.

Classification | Name | Metric | Description | Unit | Value Range
---|---|---|---|---|---
InfiniBand or RoCE network | PortXmitData | infiniband_port_xmit_data_total | Total number of data octets, divided by 4 (counted in 32-bit double words), transmitted on all VLs from the port | Total count | Natural number
InfiniBand or RoCE network | PortRcvData | infiniband_port_rcv_data_total | Total number of data octets, divided by 4 (counted in 32-bit double words), received on all VLs from the port | Total count | Natural number
InfiniBand or RoCE network | SymbolErrorCounter | infiniband_symbol_error_counter_total | Total number of minor link errors detected on one or more physical lanes | Total count | Natural number
InfiniBand or RoCE network | LinkErrorRecoveryCounter | infiniband_link_error_recovery_counter_total | Total number of times the Port Training state machine has successfully completed the link error recovery process | Total count | Natural number
InfiniBand or RoCE network | PortRcvErrors | infiniband_port_rcv_errors_total | Total number of packets containing errors that were received on the port, including local physical errors (ICRC, VCRC, LPCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine), malformed data packet errors (LVer, length, VL), malformed link packet errors (operand, length, VL), and packets discarded due to buffer overrun (overflow) | Total count | Natural number
InfiniBand or RoCE network | LocalLinkIntegrityErrors | infiniband_local_link_integrity_errors_total | Number of retries initiated by a link transfer layer receiver | Total count | Natural number
InfiniBand or RoCE network | PortRcvRemotePhysicalErrors | infiniband_port_rcv_remote_physical_errors_total | Total number of packets marked with the EBP delimiter received on the port | Total count | Natural number
InfiniBand or RoCE network | PortRcvSwitchRelayErrors | infiniband_port_rcv_switch_relay_errors_total | Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay due to DLID mapping, VL mapping, or looping (output port = input port) | Total count | Natural number
InfiniBand or RoCE network | PortXmitWait | infiniband_port_transmit_wait_total | Number of ticks during which the port had data to transmit but no data was sent during the entire tick (because of insufficient credits or lack of arbitration) | Total count | Natural number
InfiniBand or RoCE network | PortXmitDiscards | infiniband_port_xmit_discards_total | Total number of outbound packets discarded by the port because the port is down or congested | Total count | Natural number
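Because PortXmitData and PortRcvData count 32-bit double words (octets divided by 4), multiply a counter delta by 4 to recover bytes. A minimal sketch with illustrative sample values:

```python
# Minimal sketch: infiniband_port_xmit_data_total counts 32-bit double
# words, so multiply the counter delta by 4 to get octets (bytes).
# Sample values are illustrative.
t0_dwords, t1_dwords = 100_000_000_000, 123_625_000_000  # samples, 30 s apart
interval_s = 30.0

tx_bytes_per_s = (t1_dwords - t0_dwords) * 4 / interval_s
print(f"port TX throughput: {tx_bytes_per_s / 1e9:.2f} GB/s")  # -> 3.15 GB/s
```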
The following table describes the labels attached to these metrics.

Classification | Label | Description
---|---|---
Container metrics | modelarts_service | Service to which the container belongs: notebook, train, or infer
Container metrics | instance_name | Name of the pod to which the container belongs
Container metrics | service_id | Instance or job ID displayed on the page, for example, cf55829e-9bd3-48fa-8071-7ae870dae93a for a development environment or 9f322d5a-b1d2-4370-94df-5a87de27d36e for a training job
Container metrics | node_ip | IP address of the node to which the container belongs
Container metrics | container_id | Container ID
Container metrics | cid | Cluster ID
Container metrics | container_name | Name of the container
Container metrics | project_id | Project ID of the account to which the user belongs
Container metrics | user_id | User ID of the account of the user who submitted the job
Container metrics | npu_id | Ascend card ID, for example, davinci0 (deprecated)
Container metrics | device_id | Physical ID of an Ascend AI processor
Container metrics | device_type | Type of an Ascend AI processor
Container metrics | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool
Container metrics | pool_name | Name of the resource pool corresponding to a physical dedicated resource pool
Container metrics | logical_pool_id | ID of a logical subpool
Container metrics | logical_pool_name | Name of a logical subpool
Container metrics | gpu_uuid | UUID of the GPU used by the container
Container metrics | gpu_index | Index of the GPU used by the container
Container metrics | gpu_type | Type of the GPU used by the container
Container metrics | account_name | Account name of the creator of a training, inference, or development environment task
Container metrics | user_name | Username of the creator of a training, inference, or development environment task
Container metrics | task_creation_time | Time when a training, inference, or development environment task was created
Container metrics | task_name | Name of a training, inference, or development environment task
Container metrics | task_spec_code | Specifications of a training, inference, or development environment task
Container metrics | cluster_name | CCE cluster name
Node metrics | cid | ID of the CCE cluster to which the node belongs
Node metrics | node_ip | IP address of the node
Node metrics | host_name | Hostname of the node
Node metrics | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool
Node metrics | project_id | Project ID of the user in a physical dedicated resource pool
Node metrics | npu_id | Ascend card ID, for example, davinci0 (deprecated)
Node metrics | device_id | Physical ID of an Ascend AI processor
Node metrics | device_type | Type of an Ascend AI processor
Node metrics | gpu_uuid | UUID of a node GPU
Node metrics | gpu_index | Index of a node GPU
Node metrics | gpu_type | Type of a node GPU
Node metrics | device_name | Device name of an InfiniBand or RoCE network NIC
Node metrics | port | Port number of the IB NIC
Node metrics | physical_state | Status of each port on the IB NIC
Node metrics | firmware_version | Firmware version of the IB NIC
Node metrics | filesystem | NFS-mounted file system
Node metrics | mount_point | NFS mount point
Diagnosis metrics | cid | ID of the CCE cluster to which the node where the GPU resides belongs
Diagnosis metrics | node_ip | IP address of the node where the GPU resides
Diagnosis metrics | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool
Diagnosis metrics | project_id | Project ID of the user in a physical dedicated resource pool
Diagnosis metrics | gpu_uuid | GPU UUID
Diagnosis metrics | gpu_index | Index of a node GPU
Diagnosis metrics | gpu_type | Type of a node GPU
Diagnosis metrics | device_name | Name of a network device or disk device
Diagnosis metrics | port | Port number of the IB NIC
Diagnosis metrics | physical_state | Status of each port on the IB NIC
Diagnosis metrics | firmware_version | Firmware version of the IB NIC
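These labels are what make the metrics filterable when you query them. A minimal sketch that renders a PromQL-style selector from label filters; the label values are hypothetical:

```python
# Minimal sketch: combine the labels above into a PromQL-style selector to
# narrow a metric to one service on one node. Label values are hypothetical.
def selector(metric: str, **labels: str) -> str:
    """Render a PromQL instant selector from keyword label filters."""
    filters = ",".join(f'{k}="{v}"' for k, v in labels.items())
    return f"{metric}{{{filters}}}"

print(selector(
    "ma_container_gpu_util",
    modelarts_service="train",
    pool_id="pool-123",
    node_ip="192.168.0.10",
))
# -> ma_container_gpu_util{modelarts_service="train",pool_id="pool-123",node_ip="192.168.0.10"}
```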