Viewing All ModelArts Monitoring Metrics on the AOM Console
ModelArts periodically collects key resource metrics (such as GPU, NPU, CPU, and memory usage) for each node in a resource pool, as well as for development environments, training jobs, and inference services, and reports the data to AOM, where you can view it.
- Log in to the console, search for AOM, and go to the AOM console.
- In the navigation pane on the left, choose Metric Browsing.
- Select the Prometheus_AOM_Default instance from the drop-down list.
Figure 1 Specifying the metric source
- Select one or more metrics from All metrics, or enter a Prometheus statement.
Figure 2 Adding a metric
For details about how to view metrics, see Application Operations Management > User Guide (2.0) > Metric Browsing in the Huawei Cloud Help Center.
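If you query with a Prometheus statement, you enter a PromQL expression directly. As a minimal sketch, the following Python snippet runs one such query over the metrics listed below; the endpoint URL and token header are placeholder assumptions, not the documented AOM API.

```python
import requests

# Hypothetical sketch: query a ModelArts metric through a
# Prometheus-compatible HTTP API. The endpoint URL and auth header
# are placeholders, not the documented AOM interface.
PROM_URL = "https://aom.example.com/api/v1/query"   # placeholder endpoint
HEADERS = {"X-Auth-Token": "<your-token>"}          # placeholder auth

# Average GPU usage per training job over the last 5 minutes.
promql = ('avg by (service_id) ('
          'avg_over_time(ma_container_gpu_util{modelarts_service="train"}[5m]))')

resp = requests.get(PROM_URL, params={"query": promql},
                    headers=HEADERS, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("service_id"), series["value"][1])
```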
The following tables list the metrics and labels supported by ModelArts.
Table 1 Container-level metrics

| Category | Name | Metric | Description | Unit | Value Range |
|---|---|---|---|---|---|
| CPU | CPU Usage | ma_container_cpu_util | CPU usage of a measured object | % | 0%–100% |
| | Used CPU Cores | ma_container_cpu_used_core | Number of CPU cores used by a measured object | Cores | ≥ 0 |
| | Total CPU Cores | ma_container_cpu_limit_core | Total number of CPU cores requested for a measured object | Cores | ≥ 1 |
| Memory | Total Physical Memory | ma_container_memory_capacity_megabytes | Total physical memory requested for a measured object | MB | ≥ 0 |
| | Physical Memory Usage | ma_container_memory_util | Percentage of used physical memory to total physical memory | % | 0%–100% |
| | Used Physical Memory | ma_container_memory_used_megabytes | Physical memory used by a measured object (container_memory_working_set_bytes in the current working set; working-set usage = active anonymous pages plus cache and file-backed pages, ≤ container_memory_usage_bytes) | MB | ≥ 0 |
| Storage | Disk Read Rate | ma_container_disk_read_kilobytes | Volume of data read from a disk per second | KB/s | ≥ 0 |
| | Disk Write Rate | ma_container_disk_write_kilobytes | Volume of data written to a disk per second | KB/s | ≥ 0 |
| GPU memory | Total GPU Memory | ma_container_gpu_mem_total_megabytes | Total GPU memory of a training job | MB | > 0 |
| | GPU Memory Usage | ma_container_gpu_mem_util | Percentage of used GPU memory to total GPU memory | % | 0%–100% |
| | Used GPU Memory | ma_container_gpu_mem_used_megabytes | GPU memory used by a measured object | MB | ≥ 0 |
| GPU | GPU Usage | ma_container_gpu_util | GPU usage of a measured object | % | 0%–100% |
| | GPU Memory Bandwidth Usage | ma_container_gpu_mem_copy_util | GPU memory bandwidth usage of a measured object. For example, the maximum memory bandwidth of a Vnt1 GPU is 900 GB/s; if the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%. | % | 0%–100% |
| | GPU Encoder Usage | ma_container_gpu_enc_util | GPU encoder usage of a measured object | % | 0%–100% |
| | GPU Decoder Usage | ma_container_gpu_dec_util | GPU decoder usage of a measured object | % | 0%–100% |
| | GPU Temperature | DCGM_FI_DEV_GPU_TEMP | GPU temperature | °C | Natural number |
| | GPU Power | DCGM_FI_DEV_POWER_USAGE | GPU power | Watt (W) | > 0 |
| | GPU Memory Temperature | DCGM_FI_DEV_MEMORY_TEMP | GPU memory temperature | °C | Natural number |
| Network I/O | Downlink Rate (BPS) | ma_container_network_receive_bytes | Inbound traffic rate of a measured object | Bytes/s | ≥ 0 |
| | Downlink Rate (PPS) | ma_container_network_receive_packets | Number of data packets received by a NIC per second | Packets/s | ≥ 0 |
| | Downlink Error Rate | ma_container_network_receive_error_packets | Number of error packets received by a NIC per second | Packets/s | ≥ 0 |
| | Uplink Rate (BPS) | ma_container_network_transmit_bytes | Outbound traffic rate of a measured object | Bytes/s | ≥ 0 |
| | Uplink Error Rate | ma_container_network_transmit_error_packets | Number of error packets sent by a NIC per second | Packets/s | ≥ 0 |
| | Uplink Rate (PPS) | ma_container_network_transmit_packets | Number of data packets sent by a NIC per second | Packets/s | ≥ 0 |
| Notebook service metrics | Notebook Cache Directory Size | ma_container_notebook_cache_dir_size_bytes | Total size of the /cache directory, the high-speed local disk attached to GPU notebook instances | Bytes | ≥ 0 |
| | Notebook Cache Directory Utilization | ma_container_notebook_cache_dir_util | Utilization of the /cache directory, the high-speed local disk attached to GPU notebook instances | % | 0%–100% |
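To illustrate how the container metrics in Table 1 combine, the sketch below builds two hypothetical PromQL expressions; the thresholds are illustrative assumptions, not recommendations from this document.

```python
# Hypothetical PromQL expressions built from the container metrics in
# Table 1. Thresholds are illustrative assumptions.
QUERIES = {
    # Notebook /cache directories above 90% utilization.
    "cache_pressure": "ma_container_notebook_cache_dir_util > 90",
    # Training containers using more than 95% of their allocated GPU
    # memory. PromQL divides the two series label-by-label, so used
    # and total pair up per container.
    "gpu_mem_pressure": (
        'ma_container_gpu_mem_used_megabytes{modelarts_service="train"}'
        ' / ma_container_gpu_mem_total_megabytes{modelarts_service="train"}'
        " * 100 > 95"
    ),
}

for name, promql in QUERIES.items():
    print(f"{name}: {promql}")
```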
Table 2 Node metrics (collected only in dedicated resource pools)

| Category | Name | Metric | Description | Unit | Value Range |
|---|---|---|---|---|---|
| CPU | Total CPU Cores | ma_node_cpu_limit_core | Total number of CPU cores requested for a measured object | Cores | ≥ 1 |
| | Used CPU Cores | ma_node_cpu_used_core | Number of CPU cores used by a measured object | Cores | ≥ 0 |
| | CPU Usage | ma_node_cpu_util | CPU usage of a measured object | % | 0%–100% |
| | CPU I/O Wait Time | ma_node_cpu_iowait_counter | Disk I/O wait time accumulated since system startup | jiffies | ≥ 0 |
| Memory | Physical Memory Usage | ma_node_memory_util | Percentage of used physical memory to total physical memory | % | 0%–100% |
| | Total Physical Memory | ma_node_memory_total_megabytes | Total physical memory requested for a measured object | MB | ≥ 0 |
| Network I/O | Downlink Rate (BPS) | ma_node_network_receive_rate_bytes_seconds | Inbound traffic rate of a measured object | Bytes/s | ≥ 0 |
| | Uplink Rate (BPS) | ma_node_network_transmit_rate_bytes_seconds | Outbound traffic rate of a measured object | Bytes/s | ≥ 0 |
| Storage | Disk Read Rate | ma_node_disk_read_rate_kilobytes_seconds | Volume of data read from a disk per second (only data disks used by containers are collected) | KB/s | ≥ 0 |
| | Disk Write Rate | ma_node_disk_write_rate_kilobytes_seconds | Volume of data written to a disk per second (only data disks used by containers are collected) | KB/s | ≥ 0 |
| | Total Cache | ma_node_cache_space_capacity_megabytes | Total cache of the Kubernetes space | MB | ≥ 0 |
| | Used Cache | ma_node_cache_space_used_capacity_megabytes | Used cache of the Kubernetes space | MB | ≥ 0 |
| | Total Container Space | ma_node_container_space_capacity_megabytes | Total container space | MB | ≥ 0 |
| | Used Container Space | ma_node_container_space_used_capacity_megabytes | Used container space | MB | ≥ 0 |
| | Disk Information | ma_node_disk_info | Basic disk information | N/A | ≥ 0 |
| | Total Reads | ma_node_disk_reads_completed_total | Total number of successful reads | N/A | ≥ 0 |
| | Merged Reads | ma_node_disk_reads_merged_total | Number of merged reads | N/A | ≥ 0 |
| | Bytes Read | ma_node_disk_read_bytes_total | Total number of bytes successfully read | Bytes | ≥ 0 |
| | Read Time Spent | ma_node_disk_read_time_seconds_total | Time spent on all reads | Seconds | ≥ 0 |
| | Total Writes | ma_node_disk_writes_completed_total | Total number of successful writes | N/A | ≥ 0 |
| | Merged Writes | ma_node_disk_writes_merged_total | Number of merged writes | N/A | ≥ 0 |
| | Written Bytes | ma_node_disk_written_bytes_total | Total number of bytes successfully written | Bytes | ≥ 0 |
| | Write Time Spent | ma_node_disk_write_time_seconds_total | Time spent on all writes | Seconds | ≥ 0 |
| | Ongoing I/Os | ma_node_disk_io_now | Number of ongoing I/Os | N/A | ≥ 0 |
| | I/O Execution Duration | ma_node_disk_io_time_seconds_total | Time spent executing I/Os | Seconds | ≥ 0 |
| | I/O Execution Weighted Time | ma_node_disk_io_time_weighted_seconds_total | Weighted time spent executing I/Os | Seconds | ≥ 0 |
| GPU | GPU Usage | ma_node_gpu_util | GPU usage of a measured object | % | 0%–100% |
| | Total GPU Memory | ma_node_gpu_mem_total_megabytes | Total GPU memory of a measured object | MB | > 0 |
| | GPU Memory Usage | ma_node_gpu_mem_util | Percentage of used GPU memory to total GPU memory | % | 0%–100% |
| | Used GPU Memory | ma_node_gpu_mem_used_megabytes | GPU memory used by a measured object | MB | ≥ 0 |
| | Tasks on a Shared GPU | node_gpu_share_job_count | Number of tasks running on a shared GPU | Number | ≥ 0 |
| | GPU Temperature | DCGM_FI_DEV_GPU_TEMP | GPU temperature | °C | Natural number |
| | GPU Power | DCGM_FI_DEV_POWER_USAGE | GPU power | Watt (W) | > 0 |
| | GPU Memory Temperature | DCGM_FI_DEV_MEMORY_TEMP | GPU memory temperature | °C | Natural number |
| InfiniBand or RoCE network | Total Amount of Data Received by a NIC | ma_node_infiniband_port_received_data_bytes_total | Total number of data octets received on all VLs from the port, divided by 4 (counted in 32-bit double words) | Double words (32 bits) | ≥ 0 |
| | Total Amount of Data Sent by a NIC | ma_node_infiniband_port_transmitted_data_bytes_total | Total number of data octets transmitted on all VLs from the port, divided by 4 (counted in 32-bit double words) | Double words (32 bits) | ≥ 0 |
| NFS mounting status | NFS Getattr Congestion Time | ma_node_mountstats_getattr_backlog_wait | Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner. Backlog wait is the time NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0 |
| | NFS Getattr Round Trip Time | ma_node_mountstats_getattr_rtt | Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥ 0 |
| | NFS Access Congestion Time | ma_node_mountstats_access_backlog_wait | Access is an NFS operation that checks the access permissions of a file or directory for a given user. Backlog wait is the time NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0 |
| | NFS Access Round Trip Time | ma_node_mountstats_access_rtt | Access is an NFS operation that checks the access permissions of a file or directory for a given user. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥ 0 |
| | NFS Lookup Congestion Time | ma_node_mountstats_lookup_backlog_wait | Lookup is an NFS operation that resolves a file name in a directory to a file handle. Backlog wait is the time NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0 |
| | NFS Lookup Round Trip Time | ma_node_mountstats_lookup_rtt | Lookup is an NFS operation that resolves a file name in a directory to a file handle. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥ 0 |
| | NFS Read Congestion Time | ma_node_mountstats_read_backlog_wait | Read is an NFS operation that reads data from a file. Backlog wait is the time NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0 |
| | NFS Read Round Trip Time | ma_node_mountstats_read_rtt | Read is an NFS operation that reads data from a file. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥ 0 |
| | NFS Write Congestion Time | ma_node_mountstats_write_backlog_wait | Write is an NFS operation that writes data to a file. Backlog wait is the time NFS requests wait in the backlog queue before being sent to the NFS server; it indicates congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times. | ms | ≥ 0 |
| | NFS Write Round Trip Time | ma_node_mountstats_write_rtt | Write is an NFS operation that writes data to a file. RTT (round trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply, including network transit time and server execution time. RTT is a good measure of NFS latency; a high RTT can indicate network or server issues. | ms | ≥ 0 |
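Several node metrics in Table 2 (the disk *_total counters and ma_node_cpu_iowait_counter) are cumulative values since system startup, so per-second figures come from a rate over a time window. A minimal sketch, with the 5-minute window as an illustrative assumption:

```python
# Hypothetical sketch: turning cumulative node counters from Table 2
# into per-second rates. The 5-minute window is an illustrative choice.

# Per-disk read throughput in bytes per second.
disk_read_bps = "rate(ma_node_disk_read_bytes_total[5m])"

# Fraction of time each disk spent busy with I/O (0-1), from the
# seconds-of-I/O-time counter.
disk_busy = "rate(ma_node_disk_io_time_seconds_total[5m])"

# NFS write round trip time is already reported in ms; smooth it over
# the same window (assuming it behaves as a gauge).
nfs_write_rtt = "avg_over_time(ma_node_mountstats_write_rtt[5m])"

for query in (disk_read_bps, disk_busy, nfs_write_rtt):
    print(query)
```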
Table 3 Diagnosis metrics (InfiniBand, collected only in dedicated resource pools)

| Category | Name | Metric | Description | Unit | Value Range |
|---|---|---|---|---|---|
| InfiniBand or RoCE network | PortXmitData | infiniband_port_xmit_data_total | Total number of data octets transmitted on all VLs from the port, divided by 4 (counted in 32-bit double words) | Total count | Natural number |
| | PortRcvData | infiniband_port_rcv_data_total | Total number of data octets received on all VLs from the port, divided by 4 (counted in 32-bit double words) | Total count | Natural number |
| | SymbolErrorCounter | infiniband_symbol_error_counter_total | Total number of minor link errors detected on one or more physical lanes | Total count | Natural number |
| | LinkErrorRecoveryCounter | infiniband_link_error_recovery_counter_total | Total number of times the Port Training state machine has successfully completed the link error recovery process | Total count | Natural number |
| | PortRcvErrors | infiniband_port_rcv_errors_total | Total number of packets containing errors received on the port, including local physical errors (ICRC, VCRC, LPCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine), malformed data packet errors (LVer, length, VL), malformed link packet errors (operand, length, VL), and packets discarded due to buffer overrun (overflow) | Total count | Natural number |
| | LocalLinkIntegrityErrors | infiniband_local_link_integrity_errors_total | Number of retries initiated by a link transfer layer receiver | Total count | Natural number |
| | PortRcvRemotePhysicalErrors | infiniband_port_rcv_remote_physical_errors_total | Total number of packets marked with the EBP delimiter received on the port | Total count | Natural number |
| | PortRcvSwitchRelayErrors | infiniband_port_rcv_switch_relay_errors_total | Total number of packets received on the port that were discarded because they could not be forwarded by the switch relay, due to DLID mapping, VL mapping, or looping (output port = input port) | Total count | Natural number |
| | PortXmitWait | infiniband_port_transmit_wait_total | Number of ticks during which the port had data to transmit but no data was sent during the entire tick, either because of insufficient credits or because of lack of arbitration | Total count | Natural number |
| | PortXmitDiscards | infiniband_port_xmit_discards_total | Total number of outbound packets discarded by the port because the port is down or congested | Total count | Natural number |
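Because the diagnostic counters in Table 3 are cumulative totals, link problems show up as recent increases rather than absolute values. A hypothetical sketch; the 10-minute window and zero threshold are assumptions:

```python
# Hypothetical sketch built on the InfiniBand counters in Table 3.
# Window lengths and thresholds are illustrative assumptions.

# Ports that detected any symbol (minor link) errors in the last 10 minutes.
symbol_errors = "increase(infiniband_symbol_error_counter_total[10m]) > 0"

# PortXmitData counts 32-bit double words (octets divided by 4), so
# multiply the per-second rate by 4 to get bytes per second.
tx_bytes_per_sec = "rate(infiniband_port_xmit_data_total[5m]) * 4"

print(symbol_errors)
print(tx_bytes_per_sec)
```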
Table 4 Metric labels

| Classification | Label | Description |
|---|---|---|
| Container metrics | modelarts_service | Service to which a container belongs, which can be notebook, train, or infer |
| | instance_name | Name of the pod to which the container belongs |
| | service_id | Instance or job ID displayed on the page, for example, cf55829e-9bd3-48fa-8071-7ae870dae93a for a development environment or 9f322d5a-b1d2-4370-94df-5a87de27d36e for a training job |
| | node_ip | IP address of the node to which the container belongs |
| | container_id | Container ID |
| | cid | Cluster ID |
| | container_name | Container name |
| | project_id | Project ID of the account to which the user belongs |
| | user_id | User ID of the account to which the user who submitted the job belongs |
| | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool |
| | pool_name | Name of the resource pool corresponding to a physical dedicated resource pool |
| | logical_pool_id | ID of a logical subpool |
| | logical_pool_name | Name of a logical subpool |
| | gpu_uuid | UUID of the GPU used by the container |
| | gpu_index | Index of the GPU used by the container |
| | gpu_type | Type of the GPU used by the container |
| | account_name | Account name of the creator of a training, inference, or development environment task |
| | user_name | Username of the creator of a training, inference, or development environment task |
| | task_creation_time | Time when a training, inference, or development environment task was created |
| | task_name | Name of a training, inference, or development environment task |
| | task_spec_code | Specifications of a training, inference, or development environment task |
| | cluster_name | CCE cluster name |
| Node metrics | cid | ID of the CCE cluster to which the node belongs |
| | node_ip | IP address of the node |
| | host_name | Hostname of the node |
| | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool |
| | project_id | Project ID of the user in a physical dedicated resource pool |
| | gpu_uuid | UUID of a node GPU |
| | gpu_index | Index of a node GPU |
| | gpu_type | Type of a node GPU |
| | device_name | Device name of an InfiniBand or RoCE network NIC |
| | port | Port number of the InfiniBand NIC |
| | physical_state | Status of each port on the InfiniBand NIC |
| | firmware_version | Firmware version of the InfiniBand NIC |
| | filesystem | NFS-mounted file system |
| | mount_point | NFS mount point |
| Diagnosis metrics | cid | ID of the CCE cluster to which the node where the GPU resides belongs |
| | node_ip | IP address of the node where the GPU resides |
| | pool_id | ID of the resource pool corresponding to a physical dedicated resource pool |
| | project_id | Project ID of the user in a physical dedicated resource pool |
| | gpu_uuid | GPU UUID |
| | gpu_index | Index of a node GPU |
| | gpu_type | Type of a node GPU |
| | device_name | Name of a network device or disk device |
| | port | Port number of the InfiniBand NIC |
| | physical_state | Status of each port on the InfiniBand NIC |
| | firmware_version | Firmware version of the InfiniBand NIC |
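The labels in Table 4 are what make metrics filterable and groupable in PromQL. Two hypothetical examples; the pool ID is a placeholder value:

```python
# Hypothetical examples of slicing metrics by the labels in Table 4.
# "pool-abc123" is a placeholder pool ID.

# GPU usage for every container in one physical dedicated resource pool.
by_pool = 'ma_container_gpu_util{pool_id="pool-abc123"}'

# Average CPU usage per user across all training jobs.
per_user = 'avg by (user_name) (ma_container_cpu_util{modelarts_service="train"})'

print(by_pool)
print(per_user)
```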