Updated on 2024-04-30 GMT+08:00

Viewing All ModelArts Monitoring Metrics on the AOM Console

ModelArts periodically collects key metrics (such as GPU, NPU, CPU, and memory usage) for each node in a resource pool, as well as for development environments, training jobs, and inference services, and reports the data to AOM. You can view these metrics on the AOM console.

  1. Log in to the console and search for AOM to go to the AOM console.
  2. Choose Monitoring > Metric Monitoring. On the Metric Monitoring page that is displayed, click Add Metric.

  3. Add metrics and click Confirm.

    • Add By: Select Dimension.
    • Metric Name: Click Custom Metrics. Select the desired ones for query. For details, see Table 1, Table 2, and Table 3.
    • Dimension: Enter the tag used to filter the metric, for example, service_id to restrict the results to one instance or job. For details, see Table 4.

  4. View the metrics.

    Table 1 Container metrics

    Category

    Name

    Metric

    Description

    Unit

    Value Range

    CPU

    CPU Usage

    ma_container_cpu_util

    CPU usage of a measured object

    %

    0%–100%

    Used CPU Cores

    ma_container_cpu_used_core

    Number of CPU cores used by a measured object

    Cores

    ≥ 0

    Total CPU Cores

    ma_container_cpu_limit_core

    Total number of CPU cores requested for a measured object

    Cores

    ≥ 1

    Memory

    Total Physical Memory

    ma_container_memory_capacity_megabytes

    Total physical memory requested for a measured object

    MB

    ≥ 0

    Physical Memory Usage

    ma_container_memory_util

    Percentage of the used physical memory to the total physical memory

    %

    0%–100%

    Used Physical Memory

    ma_container_memory_used_megabytes

    Physical memory used by a measured object (container_memory_working_set_bytes, that is, the current working set)

    (Working set memory usage = active anonymous pages and cache, and file-backed pages; this value is ≤ container_memory_usage_bytes)

    MB

    ≥ 0

    Storage

    Disk Read Rate

    ma_container_disk_read_kilobytes

    Volume of data read from a disk per second

    KB/s

    ≥ 0

    Disk Write Rate

    ma_container_disk_write_kilobytes

    Volume of data written into a disk per second

    KB/s

    ≥ 0

    GPU memory

    Total GPU Memory

    ma_container_gpu_mem_total_megabytes

    Total GPU memory of a training job

    MB

    > 0

    GPU Memory Usage

    ma_container_gpu_mem_util

    Percentage of the used GPU memory to the total GPU memory

    %

    0%–100%

    Used GPU Memory

    ma_container_gpu_mem_used_megabytes

    GPU memory used by a measured object

    MB

    ≥ 0

    GPU

    GPU Usage

    ma_container_gpu_util

    GPU usage of a measured object

    %

    0%–100%

    GPU Memory Bandwidth Usage

    ma_container_gpu_mem_copy_util

    GPU memory bandwidth usage of a measured object. For example, the maximum memory bandwidth of an NVIDIA V100 GPU is 900 GB/s. If the current memory bandwidth is 450 GB/s, the memory bandwidth usage is 50%.

    %

    0%–100%

    GPU Encoder Usage

    ma_container_gpu_enc_util

    GPU encoder usage of a measured object

    %

    0%–100%

    GPU Decoder Usage

    ma_container_gpu_dec_util

    GPU decoder usage of a measured object

    %

    0%–100%

    GPU Temperature

    DCGM_FI_DEV_GPU_TEMP

    GPU temperature

    °C

    Natural number

    GPU Power

    DCGM_FI_DEV_POWER_USAGE

    GPU power

    Watt (W)

    > 0

    GPU Memory Temperature

    DCGM_FI_DEV_MEMORY_TEMP

    GPU memory temperature

    °C

    Natural number

    Network I/O

    Downlink Rate (BPS)

    ma_container_network_receive_bytes

    Inbound traffic rate of a measured object

    Bytes/s

    ≥ 0

    Downlink Rate (PPS)

    ma_container_network_receive_packets

    Number of data packets received by a NIC per second

    Packets/s

    ≥ 0

    Downlink Error Rate

    ma_container_network_receive_error_packets

    Number of error packets received by a NIC per second

    Packets/s

    ≥ 0

    Uplink Rate (BPS)

    ma_container_network_transmit_bytes

    Outbound traffic rate of a measured object

    Bytes/s

    ≥ 0

    Uplink Error Rate

    ma_container_network_transmit_error_packets

    Number of error packets sent by a NIC per second

    Packets/s

    ≥ 0

    Uplink Rate (PPS)

    ma_container_network_transmit_packets

    Number of data packets sent by a NIC per second

    Packets/s

    ≥ 0

    Notebook service metrics

    Notebook Cache Directory Size

    ma_container_notebook_cache_dir_size_bytes

    A high-speed local disk is attached to the /cache directory for GPU notebook instances. This metric indicates the total size of the directory.

    Bytes

    ≥ 0

    Notebook Cache Directory Utilization

    ma_container_notebook_cache_dir_util

    A high-speed local disk is attached to the /cache directory for GPU notebook instances. This metric indicates the utilization of the directory.

    %

    0%–100%
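The percentage metrics in Table 1 relate a used amount to a total, as in the memory bandwidth example (450 GB/s out of 900 GB/s is 50%). The following is a minimal sketch of that arithmetic. The function name is hypothetical, and clamping the result to the documented 0%–100% range is an assumption rather than documented ModelArts behavior.

```python
def utilization_percent(used: float, total: float) -> float:
    """Return used/total as a percentage, clamped to the 0-100 range."""
    if total <= 0:
        raise ValueError("total must be positive")
    return max(0.0, min(100.0, used / total * 100.0))


# Memory bandwidth example from Table 1: 450 GB/s of a 900 GB/s maximum.
print(utilization_percent(450, 900))    # 50.0

# Hypothetical readings of ma_container_memory_used_megabytes and
# ma_container_memory_capacity_megabytes.
print(utilization_percent(6144, 8192))  # 75.0
```

The same relationship can be used to sanity-check a reported usage value against the corresponding used and total metrics.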

    Table 2 Node metrics (collected only in dedicated resource pools)

    Category

    Name

    Metric

    Description

    Unit

    Value Range

    CPU

    Total CPU Cores

    ma_node_cpu_limit_core

    Total number of CPU cores requested for a measured object

    Cores

    ≥ 1

    Used CPU Cores

    ma_node_cpu_used_core

    Number of CPU cores used by a measured object

    Cores

    ≥ 0

    CPU Usage

    ma_node_cpu_util

    CPU usage of a measured object

    %

    0%–100%

    CPU I/O Wait Time

    ma_node_cpu_iowait_counter

    Disk I/O wait time accumulated since system startup

    jiffies

    ≥ 0

    Memory

    Physical Memory Usage

    ma_node_memory_util

    Percentage of the used physical memory to the total physical memory

    %

    0%–100%

    Total Physical Memory

    ma_node_memory_total_megabytes

    Total physical memory requested for a measured object

    MB

    ≥ 0

    Network I/O

    Downlink Rate (BPS)

    ma_node_network_receive_rate_bytes_seconds

    Inbound traffic rate of a measured object

    Bytes/s

    ≥ 0

    Uplink Rate (BPS)

    ma_node_network_transmit_rate_bytes_seconds

    Outbound traffic rate of a measured object

    Bytes/s

    ≥ 0

    Storage

    Disk Read Rate

    ma_node_disk_read_rate_kilobytes_seconds

    Volume of data read from a disk per second (Only data disks used by containers are collected.)

    KB/s

    ≥ 0

    Disk Write Rate

    ma_node_disk_write_rate_kilobytes_seconds

    Volume of data written into a disk per second (Only data disks used by containers are collected.)

    KB/s

    ≥ 0

    Total Cache

    ma_node_cache_space_capacity_megabytes

    Total cache of the Kubernetes space

    MB

    ≥ 0

    Used Cache

    ma_node_cache_space_used_capacity_megabytes

    Used cache of the Kubernetes space

    MB

    ≥ 0

    Total Container Space

    ma_node_container_space_capacity_megabytes

    Total container space

    MB

    ≥ 0

    Used Container Space

    ma_node_container_space_used_capacity_megabytes

    Used container space

    MB

    ≥ 0

    Disk Information

    ma_node_disk_info

    Basic disk information

    N/A

    ≥ 0

    Total Reads

    ma_node_disk_reads_completed_total

    Total number of successful reads

    N/A

    ≥ 0

    Merged Reads

    ma_node_disk_reads_merged_total

    Number of merged reads

    N/A

    ≥ 0

    Bytes Read

    ma_node_disk_read_bytes_total

    Total number of bytes that are successfully read

    Bytes

    ≥ 0

    Read Time Spent

    ma_node_disk_read_time_seconds_total

    Time spent on all reads

    Seconds

    ≥ 0

    Total Writes

    ma_node_disk_writes_completed_total

    Total number of successful writes

    N/A

    ≥ 0

    Merged Writes

    ma_node_disk_writes_merged_total

    Number of merged writes

    N/A

    ≥ 0

    Written Bytes

    ma_node_disk_written_bytes_total

    Total number of bytes that are successfully written

    Bytes

    ≥ 0

    Write Time Spent

    ma_node_disk_write_time_seconds_total

    Time spent on all write operations

    Seconds

    ≥ 0

    Ongoing I/Os

    ma_node_disk_io_now

    Number of ongoing I/Os

    N/A

    ≥ 0

    I/O Execution Duration

    ma_node_disk_io_time_seconds_total

    Time spent on executing I/Os

    Seconds

    ≥ 0

    I/O Execution Weighted Time

    ma_node_disk_io_time_weighted_seconds_tota

    The weighted number of seconds spent doing I/Os

    Seconds

    ≥ 0

    GPU

    GPU Usage

    ma_node_gpu_util

    GPU usage of a measured object

    %

    0%–100%

    Total GPU Memory

    ma_node_gpu_mem_total_megabytes

    Total GPU memory of a measured object

    MB

    > 0

    GPU Memory Usage

    ma_node_gpu_mem_util

    Percentage of the used GPU memory to the total GPU memory

    %

    0%–100%

    Used GPU Memory

    ma_node_gpu_mem_used_megabytes

    GPU memory used by a measured object

    MB

    ≥ 0

    Tasks on a Shared GPU

    node_gpu_share_job_count

    Number of tasks running on a shared GPU

    Number

    ≥ 0

    GPU Temperature

    DCGM_FI_DEV_GPU_TEMP

    GPU temperature

    °C

    Natural number

    GPU Power

    DCGM_FI_DEV_POWER_USAGE

    GPU power

    Watt (W)

    > 0

    GPU Memory Temperature

    DCGM_FI_DEV_MEMORY_TEMP

    GPU memory temperature

    °C

    Natural number

    InfiniBand or RoCE network

    Total Amount of Data Received by a NIC

    ma_node_infiniband_port_received_data_bytes_total

    Total number of data octets, divided by 4 (that is, counted in 32-bit double words), received on all VLs at the port.

    Double words (32 bits)

    ≥ 0

    Total Amount of Data Sent by a NIC

    ma_node_infiniband_port_transmitted_data_bytes_total

    Total number of data octets, divided by 4 (that is, counted in 32-bit double words), transmitted on all VLs from the port.

    Double words (32 bits)

    ≥ 0

    NFS mounting status

    NFS Getattr Congestion Time

    ma_node_mountstats_getattr_backlog_wait

    Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

    ms

    ≥ 0

    NFS Getattr Round Trip Time

    ma_node_mountstats_getattr_rtt

    Getattr is an NFS operation that retrieves the attributes of a file or directory, such as size, permissions, and owner.

    RTT (round-trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement for NFS latency. A high RTT can indicate network or server issues.

    ms

    ≥ 0

    NFS Access Congestion Time

    ma_node_mountstats_access_backlog_wait

    Access is an NFS operation that checks the access permissions of a file or directory for a given user. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

    ms

    ≥ 0

    NFS Access Round Trip Time

    ma_node_mountstats_access_rtt

    Access is an NFS operation that checks the access permissions of a file or directory for a given user. RTT (round-trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement for NFS latency. A high RTT can indicate network or server issues.

    ms

    ≥ 0

    NFS Lookup Congestion Time

    ma_node_mountstats_lookup_backlog_wait

    Lookup is an NFS operation that resolves a file name in a directory to a file handle. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

    ms

    ≥ 0

    NFS Lookup Round Trip Time

    ma_node_mountstats_lookup_rtt

    Lookup is an NFS operation that resolves a file name in a directory to a file handle. RTT (round-trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement for NFS latency. A high RTT can indicate network or server issues.

    ms

    ≥ 0

    NFS Read Congestion Time

    ma_node_mountstats_read_backlog_wait

    Read is an NFS operation that reads data from a file. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

    ms

    ≥ 0

    NFS Read Round Trip Time

    ma_node_mountstats_read_rtt

    Read is an NFS operation that reads data from a file. RTT (round-trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement for NFS latency. A high RTT can indicate network or server issues.

    ms

    ≥ 0

    NFS Write Congestion Time

    ma_node_mountstats_write_backlog_wait

    Write is an NFS operation that writes data to a file. Backlog wait is the time that the NFS requests have to wait in the backlog queue before being sent to the NFS server. It indicates the congestion on the NFS client side. A high backlog wait can cause poor NFS performance and slow system response times.

    ms

    ≥ 0

    NFS Write Round Trip Time

    ma_node_mountstats_write_rtt

    Write is an NFS operation that writes data to a file. RTT (round-trip time) is the time from when the kernel RPC client sends the RPC request to when it receives the reply. RTT includes network transit time and server execution time. RTT is a good measurement for NFS latency. A high RTT can indicate network or server issues.

    ms

    ≥ 0
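Several node metrics in Table 2 whose names end in _total (for example, ma_node_disk_read_bytes_total and ma_node_disk_read_time_seconds_total) are described as totals. The following is a minimal sketch of turning two samples of such a metric into a per-second rate and an average time per operation. It assumes these metrics are monotonically increasing counters that restart from zero after a reset, which this document does not state explicitly. The Sample type, function names, and sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds
    value: float      # cumulative counter value at that time


def rate_per_second(earlier: Sample, later: Sample) -> float:
    """Average increase per second between two counter samples."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = later.value - earlier.value
    if delta < 0:
        # Assumption: a drop means the counter was reset and restarted at zero.
        delta = later.value
    return delta / elapsed


def average_seconds_per_operation(time_delta: float, ops_delta: float) -> float:
    """Average time per operation, or 0.0 if no operation completed."""
    return time_delta / ops_delta if ops_delta > 0 else 0.0


# Hypothetical samples of ma_node_disk_read_bytes_total taken 60 s apart.
first = Sample(timestamp=0.0, value=1_000_000)
second = Sample(timestamp=60.0, value=4_000_000)
print(rate_per_second(first, second))  # 50000.0 bytes per second

# Hypothetical deltas: 1.5 s of ma_node_disk_read_time_seconds_total
# spent on 300 reads from ma_node_disk_reads_completed_total.
print(average_seconds_per_operation(1.5, 300))  # about 0.005 s per read
```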

    Table 3 Diagnosis (InfiniBand, collected only in dedicated resource pools)

    Category

    Name

    Metric

    Description

    Unit

    Value Range

    InfiniBand or RoCE network

    PortXmitData

    infiniband_port_xmit_data_total

    Total number of data octets, divided by 4 (that is, counted in 32-bit double words), transmitted on all VLs from the port.

    Total count

    Natural number

    PortRcvData

    infiniband_port_rcv_data_total

    Total number of data octets, divided by 4 (that is, counted in 32-bit double words), received on all VLs at the port.

    Total count

    Natural number

    SymbolErrorCounter

    infiniband_symbol_error_counter_total

    Total number of minor link errors detected on one or more physical lanes.

    Total count

    Natural number

    LinkErrorRecoveryCounter

    infiniband_link_error_recovery_counter_total

    Total number of times the Port Training state machine has successfully completed the link error recovery process.

    Total count

    Natural number

    PortRcvErrors

    infiniband_port_rcv_errors_total

    Total number of packets containing errors that were received on the port including:

    Local physical errors (ICRC, VCRC, LPCRC, and all physical errors that cause entry into the BAD PACKET or BAD PACKET DISCARD states of the packet receiver state machine)

    Malformed data packet errors (LVer, length, VL)

    Malformed link packet errors (operand, length, VL)

    Packets discarded due to buffer overrun (overflow)

    Total count

    Natural number

    LocalLinkIntegrityErrors

    infiniband_local_link_integrity_errors_total

    This counter indicates the number of retries initiated by a link transfer layer receiver.

    Total count

    Natural number

    PortRcvRemotePhysicalErrors

    infiniband_port_rcv_remote_physical_errors_total

    Total number of packets marked with the EBP delimiter received on the port.

    Total count

    Natural number

    PortRcvSwitchRelayErrors

    infiniband_port_rcv_switch_relay_errors_total

    Total number of packets received on the port that were discarded when they could not be forwarded by the switch relay for the following reasons:

    DLID mapping

    VL mapping

    Looping (output port = input port)

    Total count

    Natural number

    PortXmitWait

    infiniband_port_transmit_wait_total

    The number of ticks during which the port had data to transmit but no data was sent during the entire tick (either because of insufficient credits or because of lack of arbitration).

    Total count

    Natural number

    PortXmitDiscards

    infiniband_port_xmit_discards_total

    Total number of outbound packets discarded by the port because the port is down or congested.

    Total count

    Natural number
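PortXmitData and PortRcvData are described as octet counts divided by 4, so a raw counter value is a number of 32-bit double words rather than bytes. The following is a minimal sketch of converting two readings into bytes and an average throughput. It assumes the values are reported in double words as described; if a collector has already converted them to bytes, skip the multiplication. The function names and sample values are hypothetical.

```python
DWORD_BYTES = 4  # one double word is 32 bits, that is, 4 octets


def dwords_to_bytes(dwords: int) -> int:
    """Convert a count of 32-bit double words to bytes."""
    return dwords * DWORD_BYTES


def throughput_bytes_per_second(prev_dwords: int, curr_dwords: int,
                                interval_seconds: float) -> float:
    """Average throughput between two counter readings, in bytes per second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    if curr_dwords < prev_dwords:
        raise ValueError("counter decreased; it may have been reset")
    return dwords_to_bytes(curr_dwords - prev_dwords) / interval_seconds


# Hypothetical readings of infiniband_port_xmit_data_total taken 10 s apart.
print(dwords_to_bytes(250_000))                            # 1000000
print(throughput_bytes_per_second(1_000, 251_000, 10.0))   # 100000.0
```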

    Table 4 Metric names

    Classification

    Dimension

    Description

    Container metrics

    modelarts_service

    Service to which a container belongs, which can be notebook, train, or infer

    instance_name

    Name of the pod to which the container belongs

    service_id

    Instance or job ID displayed on the page, for example, cf55829e-9bd3-48fa-8071-7ae870dae93a for a development environment or 9f322d5a-b1d2-4370-94df-5a87de27d36e for a training job

    node_ip

    IP address of the node to which the container belongs

    container_id

    Container ID

    cid

    Cluster ID

    container_name

    Name of the container

    project_id

    Project ID of the account to which the user belongs

    user_id

    User ID of the account to which the user who submits the job belongs

    pool_id

    ID of a resource pool corresponding to a physical dedicated resource pool

    pool_name

    Name of a resource pool corresponding to a physical dedicated resource pool

    logical_pool_id

    ID of a logical subpool

    logical_pool_name

    Name of a logical subpool

    gpu_uuid

    UUID of the GPU used by the container

    gpu_index

    Index of the GPU used by the container

    gpu_type

    Type of the GPU used by the container

    account_name

    Account name of the creator of a training, inference, or development environment task

    user_name

    Username of the creator of a training, inference, or development environment task

    task_creation_time

    Time when a training, inference, or development environment task is created

    task_name

    Name of a training, inference, or development environment task

    task_spec_code

    Specifications of a training, inference, or development environment task

    cluster_name

    CCE cluster name

    Node metrics

    cid

    ID of the CCE cluster to which the node belongs

    node_ip

    IP address of the node

    host_name

    Hostname of a node

    pool_id

    ID of a resource pool corresponding to a physical dedicated resource pool

    project_id

    Project ID of the user in a physical dedicated resource pool

    gpu_uuid

    UUID of a node GPU

    gpu_index

    Index of a node GPU

    gpu_type

    Type of a node GPU

    device_name

    Device name of an InfiniBand or RoCE network NIC

    port

    Port number of the InfiniBand NIC

    physical_state

    Status of each port on the InfiniBand NIC

    firmware_version

    Firmware version of the InfiniBand NIC

    filesystem

    NFS-mounted file system

    mount_point

    NFS mount point

    Diagnosis

    cid

    ID of the CCE cluster to which the GPU-equipped node belongs

    node_ip

    IP address of the node where the GPU resides

    pool_id

    ID of a resource pool corresponding to a physical dedicated resource pool

    project_id

    Project ID of the user in a physical dedicated resource pool

    gpu_uuid

    GPU UUID

    gpu_index

    Index of a node GPU

    gpu_type

    Type of a node GPU

    device_name

    Name of a network device or disk device

    port

    Port number of the InfiniBand NIC

    physical_state

    Status of each port on the InfiniBand NIC

    firmware_version

    Firmware version of the InfiniBand NIC
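The tags in Table 4 narrow a metric down to a specific service, job, node, or device. The following is a minimal sketch of combining a metric name with tag filters into a single selector string. The Prometheus-style `metric{key="value"}` syntax is an assumption chosen for illustration; this document only describes entering tags on the console and does not state which query syntax AOM accepts. The function name build_selector is hypothetical.

```python
from typing import Mapping


def build_selector(metric: str, dimensions: Mapping[str, str]) -> str:
    """Build a Prometheus-style selector such as metric{key="value"}."""

    def escape(value: str) -> str:
        # Escape backslashes first, then double quotes, inside label values.
        return value.replace("\\", "\\\\").replace('"', '\\"')

    if not dimensions:
        return metric
    # Sort the keys so the output is stable regardless of input order.
    labels = ",".join(
        f'{key}="{escape(value)}"' for key, value in sorted(dimensions.items())
    )
    return f"{metric}{{{labels}}}"


# CPU usage of one training job, using the example job ID from Table 4.
print(build_selector(
    "ma_container_cpu_util",
    {
        "modelarts_service": "train",
        "service_id": "9f322d5a-b1d2-4370-94df-5a87de27d36e",
    },
))
# ma_container_cpu_util{modelarts_service="train",service_id="9f322d5a-b1d2-4370-94df-5a87de27d36e"}
```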