Updated on 2024-04-15 GMT+08:00

Host Metrics and Dimensions

Table 1 Host metrics

| Metric | Description | Value Range | Unit |
| --- | --- | --- | --- |
| Total CPU cores (aom_node_cpu_limit_core) | Total number of CPU cores requested for a measured object | ≥ 1 | Cores |
| Used CPU cores (aom_node_cpu_used_core) | Number of CPU cores used by a measured object | ≥ 0 | Cores |
| CPU usage (aom_node_cpu_usage) | CPU usage of a measured object | 0–100 | % |
| Available physical memory (aom_node_memory_free_megabytes) | Available physical memory of a measured object | ≥ 0 | MB |
| Available virtual memory (aom_node_virtual_memory_free_megabytes) | Available virtual memory of a measured object | ≥ 0 | MB |
| Total GPU memory (aom_node_gpu_memory_free_megabytes) | Total GPU memory of a measured object | > 0 | MB |
| GPU memory usage (aom_node_gpu_memory_usage) | Percentage of the used GPU memory to the total GPU memory | 0–100 | % |
| Used GPU memory (aom_node_gpu_memory_used_megabytes) | GPU memory used by a measured object | ≥ 0 | MB |
| GPU usage (aom_node_gpu_usage) | GPU usage of a measured object | 0–100 | % |
| Total NPU memory (aom_node_npu_memory_free_megabytes) | Total NPU memory of a measured object | > 0 | MB |
| NPU memory usage (aom_node_npu_memory_usage) | Percentage of the used NPU memory to the total NPU memory | 0–100 | % |
| Used NPU memory (aom_node_npu_memory_used_megabytes) | NPU memory used by a measured object | ≥ 0 | MB |
| NPU usage (aom_node_npu_usage) | NPU usage of a measured object | 0–100 | % |
| NPU temperature (aom_node_npu_temperature_centigrade) | NPU temperature of a measured object | - | °C |
| Physical memory usage (aom_node_memory_usage) | Percentage of the used physical memory to the total physical memory | 0–100 | % |
| Host status (aom_node_status) | Host status | 0 or 1 (0: Normal; 1: Abnormal) | N/A |
| NTP offset (aom_node_ntp_offset_ms) | Offset between the local time of the host and the NTP server time. The closer the offset is to 0, the closer the host's local time is to the NTP server time. | - | ms |
| NTP server status (aom_node_ntp_server_status) | Whether the host is connected to the NTP server | 0 or 1 (0: Connected; 1: Not connected) | N/A |
| NTP synchronization status (aom_node_ntp_status) | Whether the local time of the host is synchronized with the NTP server time | 0 or 1 (0: Synchronized; 1: Not synchronized) | N/A |
| Processes (aom_node_process_number) | Number of processes on a measured object | ≥ 0 | N/A |
| GPU temperature (aom_node_gpu_temperature_centigrade) | GPU temperature of a measured object | - | °C |
| Total physical memory (aom_node_memory_total_megabytes) | Total physical memory requested for a measured object | ≥ 0 | MB |
| Total virtual memory (aom_node_virtual_memory_total_megabytes) | Total virtual memory requested for a measured object | ≥ 0 | MB |
| Virtual memory usage (aom_node_virtual_memory_usage) | Percentage of the used virtual memory to the total virtual memory | 0–100 | % |
| Threads (aom_node_current_threads_num) | Number of threads created on a host | ≥ 0 | N/A |
| Max. threads (aom_node_sys_max_threads_num) | Maximum number of threads that can be created on a host | ≥ 0 | N/A |
| Total physical disk space (aom_node_phy_disk_total_capacity_megabytes) | Total disk space of a host | ≥ 0 | MB |
| Used disk space (aom_node_physical_disk_total_used_megabytes) | Used disk space of a host | ≥ 0 | MB |
| Hosts (aom_billing_hostUsed) | Number of hosts connected per day | ≥ 0 | N/A |

  • AOM can collect NPU metrics (total memory, memory usage, used memory, NPU usage, and temperature) only for Ascend Snt9 and D710 hosts.
  • Memory usage = (Physical memory capacity – Available physical memory capacity)/Physical memory capacity
  • Virtual memory usage = ((Physical memory capacity + Total virtual memory capacity) – (Available physical memory capacity + Available virtual memory capacity))/(Physical memory capacity + Total virtual memory capacity)
  • The virtual memory of a VM is 0 MB by default. If no virtual memory is configured, the memory usage on the monitoring page is the same as the virtual memory usage.
  • For the total and used physical disk space, only the file systems of local disk partitions are counted. File systems mounted on the host over the network (such as JuiceFS, NFS, and SMB) are not counted.
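
The two usage formulas in the notes above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not AOM's own implementation: the function and parameter names are hypothetical, all inputs are assumed to be in MB, and the result is scaled by 100 on the assumption that the reported metrics use the 0–100 % range shown in Table 1.

```python
def memory_usage_percent(phys_total_mb: float, phys_free_mb: float) -> float:
    """Memory usage = (physical total - physical available) / physical total."""
    if phys_total_mb <= 0:
        raise ValueError("physical memory capacity must be greater than 0")
    return (phys_total_mb - phys_free_mb) / phys_total_mb * 100.0


def virtual_memory_usage_percent(
    phys_total_mb: float,
    phys_free_mb: float,
    virt_total_mb: float = 0.0,   # a VM has 0 MB of virtual memory by default
    virt_free_mb: float = 0.0,
) -> float:
    """Virtual memory usage over the combined physical + virtual capacity."""
    combined_total = phys_total_mb + virt_total_mb
    combined_free = phys_free_mb + virt_free_mb
    if combined_total <= 0:
        raise ValueError("combined memory capacity must be greater than 0")
    return (combined_total - combined_free) / combined_total * 100.0


# Host with 8192 MB of physical memory, 2048 MB of it available.
print(memory_usage_percent(8192, 2048))                      # 75.0
# No virtual memory configured: both values coincide.
print(virtual_memory_usage_percent(8192, 2048))              # 75.0
# Same host with 4096 MB of unused virtual memory configured.
print(virtual_memory_usage_percent(8192, 2048, 4096, 4096))  # 50.0
```

With the virtual memory terms left at 0, the second formula reduces to the first, which is why the two values match when no virtual memory is configured.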
Table 2 Dimensions of host metrics

| Dimension | Description |
| --- | --- |
| clusterId | Cluster ID |
| clusterName | Cluster name |
| gpuName | GPU name |
| gpuID | GPU ID |
| npuName | NPU name |
| npuID | NPU ID |
| hostID | Host ID |
| nameSpace | Cluster namespace |
| nodeIP | Host IP address |
| hostName | Host name |
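
Dimensions are typically used to narrow a metric down to specific hosts or devices. The sketch below assumes the metrics are queried through a Prometheus-style label selector; that assumption, the helper name build_selector, and the sample values are illustrative and not taken from this page. Only the dimension names come from Table 2.

```python
# Dimension names listed in Table 2.
HOST_DIMENSIONS = {
    "clusterId", "clusterName", "gpuName", "gpuID", "npuName",
    "npuID", "hostID", "nameSpace", "nodeIP", "hostName",
}


def build_selector(metric: str, **dimensions: str) -> str:
    """Build a Prometheus-style selector such as metric{dim="value"}.

    Hypothetical helper: rejects dimension names that are not in Table 2.
    """
    unknown = set(dimensions) - HOST_DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    parts = []
    for name, value in sorted(dimensions.items()):
        # Escape backslashes and double quotes inside the label value.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{name}="{escaped}"')
    if not parts:
        return metric
    return metric + "{" + ",".join(parts) + "}"


# Sample values are made up for illustration.
print(build_selector("aom_node_cpu_usage", clusterName="prod", hostName="node-1"))
# aom_node_cpu_usage{clusterName="prod",hostName="node-1"}
```

Validating dimension names up front catches misspellings such as hostname or namespace, which differ from the listed names only in capitalization.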