Updated on 2024-01-11 GMT+08:00

BMS Hardware Metrics

The following table describes BMS hardware monitoring metrics and how the metrics are collected.

Metrics

Description

Collected by

Server information

Includes the server SN, product name, and manufacturer.

Running the dmidecode command

Solid state drive (SSD) and hard disk drive (HDD) basic information and Self-Monitoring, Analysis and Reporting Technology (SMART) information

Includes basic information (such as the SN, model, capacity, protocol type, and firmware version) and indicators (such as the health status, temperature, number of bad blocks, number of errors, and number of failures) in the SMART log of the SSD and HDD.

Running the smartctl -a <Drive device name> command

Basic information about the Non-Volatile Memory Express (NVMe) SSD

Includes SN, model, capacity, and firmware version.

Running the nvme list command

Standard SMART information of the NVMe SSD

Includes indicators in the SMART log of the NVMe SSD (such as the health status, temperature, service life, number of errors, and number of failures).

Running the nvme smart-log <NVMe device name> command

Additional SMART information of the Huawei NVMe SSD

Includes more detailed indicators and counts (such as power consumption, capacitor status, the number of bad blocks, and the counts of different error types).

Running the hioadm info -d <NVMe device name> -a and hioadm info -d <NVMe device name> -e commands

Additional SMART information of the Intel NVMe SSD

Includes more detailed error counts.

Running the nvme intel smart-log-add <NVMe device name> command

Network interface status information

Includes the MAC address, link status, and the numbers of dropped and errored packets in the receive and transmit directions.

Running the ifconfig <Network interface name> command

Network port device information

Includes the port type, link status, and link speed.

Running the ethtool <Network interface name> command

Network interface driver information

Includes the firmware version, driver version, and bus number.

Running the ethtool -i <Network interface name> command

Optical module information

Includes basic device information (such as the SN, manufacturer, production date, connector type, encoding mode, and bandwidth) and device status information (such as bias current, input power, output power, voltage, and temperature).

Running the ethtool -m <Network interface name> command

Number of Huawei Intelligent NIC (HiNIC) port errors

Includes the numbers of HiLink errors, Base encoding errors, and RS encoding errors.

Running the hinicadm hilink_port -i <dev_id> -p <port_id> -s and hinicadm hilink_count -i <dev_id> -p <port_id> commands

HiNIC card working mode

Includes the current working mode and the configured working mode.

Running the hinicadm mode -i <dev_id> command

HiNIC card core temperature

Temperature of the HiNIC card core

Running the hinicadm temperature -i <dev_id> command

HiNIC card event records

Includes HiNIC card heartbeat losses, PCIe exceptions, chip errors, and chip health status.

Running the hinicadm event -i <dev_id> command

PCIe errors of the HiNIC card

Includes the different types of PCIe errors reported by the HiNIC card.

Running the hinicadm counter -i <dev_id> -t 4 command

Memory information

Includes the DIMM SN, manufacturer, part number (PN), bit width, capacity, and frequency.

Running the dmidecode -t 17 command

CPU information

Includes the CPU ID, name, frequency, architecture, and model.

Running the dmidecode -t 4 and lscpu commands

Memory error records

Memory correctable error (CE) and uncorrectable error (UCE) records, including the error type, fault code, error location information (chip, rank, bank, column, and row), and the contents of the MCi_ADDR, MCi_MISC, MCG_CAP, MCG_STATUS, retry, and other registers.

Reading files such as /dev/mem, /dev/cpu/<core_id>/msr, and /sys/firmware/acpi/tables/HEST to collect memory error records and chip register information
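
Several of the commands in the table, such as ethtool -i and lscpu, print one "key: value" pair per line. The following Python sketch shows one way such output could be turned into a dictionary for further processing. It is only an illustration, not the collection logic used by Cloud Eye: the function names parse_key_value and run_and_parse are hypothetical, and the sample text contains made-up values.

```python
import subprocess


def parse_key_value(text):
    """Parse 'key: value' lines (the layout printed by ethtool -i and lscpu) into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as PCI bus addresses stay intact.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


def run_and_parse(command):
    """Run a command given as an argument list and parse its standard output.

    Most commands in the table need root privileges and the matching tool installed.
    """
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return parse_key_value(result.stdout)


# Illustrative sample in the layout of `ethtool -i <Network interface name>`.
# The values are made up for this example.
SAMPLE_DRIVER_INFO = """\
driver: hinic
version: 2.3.2.1
firmware-version: 2.4.1.0
bus-info: 0000:05:00.0
"""

driver_info = parse_key_value(SAMPLE_DRIVER_INFO)
print(driver_info["driver"], driver_info["bus-info"])
```

On a real server, a call such as run_and_parse(["ethtool", "-i", "eth0"]) would collect the driver information directly, where eth0 stands in for the actual network interface name. Commands with table-style or multi-section output, such as dmidecode and smartctl -a, need their own parsing logic.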