Using CES to Monitor Lite Server Resources

Updated on 2025-02-13 GMT+08:00

Scenario

Lite Server resources are monitored through Cloud Eye Service (CES). This section describes how to connect Lite Server to CES so that you can monitor its resources and events.

Overview

For details about how CES monitors bare metal servers, see BMS Overview. In addition to the OS images listed in that document, Ubuntu 20.04 is also supported.

Metrics are sampled every minute. The metrics currently cover CPU, memory, disk, and network. Accelerator card metrics are collected only after the accelerator card driver is installed on the host. Table 1 lists only the NPU-related metrics. For other metrics, see Metrics Supported by the Agent.

Table 1 NPU metrics

| No. | Category | Metric | Display Name | Description | Unit | Value Range | Dimension | Supported Model |
|---|---|---|---|---|---|---|---|---|
| 1 | Overall | npu_device_health | NPU Health Status | Health status of the NPU | - | 0: normal; 1: minor alarm; 2: major alarm; 3: critical alarm | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 2 | Overall | npu_driver_health | NPU Driver Health Status | Health status of the NPU driver | - | 0: normal; 3: critical alarm | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 3 | Overall | npu_power | NPU Power | NPU power | W | >0 | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 4 | Overall | npu_temperature | NPU Temperature | NPU temperature | °C | Natural number | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 5 | Overall | npu_voltage | NPU Voltage | NPU voltage | V | Natural number | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 6 | HBM | npu_util_rate_hbm | NPU HBM Usage | HBM usage of the NPU | % | 0%–100% | instance_id, npu | Snt9B, Snt9C |
| 7 | HBM | npu_hbm_freq | HBM Frequency | NPU HBM frequency | MHz | >0 | instance_id, npu | Snt9B, Snt9C |
| 8 | HBM | npu_hbm_usage | HBM Usage | NPU HBM usage | MB | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 9 | HBM | npu_hbm_temperature | HBM Temperature | NPU HBM temperature | °C | Natural number | instance_id, npu | Snt9B, Snt9C |
| 10 | HBM | npu_hbm_bandwidth_util | HBM Bandwidth Usage | NPU HBM bandwidth usage | % | 0%–100% | instance_id, npu | Snt9B, Snt9C |
| 11 | HBM | npu_hbm_mem_capacity | NPU HBM Memory Capacity | HBM memory capacity of the NPU | MB | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 12 | HBM | npu_hbm_ecc_enable | HBM ECC Status | NPU HBM ECC status | - | 0: ECC detection is disabled. 1: ECC detection is enabled. | instance_id, npu | Snt9B, Snt9C |
| 13 | HBM | npu_hbm_single_bit_error_cnt | Single-bit Errors on HBM | Current number of single-bit errors on the NPU HBM | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 14 | HBM | npu_hbm_double_bit_error_cnt | Double-bit Errors on HBM | Current number of double-bit errors on the NPU HBM | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 15 | HBM | npu_hbm_total_single_bit_error_cnt | Single-bit Errors in HBM Lifecycle | Number of single-bit errors in the NPU HBM lifecycle | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 16 | HBM | npu_hbm_total_double_bit_error_cnt | Double-bit Errors in HBM Lifecycle | Number of double-bit errors in the NPU HBM lifecycle | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 17 | HBM | npu_hbm_single_bit_isolated_pages_cnt | Isolated NPU Memory Pages with HBM Single-bit Errors | Number of isolated NPU memory pages with HBM single-bit errors | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 18 | HBM | npu_hbm_double_bit_isolated_pages_cnt | Isolated NPU Memory Pages with HBM Multi-bit Errors | Number of isolated NPU memory pages with HBM double-bit errors | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 19 | DDR | npu_usage_mem | Used NPU Memory | Used NPU memory | MB | ≥0 | instance_id, npu | Snt3P, 300IDuo |
| 20 | DDR | npu_util_rate_mem | NPU Memory Usage | NPU memory usage | % | 0%–100% | instance_id, npu | Snt3P, 300IDuo |
| 21 | DDR | npu_freq_mem | NPU Memory Frequency | NPU memory frequency | MHz | >0 | instance_id, npu | Snt3P, 300IDuo |
| 22 | DDR | npu_util_rate_mem_bandwidth | NPU Memory Bandwidth Usage | NPU memory bandwidth usage | % | 0%–100% | instance_id, npu | Snt3P, 300IDuo |
| 23 | DDR | npu_sbe | NPU Single-bit Errors | Number of single-bit errors on the NPU | count | ≥0 | instance_id, npu | Snt3P, 300IDuo |
| 24 | DDR | npu_dbe | NPU Double-bit Errors | Number of double-bit errors on the NPU | count | ≥0 | instance_id, npu | Snt3P, 300IDuo |
| 25 | AI Core | npu_freq_ai_core | AI Core Frequency of the NPU | AI core frequency of the NPU | MHz | >0 | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 26 | AI Core | npu_freq_ai_core_rated | Rated Frequency of the NPU AI Core | Rated frequency of the NPU AI core | MHz | >0 | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 27 | AI Core | npu_util_rate_ai_core | AI Core Usage of the NPU | AI core usage of the NPU | % | 0%–100% | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 28 | AI CPU | npu_aicpu_num | AI CPUs of the NPU | Number of AI CPUs of the NPU | count | ≥0 | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 29 | AI CPU | npu_util_rate_ai_cpu | AI CPU Usage of the NPU | AI CPU usage of the NPU | % | 0%–100% | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 30 | AI CPU | npu_aicpu_avg_util_rate | Average AI CPU Usage of the NPU | Average AICPU usage of the NPU | % | 0%–100% | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 31 | AI CPU | npu_aicpu_max_freq | Maximum AI CPU Frequency of the NPU | Maximum AI CPU frequency of the NPU | MHz | >0 | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 32 | AI CPU | npu_aicpu_cur_freq | AI CPU Frequency of the NPU | AI CPU frequency of the NPU | MHz | >0 | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 33 | CTRL CPU | npu_util_rate_ctrl_cpu | Control CPU Usage of the NPU | Control CPU usage of the NPU | % | 0%–100% | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 34 | CTRL CPU | npu_freq_ctrl_cpu | Control CPU Frequency of the NPU | Control CPU frequency of the NPU | MHz | >0 | instance_id, npu | Snt3P, 300IDuo, Snt9B, Snt9C |
| 35 | PCIe link | npu_link_cap_speed | Max. NPU Link Speed | Maximum link speed of the NPU | GT/s | ≥0 | instance_id, npu | 310P, 300IDuo, Snt9B, Snt9C |
| 36 | PCIe link | npu_link_cap_width | Max. NPU Link Width | Maximum link width of the NPU | count | ≥0 | instance_id, npu | 310P, 300IDuo, Snt9B, Snt9C |
| 37 | PCIe link | npu_link_status_speed | NPU Link Speed | Link speed of the NPU | GT/s | ≥0 | instance_id, npu | 310P, 300IDuo, Snt9B, Snt9C |
| 38 | PCIe link | npu_link_status_width | NPU Link Width | Link width of the NPU | count | ≥0 | instance_id, npu | 310P, 300IDuo, Snt9B, Snt9C |
| 39 | RoCE network | npu_device_network_health | NPU Network Health Status | Connectivity of the IP address of the RoCE NIC on the NPU | - | 0: The network health status is normal. Other values: The network status is abnormal. | instance_id, npu | Snt9B, Snt9C |
| 40 | RoCE network | npu_network_port_link_status | NPU Network Port Link Status | Link status of the NPU network port | - | 0: up; 1: down | instance_id, npu | Snt9B, Snt9C |
| 41 | RoCE network | npu_roce_tx_rate | NPU NIC Uplink Rate | Uplink rate of the NPU NIC | MB/s | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 42 | RoCE network | npu_roce_rx_rate | NPU NIC Downlink Rate | Downlink rate of the NPU NIC | MB/s | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 43 | RoCE network | npu_mac_tx_mac_pause_num | PAUSE Frames Sent from MAC | Total number of PAUSE frames sent from the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 44 | RoCE network | npu_mac_rx_mac_pause_num | PAUSE Frames Received by MAC | Total number of PAUSE frames received by the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 45 | RoCE network | npu_mac_tx_pfc_pkt_num | PFC Frames Sent from MAC | Total number of PFC frames sent from the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 46 | RoCE network | npu_mac_rx_pfc_pkt_num | PFC Frames Received by MAC | Total number of PFC frames received by the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 47 | RoCE network | npu_mac_tx_bad_pkt_num | Bad Packets Sent from MAC | Total number of bad packets sent from the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 48 | RoCE network | npu_mac_rx_bad_pkt_num | Bad Packets Received by MAC | Total number of bad packets received by the MAC address corresponding to the NPU | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 49 | RoCE network | npu_roce_tx_err_pkt_num | Bad Packets Sent by RoCE | Total number of bad packets sent by the RoCE NIC on the NPU | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 50 | RoCE network | npu_roce_rx_err_pkt_num | Bad Packets Received by RoCE | Total number of bad packets received by the RoCE NIC on the NPU | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 51 | RoCE optical module | npu_opt_temperature | NPU Optical Module Temperature | NPU optical module temperature | °C | Natural number | instance_id, npu | Snt9B, Snt9C |
| 52 | RoCE optical module | npu_opt_temperature_high_thres | Upper Limit of the NPU Optical Module Temperature | Upper limit of the NPU optical module temperature | °C | Natural number | instance_id, npu | Snt9B, Snt9C |
| 53 | RoCE optical module | npu_opt_temperature_low_thres | Lower Limit of the NPU Optical Module Temperature | Lower limit of the NPU optical module temperature | °C | Natural number | instance_id, npu | Snt9B, Snt9C |
| 54 | RoCE optical module | npu_opt_voltage | NPU Optical Module Voltage | NPU optical module voltage | mV | Natural number | instance_id, npu | Snt9B, Snt9C |
| 55 | RoCE optical module | npu_opt_voltage_high_thres | Upper Limit of the NPU Optical Module Voltage | Upper limit of the NPU optical module voltage | mV | Natural number | instance_id, npu | Snt9B, Snt9C |
| 56 | RoCE optical module | npu_opt_voltage_low_thres | Lower Limit of the NPU Optical Module Voltage | Lower limit of the NPU optical module voltage | mV | Natural number | instance_id, npu | Snt9B, Snt9C |
| 57 | RoCE optical module | npu_opt_tx_power_lane0 | TX Power of the NPU Optical Module in Channel 0 | Transmit power of the NPU optical module in channel 0 | mW | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 58 | RoCE optical module | npu_opt_tx_power_lane1 | TX Power of the NPU Optical Module in Channel 1 | Transmit power of the NPU optical module in channel 1 | mW | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 59 | RoCE optical module | npu_opt_tx_power_lane2 | TX Power of the NPU Optical Module in Channel 2 | Transmit power of the NPU optical module in channel 2 | mW | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 60 | RoCE optical module | npu_opt_tx_power_lane3 | TX Power of the NPU Optical Module in Channel 3 | Transmit power of the NPU optical module in channel 3 | mW | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 61 | RoCE optical module | npu_opt_rx_power_lane0 | RX Power of the NPU Optical Module in Channel 0 | Receive power of the NPU optical module in channel 0 | mW | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 62 | RoCE optical module | npu_opt_rx_power_lane1 | RX Power of the NPU Optical Module in Channel 1 | Receive power of the NPU optical module in channel 1 | mW | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 63 | RoCE optical module | npu_opt_rx_power_lane2 | RX Power of the NPU Optical Module in Channel 2 | Receive power of the NPU optical module in channel 2 | mW | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 64 | RoCE optical module | npu_opt_rx_power_lane3 | RX Power of the NPU Optical Module in Channel 3 | Receive power of the NPU optical module in channel 3 | mW | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 65 | RoCE optical module | npu_opt_tx_bias_lane0 | TX Bias Current of the NPU Optical Module in Channel 0 | Transmitted bias current of the NPU optical module in channel 0 | mA | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 66 | RoCE optical module | npu_opt_tx_bias_lane1 | TX Bias Current of the NPU Optical Module in Channel 1 | Transmitted bias current of the NPU optical module in channel 1 | mA | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 67 | RoCE optical module | npu_opt_tx_bias_lane2 | TX Bias Current of the NPU Optical Module in Channel 2 | Transmitted bias current of the NPU optical module in channel 2 | mA | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 68 | RoCE optical module | npu_opt_tx_bias_lane3 | TX Bias Current of the NPU Optical Module in Channel 3 | Transmitted bias current of the NPU optical module in channel 3 | mA | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 69 | RoCE optical module | npu_opt_tx_los | TX Los of the NPU Optical Module | TX Los flag of the NPU optical module | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
| 70 | RoCE optical module | npu_opt_rx_los | RX Los of the NPU Optical Module | RX Los flag of the NPU optical module | count | ≥0 | instance_id, npu | Snt9B, Snt9C |
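
Several metrics in Table 1 report an enumerated code rather than a measurement. The following is a minimal sketch of how a script might turn those codes into readable labels. The function name npu_metric_label is hypothetical and not part of CES or the agent; the code-to-label mapping is taken directly from the Value Range column of Table 1, and any code the table does not list is reported as "unknown".

```shell
#!/bin/sh
# Hypothetical helper (not part of CES): translate the enumerated
# status codes from Table 1 into the labels given in its Value Range column.
# Usage: npu_metric_label <metric_name> <value>
npu_metric_label() {
  case "$1:$2" in
    npu_device_health:0|npu_driver_health:0) echo "normal" ;;
    npu_device_health:1) echo "minor alarm" ;;
    npu_device_health:2) echo "major alarm" ;;
    npu_device_health:3|npu_driver_health:3) echo "critical alarm" ;;
    npu_hbm_ecc_enable:0) echo "ECC detection disabled" ;;
    npu_hbm_ecc_enable:1) echo "ECC detection enabled" ;;
    npu_device_network_health:0) echo "normal" ;;
    # Table 1 treats every non-zero value of this metric as abnormal.
    npu_device_network_health:*) echo "abnormal" ;;
    npu_network_port_link_status:0) echo "up" ;;
    npu_network_port_link_status:1) echo "down" ;;
    # Metric is not an enumerated one, or the code is not listed in Table 1.
    *) echo "unknown" ;;
  esac
}
```

For example, `npu_metric_label npu_device_health 2` prints `major alarm`, which is the kind of value you would typically alert on.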

Supported Events

You can use CES to centrally collect key events and cloud resource operational events. When an event occurs, you receive an alarm. Lite Server mainly supports events from BMS, which are listed in Table 2.

Table 2 Events supported by Lite Server

| Event Source | Namespace | Event | Event ID | Event Severity | Description | Solution | Impact | Supported Model |
|---|---|---|---|---|---|---|---|---|
| BMS | SYS.BMS | NPU: device not found by npu-smi info | NPUSMICardNotFound | Major | The Ascend driver is faulty or the NPU is disconnected. | Contact O&M engineers. | The NPU cannot be used normally. | Snt3P, 300IDuo, Snt9B, Snt9C |
| BMS | SYS.BMS | NPU: PCIe link error | PCIeErrorFound | Major | The lspci command output shows that the NPU is in the rev ff state. | Contact O&M engineers. | The NPU cannot be used normally. | Snt3P, 300IDuo, Snt9B, Snt9C |
| BMS | SYS.BMS | NPU: device not found by lspci | LspciCardNotFound | Major | The NPU is disconnected. | Contact O&M engineers. | The NPU cannot be used normally. | Snt3P, 300IDuo, Snt9B, Snt9C |
| BMS | SYS.BMS | NPU: overtemperature | TemperatureOverUpperLimit | Major | The temperature of DDR or software is too high. | Stop services, restart the BMS, check the heat dissipation system, and reset the devices. | The instance may be powered off and devices may not be found. | Snt3P, 300IDuo, Snt9B, Snt9C |
| BMS | SYS.BMS | NPU: uncorrectable ECC error | UncorrectableEccErrorWarning | Major | There are uncorrectable ECC errors on the NPU. | If services are affected, replace the NPU with another one. | Services may be interrupted. | Snt3P, 300IDuo |
| BMS | SYS.BMS | NPU: request for instance restart | RebootVirtualMachine | Suggestion | A fault occurs and the BMS needs to be restarted. | Collect the fault information, and restart the BMS. | Services may be interrupted. | Snt3P, 300IDuo, Snt9B, Snt9C |
| BMS | SYS.BMS | NPU: request for SoC reset | ResetSOC | Suggestion | A fault occurs and the SoC needs to be reset. | Collect the fault information, and reset the SoC. | Services may be interrupted. | Snt3P, 300IDuo, Snt9B, Snt9C |
| BMS | SYS.BMS | NPU: request for restart AI process | RestartAIProcess | Suggestion | A fault occurs and the AI process needs to be restarted. | Collect the fault information, and restart the AI process. | The current AI task will be interrupted. | Snt3P, 300IDuo, Snt9B, Snt9C |
| BMS | SYS.BMS | NPU: error codes | NPUErrorCodeWarning | Major | A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes. | Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition. | Services may be interrupted. | Snt3P, 300IDuo, Snt9B, Snt9C |
| BMS | SYS.BMS | Multiple NPU HBM ECC errors | NpuHbmMultiEccInfo | Suggestion | There are NPU HBM ECC errors. | This event is only a reference for other events. You do not need to handle it separately. | This event is only a reference for other events. You do not need to handle it separately. | Snt9B, Snt9C |
| BMS | SYS.BMS | GPU: invalid RoCE NIC configuration | GpuRoceNicConfigIncorrect | Major | GPU: invalid RoCE NIC configuration | Contact O&M engineers. | The parameter plane network is abnormal, preventing the execution of the multi-node task. | GPU |
| BMS | SYS.BMS | ReadOnly issues in OS | ReadOnlyFileSystem | Critical | The file system %s is read-only. | Check the disk health status. | The files cannot be written or operated. | - |
| BMS | SYS.BMS | NPU: driver and firmware not matching | NpuDriverFirmwareMismatch | Major | The NPU's driver and firmware do not match. | Obtain the matched version from the Ascend official website and reinstall it. | NPUs cannot be used. | Snt3P, 300IDuo, Snt9B, Snt9C |
| BMS | SYS.BMS | NPU: Docker container environment check | NpuContainerEnvSystem | Major | Docker unavailable | Check if the Docker software is normal. | Docker cannot be used. | - |
| BMS | SYS.BMS | NPU: Docker container environment check | NpuContainerEnvSystem | Major | The container plug-in Ascend-Docker-Runtime is not installed. | Install the container plug-in Ascend-Docker-Runtime. Otherwise, the container cannot use Ascend cards. | NPUs cannot be mounted to Docker containers. | Snt3P, 300IDuo, Snt9B, Snt9C |
| BMS | SYS.BMS | NPU: Docker container environment check | NpuContainerEnvSystem | Major | IP forwarding is not enabled in the OS. | Check the net.ipv4.ip_forward configuration in the /etc/sysctl.conf file. | Docker containers experience network communication issues. | - |
| BMS | SYS.BMS | NPU: Docker container environment check | NpuContainerEnvSystem | Major | The shared memory of the container is too small. | The default shared memory is 64 MB, which can be modified as needed. Method 1: Modify the default-shm-size field in the /etc/docker/daemon.json configuration file. Method 2: Use the --shm-size parameter in the docker run command to set the shared memory size of a container. | Distributed training failed due to insufficient shared memory. | - |
| BMS | SYS.BMS | NPU: RoCE NIC down | RoCELinkStatusDown | Major | The RoCE link of NPU card %d is down. | Check the NPU RoCE network port status. | The NPU NIC is unavailable. | Snt9B, Snt9C |
| BMS | SYS.BMS | NPU: RoCE NIC health status abnormal | RoCEHealthStatusError | Major | The RoCE network health status of NPU %d is abnormal. | Check the health status of the NPU RoCE NIC. | The NPU NIC is unavailable. | Snt9B, Snt9C |
| BMS | SYS.BMS | NPU: RoCE NIC configuration file /etc/hccn.conf not exist | HccnConfNotExisted | Major | The RoCE NIC configuration file /etc/hccn.conf does not exist. | Check the /etc/hccn.conf NIC configuration file. | The RoCE NIC is unavailable. | Snt9B, Snt9C |
| BMS | SYS.BMS | GPU: basic components abnormal | GpuEnvironmentSystem | Major | The nvidia-smi command is abnormal. | Check if the GPU driver is normal. | The GPU driver is unavailable. | GPU |
| BMS | SYS.BMS | GPU: basic components abnormal | GpuEnvironmentSystem | Major | The nvidia-fabricmanager version is inconsistent with the GPU driver version. | Check the GPU driver version and nvidia-fabricmanager version. | The nvidia-fabricmanager cannot work properly, affecting GPU usage. | GPU |
| BMS | SYS.BMS | GPU: basic components abnormal | GpuEnvironmentSystem | Major | The container plug-in nvidia-container-toolkit is not installed. | Install the container plug-in nvidia-container-toolkit. | GPUs cannot be mounted to Docker containers. | GPU |
| BMS | SYS.BMS | Local disk mounting inspection | MountDiskSystem | Major | The /etc/fstab file contains invalid UUIDs. | Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to be restarted. | The disk mounting process fails, preventing the server from restarting. | - |
| BMS | SYS.BMS | GPU: incorrectly configured dynamic route for Ant series servers | GpuRouteConfigError | Major | The dynamic route of the NIC %s of an Ant series server is not configured or is incorrectly configured. CMD [ip route]: %s \| CMD [ip route show table all]: %s. | Configure the RoCE NIC route correctly. | The NPU network communication is abnormal. | GPU |
| BMS | SYS.BMS | NPU: RoCE port not split | RoCEUdpConfigError | Major | The RoCE UDP port is not split. | Check the RoCE UDP port configuration on the NPU. | The communication performance of NPUs is affected. | Snt9B, Snt9C |
| BMS | SYS.BMS | Warning of automatic system kernel upgrade | KernelUpgradeWarning | Major | Warning of automatic system kernel upgrade. Old version: %s; new version: %s. | System kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from restarting. | The AI software may be unavailable. | Snt3P, 300IDuo, Snt9B, Snt9C |
| BMS | SYS.BMS | NPU environment command detection | NpuToolsWarning | Major | The hccn_tool is unavailable. | Check if the NPU driver is normal. | The IP address and gateway of the RoCE NIC cannot be configured. | Snt9B, Snt9C |
| BMS | SYS.BMS | NPU environment command detection | NpuToolsWarning | Major | The npu-smi is unavailable. | Check if the NPU driver is normal. | NPUs cannot be used. | Snt3P, 300IDuo, Snt9B, Snt9C |
| BMS | SYS.BMS | NPU environment command detection | NpuToolsWarning | Major | The ascend-dmi is unavailable. | Check if ToolBox is properly installed. | ascend-dmi cannot be used for performance analysis. | Snt9B, Snt9C |
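
Two of the NpuContainerEnvSystem conditions in Table 2 (IP forwarding disabled, shared memory too small) can be checked by hand. The sketch below is one possible way to do that and is not how the CES agent itself performs the check. The function names ip_forward_setting and shm_size_kb are hypothetical. The first parses a sysctl-style file such as /etc/sysctl.conf; the second reads the size of a mounted file system with df, which is assumed to be /dev/shm when run inside a container.

```shell
#!/bin/sh
# Hypothetical helpers for manual checks; not part of CES or Docker.

# Print the last value assigned to net.ipv4.ip_forward in a sysctl-style
# file, or "unset" if the key is absent. Comment lines (# or ;) are skipped.
# Usage: ip_forward_setting /etc/sysctl.conf
ip_forward_setting() {
  awk -F= '
    /^[ \t]*[#;]/ { next }
    {
      key = $1
      gsub(/[ \t]/, "", key)
    }
    key == "net.ipv4.ip_forward" {
      val = $2
      gsub(/[ \t]/, "", val)
      last = val
    }
    END { if (last != "") print last; else print "unset" }
  ' "$1"
}

# Print the total size, in KB, of the file system mounted at the given
# path (default: /dev/shm). 64 MB corresponds to 65536 KB.
# Usage: shm_size_kb            (run inside the container)
shm_size_kb() {
  df -Pk "${1:-/dev/shm}" | awk 'NR == 2 { print $2 }'
}
```

A result of `0` or `unset` from `ip_forward_setting /etc/sysctl.conf` suggests that forwarding is not enabled persistently; the value currently in effect can be read from /proc/sys/net/ipv4/ip_forward. If `shm_size_kb` reports 65536 inside a container, the container is still using the 64 MB default mentioned in Table 2.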

Installing CES Agent Monitoring Plug-ins

  1. Create an agency for CES. For details, see Creating a User and Granting Permissions.
  2. One-click agent installation is currently not supported on the CES console. Log in to the server and run the following command to install and configure the agent. The command downloads the installer for the cn-north-4 region; for other regions, see Installing the Agent on a Linux Server.

    cd /usr/local && curl -k -O https://obs.cn-north-4.myhuaweicloud.com/uniagent-cn-north-4/script/agent_install.sh && bash agent_install.sh

    If the following information is displayed, the installation is successful.

    Figure 1 Installation succeeded

  3. On the Cloud Eye console, choose Service Monitoring > Bare Metal Server to view the monitoring items. Accelerator card monitoring items are only available after the accelerator card driver is installed on the host.

    Figure 2 Monitoring page

    The monitoring plug-in is now installed. You can view the collected metrics on the UI or configure alarms based on the metric values.
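
The one-line install command in step 2 downloads the script and runs it in a single chain. As a more defensive variant, the following sketch separates the download from the execution and refuses to run a failed or empty download. The function name install_ces_agent and the variable CES_AGENT_URL are hypothetical; the URL is the cn-north-4 one from step 2. The -k option is kept because the original command uses it, but note that it disables TLS certificate verification, so consider dropping it if the certificate validates in your environment.

```shell
#!/bin/sh
# Hypothetical wrapper around the install command from step 2.
# URL copied from the original command (cn-north-4 region).
CES_AGENT_URL="https://obs.cn-north-4.myhuaweicloud.com/uniagent-cn-north-4/script/agent_install.sh"

# Usage: install_ces_agent [work_dir]    (work_dir defaults to /usr/local)
install_ces_agent() {
  (
    # Run in a subshell so the caller's working directory is unchanged.
    cd "${1:-/usr/local}" || exit 1
    # --fail makes curl exit non-zero on an HTTP error instead of
    # saving the error page as the script.
    curl -k --fail --silent --show-error \
      -o agent_install.sh "$CES_AGENT_URL" || exit 1
    # Refuse to execute an empty download.
    [ -s agent_install.sh ] || exit 1
    bash agent_install.sh
  )
}

# Example (requires root and network access):
#   install_ces_agent /usr/local
```

The function returns a non-zero status if the directory is missing, the download fails, or the installer itself exits with an error, which makes it easier to use from provisioning scripts.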
