Using Cloud Eye to Monitor Lite Server NPU Events
Scenario
You can use Cloud Eye to centrally collect key events and cloud resource operational events. When an event occurs, you will receive an alarm. Lite Server mainly supports BMS and ECS events. The table below lists NPU-related events. For details about other events, see Events Supported by Event Monitoring.
Constraints
- Reporting events to Cloud Eye requires the Cloud Eye Agent plug-in, which is subject to strict resource usage limits. If the Agent's resource usage exceeds the threshold, its circuit breaker is triggered. For details about the resource usage, see Cloud Eye Server Monitoring.
- The monitoring agent has been fully tested on the public images provided by Lite Server. If you use your own image, test the agent on that image before deploying it in a production environment to prevent incorrect monitoring information.
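When testing a custom image, one possible starting point is to confirm that the command-line tools named in the event table below (npu-smi, hccn_tool, ascend-dmi, and lspci) can be found. The following is a minimal sketch, not an official check: the helper name missing_tools and the tool list are illustrative assumptions, and a tool installed outside PATH would be reported as missing even though it exists.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Tools mentioned in the event table; which ones matter depends on the server model.
# This list is an illustrative assumption, not an official requirement.
NPU_TOOLS = ["npu-smi", "hccn_tool", "ascend-dmi", "lspci"]


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be located on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(NPU_TOOLS)
    print("missing tools:", ", ".join(missing) if missing else "none")
```

The lookup function is passed in as a parameter so that the check can be exercised on a machine without NPU tooling installed.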
Prerequisites
The Cloud Eye Agent has been installed on the Lite Server. For details about how to check whether the Cloud Eye Agent is installed and how to install it, see Installing Cloud Eye Agent Monitoring Plug-ins.
Event Source
BMS/ECS
Event Namespace
SYS.BMS/SYS.ECS
Event List
| Event | Event ID | Event Severity | Description | Solution | Impact | Supported Model | Supported Cloud Eye Agent Version |
|---|---|---|---|---|---|---|---|
| NPU: device not found by npu-smi info | NPUSMICardNotFound | Major | The Ascend driver is faulty or the NPU is disconnected. | Contact O&M engineers. | The NPU cannot be used normally. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: PCIe link error | PCIeErrorFound | Major | The lspci command output shows that the NPU is in the rev ff state. | Contact O&M engineers. | The NPU cannot be used normally. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: device not found by lspci | LspciCardNotFound | Major | The NPU is disconnected. | Contact O&M engineers. | The NPU cannot be used normally. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: overtemperature | TemperatureOverUpperLimit | Major | The temperature of DDR or software is too high. | Stop services, restart the system, check the heat dissipation system, and reset the device. | The instance may be powered off and devices may not be found. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: uncorrectable ECC error | UncorrectableEccErrorCount | Major | There are uncorrectable ECC errors on the NPU. | If services are affected, replace the NPU with another one. | Services may be interrupted. | Snt3P, Snt3PD | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: request for instance restart | RebootVirtualMachine | Suggestion | A fault occurs and the instance needs to be restarted. | Collect the fault information, and restart the instance. | Services may be interrupted. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: request for SoC reset | ResetSOC | Suggestion | A fault occurs and the SoC needs to be reset. | Collect the fault information, and reset the SoC. | Services may be interrupted. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: request for restart AI process | RestartAIProcess | Suggestion | A fault occurs and the AI process needs to be restarted. | Collect the fault information, and restart the AI process. | The current AI task will be interrupted. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| NPU: error codes | NPUErrorCodeWarning | Major | A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes. | Locate the faults according to the Black Box Error Code Information List and Health Management Error Definition. | Services may be interrupted. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.4.3, 2.7.5.3, 2.7.5.4, 2.7.5.9 or later |
| Multiple NPU HBM ECC errors | NpuHbmMultiEccInfo | Suggestion | There are NPU HBM ECC errors. | This event is only a reference for other events. You do not need to handle it separately. | This event is only a reference for other events. You do not need to handle it separately. | Snt9b, Snt9b23 | telescope: 2.7.5.9 or later |
| GPU: invalid RoCE NIC configuration | GpuRoceNicConfigIncorrect | Major | The RoCE NIC configuration is invalid. | Contact O&M engineers. | The parameter plane network is abnormal, preventing multi-node tasks from running. | GPU | telescope: 2.7.5.9 or later |
| ReadOnly issues in OS | ReadOnlyFileSystem | Critical | The file system %s is read-only. | Check the disk health status. | Files cannot be written or modified. | - | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU: driver and firmware not matching | NpuDriverFirmwareMismatch | Major | The NPU's driver and firmware do not match. | Obtain matching driver and firmware versions from the Ascend official website and reinstall them. | NPUs cannot be used. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU: RoCE NIC down | RoCELinkStatusDown | Major | The RoCE link of NPU %d is down. | Check the NPU RoCE network port status. | The NPU NIC is unavailable. | Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU: RoCE NIC health status abnormal | RoCEHealthStatusError | Major | The RoCE network health status of NPU %d is abnormal. | Check the health status of the NPU RoCE NIC. | The NPU NIC is unavailable. | Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU: RoCE NIC configuration file /etc/hccn.conf not exist | HccnConfNotExisted | Major | The RoCE NIC configuration file /etc/hccn.conf does not exist. | Check the /etc/hccn.conf NIC configuration file. | The RoCE NIC is unavailable. | Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| GPU: basic components abnormal | GpuEnvironmentSystem | Major | The nvidia-smi command is abnormal. | Check whether the GPU driver is normal. | The GPU driver is unavailable. | GPU | telescope: 2.7.5.3, 2.7.5.9 or later |
| GPU: basic components abnormal | GpuEnvironmentSystem | Major | The nvidia-fabricmanager version is inconsistent with the GPU driver version. | Check the GPU driver version and nvidia-fabricmanager version. | The nvidia-fabricmanager cannot work properly, affecting GPU usage. | GPU | telescope: 2.7.5.3, 2.7.5.9 or later |
| GPU: basic components abnormal | GpuEnvironmentSystem | Major | The container plugin nvidia-container-toolkit is not installed. | Install the container plugin nvidia-container-toolkit. | GPUs cannot be attached to Docker containers. | GPU | telescope: 2.7.5.3, 2.7.5.9 or later |
| Local disk mounting inspection | MountDiskSystem | Major | The /etc/fstab file contains invalid UUIDs. | Ensure that the UUIDs in the /etc/fstab configuration file are correct. Otherwise, the server may fail to restart. | The disk mounting process fails, preventing the server from restarting. | - | telescope: 2.7.5.3, 2.7.5.9 or later |
| GPU: incorrectly configured dynamic route for Ant series server | GpuRouteConfigError | Major | The dynamic route of the NIC %s of an Ant series server is not configured or is incorrectly configured. CMD [ip route]: %s \| CMD [ip route show table all]: %s. | Configure the RoCE NIC route correctly. | The NPU network communication is abnormal. | GPU | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU: RoCE port not split | RoCEUdpConfigError | Major | The RoCE UDP port is not split. | Check the RoCE UDP port configuration on the NPU. | The communication performance of NPUs is affected. | Snt9b, Snt9b23 | telescope: 2.7.5.9 or later |
| Warning of automatic system kernel upgrade | KernelUpgradeWarning | Major | Warning of automatic system kernel upgrade. Old version: %s; new version: %s. | A system kernel upgrade may cause AI software exceptions. Check the system update logs and prevent the server from restarting. | The AI software may be unavailable. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU environment command detection | NpuToolsWarning | Major | The hccn_tool is unavailable. | Check whether the NPU driver is normal. | The IP address and gateway of the RoCE NIC cannot be configured. | Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU environment command detection | NpuToolsWarning | Major | The npu-smi is unavailable. | Check whether the NPU driver is normal. | NPUs cannot be used. | Snt3P, Snt3PD, Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU environment command detection | NpuToolsWarning | Major | The ascend-dmi is unavailable. | Check whether ToolBox is properly installed. | The ascend-dmi cannot be used for performance analysis. | Snt9b, Snt9b23 | telescope: 2.7.5.3, 2.7.5.9 or later |
| NPU: L1 switch port partial failure | NpuL1SwitchPortPartialFunctionFailure | Major | Some functions of the NPU's L1 1520 switch port fail. | Transfer this issue to the Ascend or hardware team for handling. | Services may be interrupted. | Snt9b23 | telescope: 2.7.5.9 or later; lqdcmi: 2.1.0 or later |
| NPU: L1 switch fault | NpuL1SwitchFault | Major | There are faults in the L1 1520 switch of the NPU. | Transfer this issue to the Ascend or hardware team for handling. | Services may be interrupted. | Snt9b23 | telescope: 2.7.5.9 or later; lqdcmi: 2.1.0 or later |
| NPU: Unmatched RoCE IP address | NpuRoceIPAddressMismatch | Major | The actual IP address of the RoCE NIC is inconsistent with the IP address in the hccn.conf configuration file. | Contact O&M engineers. | The parameter plane network is abnormal, preventing multi-node tasks from running. | Snt9b, Snt9b23 | telescope: 2.7.5.9 or later |
| Server: Component info collection | SysComponentInfo | Suggestion | The event collects server component details, including the driver and firmware versions, OS kernel, image name, ces-agent version, and SFS Turbo Client+ version. | None | None | Snt3PD, Snt9b, Snt9b23 | - |
| NPU: PFC status abnormal | NpuPfcStatusWarning | Major | The fifth PFC priority queue on the NPU is not set to 1. | Run hccn_tool -i [device_id] -pfc -s bitmap 0,0,0,0,1,0,0,0 or contact AI Compute Service engineers. | RoCE communication is abnormal, and services may be interrupted. | Snt9b23 | - |
| NPU: TLS certificate status abnormal | NpuTlsStatusWarning | Major | The certificate configurations of different NPUs are inconsistent. | It is recommended that TLS be either enabled on all NPUs or disabled on all NPUs. Run hccn_tool -i <device_id> -tls -s enable 0 or hccn_tool -i <device_id> -tls -s enable 1 on each NPU. | HCCL communication is abnormal, and services may be interrupted. | Snt9b23 | - |
| NPU: HCCS health status abnormal | NpuHccsHealthWarning | Major | The HCCS health status of the NPU is abnormal (not OK). | Transfer this issue to the Ascend or hardware team for handling. | Data synchronization between NPUs fails. As a result, the training task is interrupted or the inference result is incorrect. | Snt9b23 | - |
| SDI NIC: Status abnormal | SdiCheckWarning | Major | The SDI NIC is neither UP nor RUNNING. | | Services may be interrupted. | Snt9b, Snt9b23 | - |
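Two conditions described in the table can be checked on plain text, which may help when reproducing an event by hand. The sketch below is illustrative only and is not how the Cloud Eye Agent detects these events: the function names are hypothetical, the sample lspci lines are invented, and the way to read the current PFC bitmap from hccn_tool is not shown because its output format is not described here.

```python
from typing import List


def rev_ff_devices(lspci_output: str) -> List[str]:
    """Return the PCI addresses of devices that lspci reports as '(rev ff)'."""
    addresses = []
    for line in lspci_output.splitlines():
        if line.strip().endswith("(rev ff)"):
            # The first whitespace-separated field of an lspci line is the PCI address.
            addresses.append(line.split()[0])
    return addresses


def pfc_fifth_queue_enabled(bitmap: str) -> bool:
    """Check that the fifth entry of an 8-entry, comma-separated PFC bitmap is 1."""
    entries = [entry.strip() for entry in bitmap.split(",")]
    return len(entries) == 8 and entries[4] == "1"


# Invented sample output for illustration; real device names will differ.
SAMPLE_LSPCI = """\
3b:00.0 Processing accelerators: Example Vendor Device 0001 (rev 20)
5e:00.0 Processing accelerators: Example Vendor Device 0001 (rev ff)
"""

print(rev_ff_devices(SAMPLE_LSPCI))
print(pfc_fifth_queue_enabled("0,0,0,0,1,0,0,0"))
```

Both functions take text as input rather than running commands themselves, so they can be tried against saved command output from an affected server.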