Using Cloud Eye to Monitor NPU Events of Lite Servers
Description
You can use Cloud Eye to centrally collect key events and cloud resource operational events. When an event occurs, you will receive an alarm. Lite Servers mainly support BMS and ECS events. The table below lists NPU-related events. For details about other events, see Events Supported by Event Monitoring.
Constraints
- The Cloud Eye Agent plug-in is required for reporting events to Cloud Eye, and it runs under strict resource usage limits. When its resource usage exceeds the threshold, the Agent's circuit breaker is triggered. For details about the resource usage, see Cloud Eye Server Monitoring.
- The monitoring agent has been fully tested in the public images provided by Lite Server. If you use your own image, test the agent before deploying the image in a production environment to prevent incorrect monitoring information.
Prerequisites
The Cloud Eye Agent has been installed on the Lite Server. For details about how to check whether the Cloud Eye Agent is installed and how to install it, see Installing Cloud Eye Agent Monitoring Plug-ins.
Event Source
BMS/ECS
Event Namespace
SYS.BMS/SYS.ECS/service.ModelArts
Event List
| Event | Event ID | Event Severity | Description | Solution | Impact | Supported Model | Supported Cloud Eye Agent Version |
|---|---|---|---|---|---|---|---|
| NPU: device not found by npu-smi info | NPUSMICardNotFound | Major | The Ascend driver is faulty or the NPU is disconnected. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUSMICardNotFound Events" | The NPU cannot be used normally. | Snt3P Snt3PD Snt9b Snt9b23 | telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
| NPU: PCIe link error | PCIeErrorFound | Major | The lspci command output shows that the NPU is in the rev ff state. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for PCIeErrorFound Events" | The NPU cannot be used normally. | Snt3P Snt3PD Snt9b Snt9b23 | telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
| NPU: device not found by lspci | LspciCardNotFound | Major | The NPU is disconnected. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for LspciCardNotFound Events" | The NPU cannot be used normally. | Snt3P Snt3PD Snt9b Snt9b23 | telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
| NPU: overtemperature | TemperatureOverUpperLimit | Major | The temperature of DDR or software is too high. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUErrorCodeWarning Events" | The instance may be powered off and devices may not be found. | Snt3P Snt3PD Snt9b Snt9b23 | telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
| NPU: uncorrectable ECC error | UncorrectableEccErrorCount | Major | There are uncorrectable ECC errors on the NPU. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUErrorCodeWarning Events" | Services may be interrupted. | Snt3P Snt3PD | telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
| NPU: request for instance restart | RebootVirtualMachine | Suggestion | A fault occurs and the instance needs to be restarted. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUErrorCodeWarning Events" | Services may be interrupted. | Snt3P Snt3PD Snt9b Snt9b23 | telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
| NPU: request for SoC reset | ResetSOC | Suggestion | A fault occurs and the SoC needs to be reset. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUErrorCodeWarning Events" | Services may be interrupted. | Snt3P Snt3PD Snt9b Snt9b23 | telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
| NPU: request for restart AI process | RestartAIProcess | Suggestion | A fault occurs and the AI process needs to be restarted. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUErrorCodeWarning Events" | The current AI task will be interrupted. | Snt3P Snt3PD Snt9b Snt9b23 | telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
| NPU: error codes | NPUErrorCodeWarning | Major | A large number of NPU error codes indicating major or higher-level errors are returned. You can further locate the faults based on the error codes. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NPUErrorCodeWarning Events" | Services may be interrupted. | Snt3P Snt3PD Snt9b Snt9b23 | telescope: 2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later |
| Multiple NPU HBM ECC errors | NpuHbmMultiEccInfo | Suggestion | There are NPU HBM ECC errors. | This event is only a reference for other events. You do not need to handle it separately. | This event is only a reference for other events. You do not need to handle it separately. | Snt9b Snt9b23 | telescope: 2.7.5.9 or later |
| GPU: invalid RoCE NIC configuration | GpuRoceNicConfigIncorrect | Major | GPU: invalid RoCE NIC configuration | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for GpuRoceNicConfigIncorrect Events" | The parameter plane network is abnormal, preventing the execution of the multi-node task. | GPU | telescope: 2.7.5.9 or later |
| ReadOnly issues in OS | ReadOnlyFileSystem | Critical | The file system %s is read-only. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for ReadOnlyFileSystem Events" | The files cannot be written or operated. | - | telescope: 2.7.5.3 2.7.5.9 or later |
| NPU: driver and firmware not matching | NpuDriverFirmwareMismatch | Major | The NPU's driver and firmware do not match. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuDriverFirmwareMismatch Events" | NPUs cannot be used. | Snt3P Snt3PD Snt9b Snt9b23 | telescope: 2.7.5.3 2.7.5.9 or later |
| NPU: RoCE NIC down | RoCELinkStatusDown | Major | The RoCE link of NPU %d is down. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for RoCELinkStatusDown Events" | The NPU NIC is unavailable. | Snt9b Snt9b23 | telescope: 2.7.5.3 2.7.5.9 or later |
| NPU: RoCE NIC health status abnormal | RoCEHealthStatusError | Major | The RoCE network health status of NPU %d is abnormal. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for RoCEHealthStatusError Events" | The NPU NIC is unavailable. | Snt9b Snt9b23 | telescope: 2.7.5.3 2.7.5.9 or later |
| NPU: RoCE NIC configuration file /etc/hccn.conf not exist | HccnConfNotExisted | Major | The RoCE NIC configuration file /etc/hccn.conf does not exist. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for HccnConfNotExisted Events" | The RoCE NIC is unavailable. | Snt9b Snt9b23 | telescope: 2.7.5.3 2.7.5.9 or later |
| GPU: basic components abnormal | GpuEnvironmentSystem | Major | The nvidia-smi command is abnormal. | 1. Check whether the driver version is compatible with the GPU. Access the NVIDIA official website to check the GPU model and driver version: https://www.nvidia.cn/drivers/lookup/ 2. Run lsmod \| grep nvidia to check whether the NVIDIA kernel module is loaded. If not, run sudo modprobe nvidia to manually load it. 3. Reinstall the GPU driver. 4. If the fault persists, submit a service ticket and contact the O&M engineer. | The GPU driver is unavailable. | GPU | telescope: 2.7.5.3 2.7.5.9 or later |
| GPU: basic components abnormal | GpuEnvironmentSystem | Major | The nvidia-fabricmanager version is inconsistent with the GPU driver version. | Check whether the GPU driver version matches the nvidia-fabricmanager version. | The nvidia-fabricmanager cannot work properly, affecting GPU usage. | GPU | telescope: 2.7.5.3 2.7.5.9 or later |
| GPU: basic components abnormal | GpuEnvironmentSystem | Major | The container plugin nvidia-container-toolkit is not installed. | Install and configure the NVIDIA Container Toolkit by referring to its installation guide: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html | Docker cannot attach GPUs. | GPU | telescope: 2.7.5.3 2.7.5.9 or later |
| Local disk attachment inspection | MountDiskSystem | Major | The /etc/fstab file contains invalid UUIDs. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for MountDiskSystem Events" | The disk attachment error resulted in abnormal server restart. | - | telescope: 2.7.5.3 2.7.5.9 or later |
| GPU: incorrectly configured dynamic route for Ant series server | GpuRouteConfigError | Major | The dynamic route of the NIC %s of an Ant series server is not configured or is incorrectly configured. CMD [ip route]: %s \| CMD [ip route show table all]: %s. | Submit a service ticket and contact the O&M engineer. | The NPU network communication is abnormal. | GPU | telescope: 2.7.5.3 2.7.5.9 or later |
| NPU: RoCE port not split | RoCEUdpConfigError | Major | The RoCE UDP port is not split. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for RoCEUdpConfigError Events" | The communication performance of NPUs is affected. | Snt9b Snt9b23 | telescope: 2.7.5.9 or later |
| Warning of automatic system kernel upgrade | KernelUpgradeWarning | Major | Warning of automatic system kernel upgrade. Old version: %s; new version: %s. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for KernelUpgradeWarning Events" | The AI software may be unavailable. | Snt3P Snt3PD Snt9b Snt9b23 | telescope: 2.7.5.3 2.7.5.9 or later |
| NPU environment command detection | NpuToolsWarning | Major | The hccn_tool is unavailable. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuToolsWarning Events" | The IP address and gateway of the RoCE NIC cannot be configured. | Snt9b Snt9b23 | telescope: 2.7.5.3 2.7.5.9 or later |
| NPU environment command detection | NpuToolsWarning | Major | The npu-smi is unavailable. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuToolsWarning Events" | NPUs cannot be used. | Snt3P Snt3PD Snt9b Snt9b23 | telescope: 2.7.5.3 2.7.5.9 or later |
| NPU environment command detection | NpuToolsWarning | Major | The ascend-dmi is unavailable. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuToolsWarning Events" | The ascend-dmi cannot be used for performance analysis. | Snt9b Snt9b23 | telescope: 2.7.5.3 2.7.5.9 or later |
| NPU: L1 switch port partial failure | NpuL1SwitchPortPartialFunctionFailure | Major | Some functions of the NPU's L1 1520 switch port fail. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuL1SwitchPortPartialFunctionFailure Events" | Services may be interrupted. | Snt9b23 | telescope: 2.7.5.9 or later; lqdcmi: 2.1.0 or later |
| NPU: L1 switch fault | NpuL1SwitchFault | Major | There are faults in the L1 1520 switch of the NPU. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuL1SwitchFault Events" | Services may be interrupted. | Snt9b23 | telescope: 2.7.5.9 or later; lqdcmi: 2.1.0 or later |
| NPU: Unmatched RoCE IP address | NpuRoceIPAddressMismatch | Major | The actual IP address of the RoCE NIC is inconsistent with the IP address in the hccn.conf configuration file. | Reference: ModelArts Troubleshooting > "Lite Server" > "Handling Suggestions for NpuRoceIPAddressMismatch Events" | The parameter plane network is abnormal, preventing the execution of the multi-node task. | Snt9b Snt9b23 | telescope: 2.7.5.9 or later |
| VRD target version differs from the current version | InconsistentNpuVrdVersion | Major | The VRD target version differs from the current version. | Upgrade VRD by powering off the server, waiting 5 minutes, and then powering it back on. | The event has no short-term impact, but it may affect NPU stability over time. | Snt9b Snt9b23 | - |
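
The Supported Cloud Eye Agent Version column mixes individually listed versions with an open-ended "or later" entry. The following Python sketch shows one way to check an installed telescope version against such a cell. It is illustrative only: the helper names parse_version and agent_supports are hypothetical, and it assumes that listed versions match exactly and that the version written just before "or later" is a minimum. It does not handle cells that also list lqdcmi versions.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.7.5.9' into (2, 7, 5, 9) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def agent_supports(installed: str, cell: str) -> bool:
    """Check an installed telescope version against one table cell.

    `cell` is the cell text without the 'telescope:' prefix, for example
    '2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later'.
    Assumption (not stated in the table): listed versions match exactly,
    and the last version before 'or later' acts as a minimum.
    """
    current = parse_version(installed)
    # Keep only tokens that look like versions; drop the words 'or', 'later'.
    versions = [parse_version(tok) for tok in cell.split() if tok[0].isdigit()]
    if not versions:
        # Cell such as '-': no version listed, so support is not confirmed.
        return False
    if current in versions:
        return True
    open_ended = cell.rstrip().endswith("or later")
    return open_ended and current >= versions[-1]


if __name__ == "__main__":
    cell = "2.7.4.3 2.7.5.3 2.7.5.4 2.7.5.9 or later"
    for version in ("2.7.5.4", "2.7.5.5", "2.7.5.10"):
        print(version, agent_supports(version, cell))
```

Comparing versions as integer tuples avoids the string-comparison pitfall where "2.7.5.10" would sort before "2.7.5.9".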