Diagnosing Faults on Lite Servers
Description
The Task Center for Lite Servers provides one-click fault diagnosis capabilities, covering both parameter plane network diagnosis and Ascend device diagnosis. You can quickly perform network and Ascend device health checks directly from the product console without needing in-depth knowledge of specific diagnosis command-line operations.
For parameter plane network diagnosis, you can query the network status, IP address, and mask of the NPU. For Ascend device diagnosis, you can check driver and firmware version compatibility and run automatic in-band checks. You can start diagnosis tasks on multiple servers in one batch, which saves considerable time.
Constraints
- Currently, only Snt9b and Snt9b21 common nodes and Snt9b23 supernodes are supported.
- You can select a maximum of 50 common nodes or supernode child nodes for the same task.
- The NodeTaskHub plug-in is required for the node where the task is to be created. Ensure that the plug-in is installed before task creation. For details, see Installing the AI Plugin for a Lite Server.
- Only one diagnosis task can run on a node at a time. A task cannot be interrupted once started, so plan task priorities in advance.
- Ensure that no services are running on the nodes you are going to diagnose. The commands run during diagnosis can cause service interruptions or errors.
- Install the MCU, driver, and firmware for Ascend HDK 23.0.0 or later before starting the diagnosis. These are already installed on the preconfigured OS. If you use a custom OS, ensure that they have been installed correctly.
- The diagnosis requires the Ascend-docker-runtime development kit. This software is pre-installed on the default OS. If you use a custom OS, ensure the software has been installed correctly.
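Because a single task accepts at most 50 nodes, a longer node list has to be split across several tasks. The following is a minimal sketch of that planning step, not part of the product: `batch_nodes` and `MAX_NODES_PER_TASK` are hypothetical names, and the input is assumed to be one node name per line.

```shell
#!/bin/sh
# Sketch only: batch_nodes and MAX_NODES_PER_TASK are illustrative names.
MAX_NODES_PER_TASK=50   # per-task limit stated in the constraints above

# batch_nodes [MAX]
# Reads node names from stdin (one per line) and prints one batch per line,
# space-separated, with at most MAX names in each batch.
batch_nodes() {
    bn_max="${1:-$MAX_NODES_PER_TASK}"
    bn_count=0
    bn_line=""
    # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
    while IFS= read -r bn_node || [ -n "$bn_node" ]; do
        [ -n "$bn_node" ] || continue            # skip blank lines
        bn_line="${bn_line:+$bn_line }$bn_node"  # append with a space separator
        bn_count=$((bn_count + 1))
        if [ "$bn_count" -eq "$bn_max" ]; then
            printf '%s\n' "$bn_line"
            bn_line=""
            bn_count=0
        fi
    done
    # Emit the last, partially filled batch if there is one.
    if [ -n "$bn_line" ]; then
        printf '%s\n' "$bn_line"
    fi
    return 0
}
```

For example, `batch_nodes < nodes.txt` (where `nodes.txt` is a hypothetical file of node names) prints one line per task to create in the console.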
Procedure
- Log in to the ModelArts console and open the Lite Servers page using the path for your console version. Then click the Task Center tab.
  - New console: In the navigation pane, choose Resource Management > Lite Compute Resources > Lite Servers.
  - Old console: In the navigation pane, choose Resource Management > Lite Servers.
- Click Create Task in the upper right corner. On the displayed Job Templates page, locate Ascend Fault Diagnosis, and click Create Task.
  Figure 1 Task templates
- On the Ascend Fault Diagnosis page, enter the task name and description. Set the server model and type, select the diagnosis items, select the notice, and click Create now.
Table 1 Parameters for creating a task

| Parameter | Description |
|---|---|
| Name | The system automatically generates a task name. You can change the name as required. |
| Description | Enter the task description for quick search. |
| Server Model | Select a server model and select nodes in the node list. You can search for node information using keywords. Snt9b and Snt9b21 common nodes and Snt9b23 supernodes are supported. |
| Diagnosis Item | You can select Parameter Plane Network Diagnosis, Ascend Device Diagnosis, or both. Parameter Plane Network Diagnosis: Check and record parameter-plane network metrics and information. Ascend Device Diagnosis: Check the health and compatibility of Ascend software and chip metrics. |

- View the task execution status in the Task Center tab.
- Click the task name to open its details page.
  Figure 2 Checking the task details
- On the task details page, locate the target node and click View Logs in the Operation column. The window displayed on the right shows the detailed task execution log. All check results appear in the task logs, together with a basic log analysis.
  Figure 3 Viewing logs
In-band Automatic Check Items
The table below lists the in-band automatic check items in the Ascend device check task.
| Check Item | Command Reference | Action |
|---|---|---|
| Checking the UDP port split configuration | `hccn_tool -i $i -udp -g` | Check whether the port number is 0/4791. |
| Checking the NPU health information | `timeout 20s npu-smi info -t health -i "$i" \| grep OK -c` | Only for Snt9b23. Check whether the NPU health code is 3. |
| Checking whether the NPU driver versions are consistent | `timeout 20s npu-smi info -t board -i "$i" \| grep Version` | Check whether the driver numbers of all NPUs are the same. |
| Checking the PCIe link status | `lspci \| grep d8` / `lspci \| grep d8 -c` | Only for Snt9b23. Check whether the PCIe link is 16. |
| Checking whether the NPU NIC is UP | `hccn_tool -i $i -link -g` | Check whether the NIC is down. |
| Checking the NPU NIC health status | `hccn_tool -i $i -net_health -g` | Check whether the NIC is healthy. |
| Checking whether the NPU PFC meets the requirements | `hccn_tool -i $i -pfc -g` | Check whether the PFC meets the requirements. The PFC configuration is as follows: |
| Checking whether the TLS certificate meets the requirements | `hccn_tool -i $i -tls -g \| grep switch` | Check whether switch[0] in the field meets the requirements. |
| Driver and firmware version compatibility test | `ascend-dmi -ci` | Check whether the compatibility meets the requirements. |
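
To reproduce some of these checks by hand on a node, the commands in the table can be wrapped in a loop over the NPU IDs that `$i` stands for. The sketch below is an illustration under stated assumptions, not the script the diagnosis task actually runs: the helper names are hypothetical, `NPU_COUNT` defaults to an assumed 8 NPUs per node, and no particular output format is assumed beyond the `grep Version` filter shown in the table.

```shell
#!/bin/sh
# Sketch only: helper names and NPU_COUNT are illustrative assumptions.
NPU_COUNT="${NPU_COUNT:-8}"   # assumed number of NPUs on the node; adjust as needed

# Print the "Version" lines for NPU $1 (command taken from the table above).
get_npu_versions() {
    timeout 20s npu-smi info -t board -i "$1" | grep Version
}

# versions_consistent GETTER COUNT
# Calls GETTER for NPU 0..COUNT-1 and compares each result with that of NPU 0.
# Returns 0 if all match, 1 on the first mismatch, 2 if NPU 0 returned nothing.
versions_consistent() {
    vc_getter="$1"
    vc_count="$2"
    vc_ref=$("$vc_getter" 0) || true
    if [ -z "$vc_ref" ]; then
        echo "no version information for NPU 0" >&2
        return 2
    fi
    vc_i=1
    while [ "$vc_i" -lt "$vc_count" ]; do
        vc_cur=$("$vc_getter" "$vc_i") || true
        if [ "$vc_cur" != "$vc_ref" ]; then
            echo "NPU $vc_i reports different versions than NPU 0" >&2
            return 1
        fi
        vc_i=$((vc_i + 1))
    done
    return 0
}

# Print NIC link and health status for every NPU (commands from the table).
print_nic_status() {
    ns_i=0
    while [ "$ns_i" -lt "$NPU_COUNT" ]; do
        echo "== NPU $ns_i =="
        hccn_tool -i "$ns_i" -link -g
        hccn_tool -i "$ns_i" -net_health -g
        ns_i=$((ns_i + 1))
    done
}
```

On a node with the Ascend tools installed, `versions_consistent get_npu_versions "$NPU_COUNT"` runs the driver consistency check and `print_nic_status` prints the NIC state for review. Passing the getter as an argument keeps the comparison logic separate from the `npu-smi` call, so it can be exercised without hardware.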