Performing Health Check on Lite Servers
Description
During routine O&M, engineers face challenges in stress testing, health checks, fault detection, and log analysis. To improve O&M efficiency, Lite Servers interconnect with AI Compute Service Brain so that Lite Servers can be comprehensively inspected and managed. You can start creating a health check job directly from the Lite Servers page, which lets O&M engineers access the AI Compute Service Brain page to create and execute the job, improving system O&M capabilities and reliability.
Constraints
- Currently, only Snt9b and Snt9b21 common nodes and Snt9b23 supernodes are supported.
- The Lite Server node must be in the Running state.
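The two constraints above can be expressed as a small pre-check before a health check task is created. The following is a minimal sketch only: the `Node` class, its field names, and the `"common"` / `"supernode"` labels are hypothetical and are not part of any ModelArts API. Only the flavor names and the Running state come from the constraints listed above.

```python
from dataclasses import dataclass

# Flavors taken from the constraints above. The "common"/"supernode"
# keys are illustrative labels, not identifiers from the service.
SUPPORTED_FLAVORS = {
    "common": {"Snt9b", "Snt9b21"},
    "supernode": {"Snt9b23"},
}


@dataclass
class Node:
    """Hypothetical description of a Lite Server node."""
    name: str
    kind: str    # "common" or "supernode"
    flavor: str  # for example "Snt9b"
    state: str   # for example "Running"


def eligibility_issues(node: Node) -> list:
    """Return the reasons a node cannot be health-checked (empty if none)."""
    issues = []
    if node.flavor not in SUPPORTED_FLAVORS.get(node.kind, set()):
        issues.append(f"{node.kind} node flavor {node.flavor} is not supported")
    if node.state != "Running":
        issues.append(f"node is in state {node.state}, expected Running")
    return issues
```

A node passes the pre-check when `eligibility_issues(node)` returns an empty list; otherwise each string in the result names one violated constraint.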
Prerequisites
The NodeTaskHub plugin has been installed for the Lite Server on which you want to create a health check task. For details, see Installing the AI Plugin for a Lite Server.
Creating an Inspection Job
- Log in to the ModelArts console and go to the Lite Servers page.
  - New console: In the navigation pane, choose Resource Management > Lite Compute Resources > Lite Servers.
  - Old console: In the navigation pane, choose Resource Management > Lite Servers.
- In the Common node list, choose More > Create Health Check Task in the Operation column on the right. On the displayed page, configure parameters.
Table 1 Parameters for creating a health check task

| Configuration Item | Parameter | Description |
| --- | --- | --- |
| Basic Configuration | Assignment Name | Enter a custom inspection task name. |
| Inspection Object | Select Object | Select common nodes, supernodes, or an entire rack of nodes, then select the target nodes in the list. You can select at most 48 nodes. |
| Inspection Type | Standard Inspection | Minute-scale check that does not affect jobs on nodes. |
| Inspection Type | Deep Inspection | Hour-scale check that affects services on nodes and occupies NPUs for a long time. Ensure that no service is running in the cluster during the inspection. |
| Load Test Case Configuration | NPU Performance Diagnosis | Performs performance diagnosis by bandwidth, AI FLOPs, or eye diagram test. You can select one or more options. BandWidth: diagnoses the local bandwidth. Aiflops: diagnoses the compute capability of the chips. Eye Diagram Test: queries detailed signal quality data. |
| Load Test Case Configuration | NPU stress testing | Performs stress tests on the AI Core, HBM, and P2P. AI Core stress test: stress-tests the AI Cores to detect errors. HBM stress test: stress-tests the high-bandwidth memory. P2P stress test: checks whether a hardware fault occurs on the HCCS communication link from the source device to the target device. |
| Load Test Case Configuration | Network Load Testing | Performs the single-node HCCL communication bandwidth test, multi-node HCCL bandwidth test, RoCE (RDMA) network bandwidth test, and hyperplane test. Single-node HCCL communication bandwidth test: stress-tests collective communication performance within a single compute node. Multi-node HCCL bandwidth test: stress-tests baseline collective communication performance between multiple compute nodes. RoCE network bandwidth test: tests the RoCE network bandwidth between two nodes. Hyperplane test: tests the collective communication bandwidth of the hyperplane network. |
- Read the terms of use, enter YES, and click Create now.
- After the inspection job is submitted, go to the ModelArts console. In the navigation pane on the left, choose Health Inspection under O&M Management. View the inspection job status and details.
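The parameters configured in the steps above can be modelled as a small data structure, which may help when reviewing a planned task before entering it in the console. This is a sketch under stated assumptions: the `HealthCheckTask` class, its field names, and the option keys (such as `"hccl_single_node"`) are hypothetical and do not correspond to a documented ModelArts API. Only the 48-node limit, the two inspection types, and the test case groupings are taken from Table 1.

```python
from dataclasses import dataclass, field

MAX_NODES = 48  # "You can select at most 48 nodes."
INSPECTION_TYPES = {"standard", "deep"}

# Test case groups and options from Table 1; the key spellings are illustrative.
TEST_CASES = {
    "npu_performance_diagnosis": {"bandwidth", "aiflops", "eye_diagram"},
    "npu_stress_testing": {"ai_core", "hbm", "p2p"},
    "network_load_testing": {
        "hccl_single_node", "hccl_multi_node", "roce", "hyperplane",
    },
}


@dataclass
class HealthCheckTask:
    """Hypothetical local model of the console form, not a service object."""
    name: str
    nodes: list
    inspection_type: str = "standard"
    test_cases: dict = field(default_factory=dict)

    def validate(self) -> list:
        """Return a list of problems found in the planned task."""
        errors = []
        if not self.name.strip():
            errors.append("task name is empty")
        if not 1 <= len(self.nodes) <= MAX_NODES:
            errors.append(f"select between 1 and {MAX_NODES} nodes")
        if self.inspection_type not in INSPECTION_TYPES:
            errors.append(f"unknown inspection type: {self.inspection_type}")
        for group, options in self.test_cases.items():
            allowed = TEST_CASES.get(group)
            if allowed is None:
                errors.append(f"unknown test case group: {group}")
                continue
            unknown = sorted(set(options) - allowed)
            if unknown:
                errors.append(f"unknown options in {group}: {unknown}")
        return errors
```

For example, `HealthCheckTask("weekly-check", ["node-1"], "deep", {"npu_stress_testing": ["hbm"]}).validate()` returns an empty list, while a task that lists 49 nodes reports the node-count problem.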