Updated on 2026-06-01 GMT+08:00

Performing Health Check on Lite Servers

Description

During routine O&M, engineers face challenges in stress testing, health checks, fault detection, and log analysis. To improve O&M efficiency, Lite Servers are integrated with AI Compute Service Brain, which inspects and manages Lite Servers comprehensively. You can create a health check job directly from the Lite Servers page: this takes you to the AI Compute Service Brain page, where you create and execute the job. This improves system O&M capabilities and reliability.

Constraints

  • Currently, only Snt9b and Snt9b21 common nodes and Snt9b23 supernodes are supported.
  • The Lite Server node must be in the Running state.
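
The two constraints above can be checked before you open the console. The following is a minimal sketch, not a ModelArts API: the LiteServerNode class, its fields, and the health_check_blockers function are hypothetical names introduced for illustration, and the node data is assumed to come from your own inventory records.

```python
from dataclasses import dataclass
from typing import List

# Node types listed in the Constraints section of this page.
SUPPORTED_COMMON_NODE_TYPES = {"Snt9b", "Snt9b21"}
SUPPORTED_SUPERNODE_TYPES = {"Snt9b23"}


@dataclass(frozen=True)
class LiteServerNode:
    """Hypothetical local record of a Lite Server node (not a ModelArts type)."""
    name: str
    node_type: str      # for example "Snt9b"
    is_supernode: bool  # True for supernodes, False for common nodes
    state: str          # for example "Running"


def health_check_blockers(node: LiteServerNode) -> List[str]:
    """Return the reasons a node does not meet the constraints (empty if it does)."""
    reasons = []
    if node.is_supernode:
        allowed = SUPPORTED_SUPERNODE_TYPES
    else:
        allowed = SUPPORTED_COMMON_NODE_TYPES
    if node.node_type not in allowed:
        kind = "supernode" if node.is_supernode else "common node"
        reasons.append(f"{node.node_type} is not a supported {kind} type")
    if node.state != "Running":
        reasons.append(f"node is in the {node.state} state, not Running")
    return reasons


if __name__ == "__main__":
    nodes = [
        LiteServerNode("node-a", "Snt9b", False, "Running"),
        LiteServerNode("node-b", "Snt9b23", True, "Stopped"),
    ]
    for node in nodes:
        blockers = health_check_blockers(node)
        print(node.name, "eligible" if not blockers else blockers)
```

The function returns a list of reasons rather than a single boolean so that a node failing both constraints reports both problems at once.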

Prerequisites

The NodeTaskHub plugin has been installed for the Lite Server on which you want to create a health check task. For details, see Installing the AI Plugin for a Lite Server.

Creating an Inspection Job

  1. Log in to the ModelArts console and open the Lite Servers page.
    • New console: In the navigation pane, choose Resource Management > Lite Compute Resources > Lite Servers.
    • Old console: In the navigation pane, choose Resource Management > Lite Servers.
  2. In the Common node list, choose More > Create Health Check Task in the Operation column on the right. On the displayed page, configure parameters.
    Table 1 Parameters for creating a health check task

    | Configuration Item | Parameter | Description |
    |---|---|---|
    | Basic Configuration | Assignment Name | Enter a custom name for the inspection task. |
    | Inspection Object | Select Object | Select common nodes, supernodes, or an entire rack of nodes, and then select the target nodes in the list. You can select a maximum of 48 nodes. |
    | Inspection Type | Standard Inspection | A check that takes minutes and does not affect jobs on the nodes. |
    | Inspection Type | Deep Inspection | A check that takes hours and affects services on the nodes. It occupies NPUs for a long time, so ensure that no service is running in the cluster during the inspection. |
    | Load Test Case Configuration | NPU Performance Diagnosis | Diagnoses performance by bandwidth, AI FLOPs, or eye diagram test. You can select one or more options. BandWidth: diagnoses the local bandwidth. Aiflops: diagnoses the compute performance of the chips. Eye Diagram Test: queries detailed signal quality data. |
    | Load Test Case Configuration | NPU stress testing | Stress-tests the AI Core, HBM, and P2P links. AI Core stress test: stress-tests the AI Cores to detect errors. HBM stress test: stress-tests the high-bandwidth memory. P2P stress test: checks whether a hardware fault occurs on the HCCS communication link from the source device to the target device. |
    | Load Test Case Configuration | Network Load Testing | Tests single-node HCCL communication bandwidth, multi-node HCCL bandwidth, RoCE (RDMA) network bandwidth, and hyperplane network bandwidth. Single-node HCCL communication bandwidth test: stress-tests collective communication performance within a single compute node. Multi-node HCCL bandwidth test: stress-tests baseline collective communication performance across multiple compute nodes. RoCE network bandwidth test: tests the RoCE network bandwidth between two nodes. Hyperplane test: tests the collective communication bandwidth of the hyperplane network. |
  3. Read the terms of use, enter YES, and click Create now.
  4. After the inspection job is submitted, go to the ModelArts console. In the navigation pane on the left, choose Health Inspection under O&M Management. View the inspection job status and details.
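
The limits stated in the procedure above (at most 48 nodes per job, and no running services during a deep inspection) can be checked in a planning script before you fill in the console form. The following is a minimal sketch under stated assumptions: HealthCheckJobDraft, InspectionType, and validate_draft are hypothetical names introduced for illustration and do not correspond to any ModelArts or AI Compute Service Brain API. The job itself is still created on the console as described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

# Upper limit on selected nodes, taken from the Select Object description.
MAX_NODES_PER_JOB = 48


class InspectionType(Enum):
    STANDARD = "Standard Inspection"  # takes minutes, does not affect jobs
    DEEP = "Deep Inspection"          # takes hours, occupies NPUs


@dataclass
class HealthCheckJobDraft:
    """Hypothetical local plan for a health check job (not a console object)."""
    name: str
    node_names: List[str]
    inspection_type: InspectionType
    # Set to True only after confirming that no service runs in the cluster.
    confirmed_no_running_services: bool = False


def validate_draft(draft: HealthCheckJobDraft) -> List[str]:
    """Return a list of problems with the draft (empty if none were found)."""
    problems = []
    if not draft.name.strip():
        problems.append("job name is empty")
    if not draft.node_names:
        problems.append("no nodes selected")
    if len(draft.node_names) > MAX_NODES_PER_JOB:
        problems.append(
            f"{len(draft.node_names)} nodes selected, limit is {MAX_NODES_PER_JOB}"
        )
    if len(set(draft.node_names)) != len(draft.node_names):
        problems.append("duplicate node names in selection")
    if (draft.inspection_type is InspectionType.DEEP
            and not draft.confirmed_no_running_services):
        problems.append("deep inspection requires that no service is running")
    return problems


if __name__ == "__main__":
    draft = HealthCheckJobDraft(
        name="weekly-deep-check",
        node_names=[f"node-{i}" for i in range(50)],
        inspection_type=InspectionType.DEEP,
    )
    for problem in validate_draft(draft):
        print("blocked:", problem)
```

A selection larger than 48 nodes would need to be split across several jobs. How to batch them is left to the operator, since this page does not describe any limit on concurrent jobs.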