
Rectifying a Faulty Node in a Standard Dedicated Resource Pool

This section describes how to rectify faulty nodes. Currently, you can reset, restart, migrate, or delete a node, as well as configure HA redundant nodes. During fault locating and performance diagnosis, certain O&M operations must be authorized by users.

HA Redundant Node

  • Function

    An HA redundant node is a backup node in a dedicated resource pool. It can automatically replace a faulty common node, improving the overall availability and fault tolerance of the resource pool, preventing service loss caused by a single-node failure, and ensuring service continuity and stability. You can set the number of HA instances based on your requirements.

    HA redundant nodes cannot run services, so they reduce the number of actually available instances in the resource pool. When delivering a task to a resource pool, set the number of instances based on the actually available instances with caution. If HA redundant instances are counted in, the task will fail to be delivered.

  • Mechanism
    • By default, HA redundant nodes are isolated and unschedulable; workloads cannot be scheduled to them.
    • HA redundant nodes function as standby nodes and work with node fault detection to ensure that faulty nodes in a resource pool are automatically replaced by normal nodes. The replacement normally takes only a few minutes. After the replacement, the original HA redundant nodes are de-isolated and become normal nodes, and the faulty nodes are labeled as HA redundant nodes. Rectify the faults on these nodes so that they can serve future automatic switchovers; once rectified, they become the new HA redundant nodes.

      HA redundant nodes free you from constantly monitoring node status and reduce O&M costs. However, because you must purchase backup nodes to serve as HA redundant nodes, resource costs are higher.

      If a logical subpool is created in a Standard physical pool and the subpool is transparent, HA redundancy is no longer available in the Standard physical pool and is only available in the subpool. Transparency is a whitelist function; submit a service ticket to use it.

Setting HA nodes: You can set multiple HA redundant nodes in batches for a resource pool or set a single node as an HA redundant node.

About 5% of the total nodes in a resource pool should be HA redundant nodes. For example, if a resource pool has 20 nodes, one of them should be an HA redundant node.

Note that the actually available resources of a resource pool are its total available resources minus the resources occupied by HA redundant nodes. If a node running services is set as an HA redundant node, the jobs on that node still occupy resources, yet the node is no longer counted as available, which can skew scheduling against the actually available resources. Therefore, set only service-free nodes as HA redundant nodes.
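Expressed as arithmetic, the sizing guideline and the availability accounting above reduce to two small formulas. The following minimal Python sketch illustrates them; rounding up and enforcing a minimum of one redundant node are assumptions of the sketch, not ModelArts rules:

```python
import math

def recommended_ha_nodes(total_nodes: int) -> int:
    # Guideline from this section: about 5% of the pool should be
    # HA redundant nodes. Rounding up and the minimum of one node
    # are assumptions of this sketch.
    return max(1, math.ceil(total_nodes * 0.05))

def actually_available(total_nodes: int, ha_nodes: int) -> int:
    # Actually available resources = total resources of the pool
    # minus the resources occupied by HA redundant nodes.
    return total_nodes - ha_nodes

pool_size = 20                                # example from this section
ha = recommended_ha_nodes(pool_size)          # -> 1
print(actually_available(pool_size, ha))      # -> 19 instances for tasks
```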

To check whether a node is running any service, go to the resource pool details page. In the Nodes tab, check whether all GPUs and Ascend chips on the node are available. If they are, the node is running no services.

  • Setting multiple HA redundant nodes in batches for a resource pool
    • Method 1: Set when purchasing the resource pool (supported only by Snt9C).

      Parameters:

      • Enable HA redundancy: Whether to enable HA redundancy for the resource pool. By default, HA redundancy is enabled for supernodes.
      • Redundant node distribution policy: distribution policy of the redundant nodes. Supernodes support only step-based even distribution, which reserves the same number of redundant nodes in each supernode.
      • Number of redundant instances: number of HA redundant instances set for this flavor. When the distribution policy is step-based even distribution, the redundancy coefficient is the number of redundant nodes reserved in each supernode.
    • Method 2: Set on the resource pool details page.
      Figure 1 Setting in the Specifications tab
    • Method 3: Set on the Scaling page.
  • Setting a single node as an HA redundant node
    • Enabling HA redundancy

      Select a node without any service as the HA redundant node. On the resource pool details page, click the Nodes tab, locate the target node, and click More > Enable HA Redundancy in the Operation column.

      If you need to enable HA redundancy for nodes in batches, select the target nodes, and click Enable HA Redundancy above the list.

      Figure 2 Enabling HA redundancy

      Figure 3 HA redundant node
    • Disabling HA redundancy

      On the resource pool details page, click the Nodes tab. Locate the target node and click More > Disable HA Redundancy in the Operation column.

      Once disabled, the nodes are de-isolated and no longer used as standby nodes. Workloads can be scheduled to them normally.

      If you need to disable HA redundancy for nodes in batches, select the target nodes, and click Disable HA Redundancy above the list.

      Figure 4 Disabling HA redundancy
      Figure 5 Non-HA redundancy

Resetting a Node

Perform this operation to upgrade the OS of the node. If an error occurs during node configuration update, reset the node to rectify the fault.

To reset a single node, access the Nodes tab, locate the target node in the list, and click Reset in the Operation column. You can also select multiple nodes and click Reset to reset them in batches.

NPU nodes whose instance specifications are snt9b23 cannot be reset.

Configure the parameters described in the table below.

Table 1 Parameters for resetting a node

  • Operating system: Select an OS from the drop-down list.
  • Configuration Mode: Select a configuration mode for resetting the node.
    • By node percentage: the maximum percentage of nodes that can be reset at a time
    • By instance quantity: the maximum number of nodes that can be reset at a time

For a concrete illustration of the two modes, see the sketch after Figure 6.
Figure 6 Resetting a node
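The following minimal sketch illustrates the two configuration modes; the function name, the rounding behavior, and the minimum batch of one node are assumptions of the sketch, not ModelArts definitions:

```python
import math

def max_reset_batch(total_nodes: int, mode: str, value: float) -> int:
    # Illustrative upper bound on how many nodes one reset task
    # touches at a time. mode is "percentage" (By node percentage)
    # or "quantity" (By instance quantity).
    if mode == "percentage":
        return max(1, math.floor(total_nodes * value / 100))
    return min(total_nodes, int(value))

print(max_reset_batch(20, "percentage", 10))  # -> 2 nodes per batch
print(max_reset_batch(20, "quantity", 5))     # -> 5 nodes per batch
```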

Check the node reset records on the Records page. When a node is being reset, its status is Resetting; after the reset is complete, the status changes to Available. You will not be charged for resetting a node.

  • Resetting a node will affect running services.
  • Only nodes in the Available or Unavailable state can be reset.
  • A single node can be in only one reset task at a time; multiple reset tasks cannot be delivered to the same node simultaneously.
  • If there are any nodes in the Replacing state in the operation records, nodes in the resource pool cannot be reset.
  • When the driver of a resource pool is being upgraded, nodes in this resource pool cannot be reset.
  • After a node is reset, it is unavailable for a short time while its services start and its health status is checked. Wait patiently.
  • For GPU and NPU specifications, after the node is reset, the driver of the node may be upgraded. Wait patiently.
Figure 7 Operation records

Restarting a Node

On the resource pool details page, click the Nodes tab to restart nodes. To restart a single node, locate the target node and click More > Reboot in the Operation column. You can also select multiple nodes and click Reboot above the list to restart them in batches.

When delivering a node restart task, select the corresponding nodes with caution, as restarting a node affects running services.

Figure 8 Restarting a node

Check the node operation records in the Records dialog box. When a node is being restarted, its status is Rebooting; after the restart is complete, the status changes to Available. You will not be charged for restarting a node.

Figure 9 Operation records
  • Restarting a node will affect running services.
  • Only nodes in the Available or Unavailable state can be restarted.
  • A single node can be in only one restart task at a time; multiple restart tasks cannot be delivered to the same node simultaneously.
  • If the node is being replaced, reset, or deleted, it cannot be restarted.
  • When the driver of a resource pool is being upgraded, nodes in this resource pool cannot be restarted.
  • After a node is restarted, it is unavailable for a short time while its services start and its health status is checked. Wait patiently.

Migrating a Node

Pay-per-use nodes can be migrated between resource pools.

To migrate nodes, the following requirements must be met:

  • There must be at least one node left in the original resource pool after node migration.
  • If the original resource pool has a logical subpool, the remaining resources in the original resource pool must meet the resource requirements of the logical subpool.
  • There must be sufficient available IP addresses in the target resource pool. Check the number of available IP addresses before migration (see the sketch after this list).
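These requirements can be pre-checked before delivering a migration task. The sketch below is hypothetical (all parameter names are made up, and it assumes one IP address per migrated node); ModelArts performs its own validation when you submit the migration:

```python
def can_migrate(source_total: int, nodes_to_move: int,
                subpool_demand: int, target_free_ips: int) -> bool:
    # Encodes the three requirements listed above. Assumes each
    # migrated node consumes one IP address in the target pool,
    # which is a simplification of this sketch.
    remaining = source_total - nodes_to_move
    return (
        remaining >= 1                         # a node must remain in the pool
        and remaining >= subpool_demand        # logical subpool still satisfied
        and target_free_ips >= nodes_to_move   # enough IPs in the target pool
    )

print(can_migrate(source_total=8, nodes_to_move=3,
                  subpool_demand=4, target_free_ips=16))  # -> True
```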

To migrate a node, perform the following steps:

  1. In the Nodes tab, select the target nodes and choose More > Migrate Node above the node list.
  2. Click Select Resource Pool on the displayed page.
    Figure 10 Selecting the target resource pool
  3. In the Select Resource Pool dialog box, select the target resource pool to which the nodes are to be migrated and click OK.
  4. Select the target node pool and click Next: Confirm Specifications.
    Figure 11 Selecting the target node pool
  5. Confirm the node migration information and click Submit.
    Figure 12 Confirming the node migration information

    In the upper right corner of the node list, click Records to view the operation records of node migration in the current resource pool.

Deleting or Unsubscribing from a Node

  • To release a single node from a pay-per-use resource pool, locate the target node and click Delete in the Operation column, enter DELETE in the displayed dialog box, and click OK.

    To delete nodes in batches, select the target nodes and click Delete above the node list, enter DELETE in the displayed dialog box, and click OK.

  • For a yearly/monthly resource pool whose resources are not expired, click Unsubscribe in the Operation column.
  • For a yearly/monthly resource pool whose resources are expired (in the grace period), click Release in the Operation column.

If the Delete button is available for a yearly/monthly node, click it to delete the node.

  • Before deleting, unsubscribing from, or releasing a node, ensure that there are no running jobs on this node. Otherwise, the jobs will be interrupted.
  • Delete, unsubscribe from, or release abnormal nodes in a resource pool and add new ones for substitution.
  • If a resource pool has only one node, that node cannot be deleted, unsubscribed from, or released.

Enabling/Disabling the Deletion Lock

To prevent nodes from being deleted or unsubscribed by mistake, you can enable the deletion lock. Once enabled, the nodes cannot be deleted or unsubscribed unless the lock is disabled.

  • The deletion lock can be enabled only for the nodes in the resource pool.
  • If the deletion lock is enabled, only node deletion and unsubscription are restricted. Other operations, such as node replacement, node restart, and node reset, work properly. Moreover, the resource pool that contains the nodes with deletion lock enabled can be deleted.
  • Enabling deletion lock: Locate the target node and choose More > Enable Deletion Lock in the Operation column. In the displayed dialog box, confirm the information, enter YES in the text box, and click OK.

    To enable deletion lock for multiple nodes in batches, select the target nodes and choose More > Enable Deletion Lock above the node list.

  • Disabling deletion lock: Locate the target node and choose More > Disable Deletion Lock in the Operation column. In the displayed dialog box, confirm the information, enter YES in the text box, and click OK.

    To disable deletion lock for multiple nodes in batches, select the target nodes and choose More > Disable Deletion Lock above the node list.

Authorizing O&M on the Event Center Page

To view the faulty nodes reported by the ModelArts O&M platform, log in to the ModelArts console. In the navigation pane on the left, choose Event Center. The planned events of the faulty nodes are displayed, including the basic information, event type, event status, and event description. You can either redeploy the nodes or authorize Huawei technical support to perform O&M operations.

  • Authorization conditions

    Table 2 lists the event types and event statuses for which authorization operations can be performed on a faulty node.

    Table 2 Authorization conditions (each event type is listed with its authorization operations; in all cases the event status must be Authorization Pending)

    • System maintenance: authorization and redeployment
    • Local disk recovery: authorization and redeployment

      After the local disk is recovered, you can restore the partition by resetting the node.

      WARNING: After authorization, recovering the local disk will cause local disk data loss. Therefore, migrate services and back up data before authorization.

    • Restarting a node: authorization
    • O&M authorization: authorization
    • Supernode maintenance: authorization
    • Supernode redeployment: redeployment

      Redeployment of supernodes must be performed within physical supernodes. When supernodes are sold out, redeployment is not supported, and the authorization button becomes unavailable.

    • Supernode local disk recovery: authorization

      WARNING: After authorization, recovering the local disk will cause local disk data loss. Therefore, migrate services and back up data before authorization.

  • Authorization

    If the faulty nodes meet the requirements listed in Table 2, you can authorize Huawei technical support to perform O&M on the faulty nodes.

    To do so, log in to the ModelArts console. In the navigation pane on the left, choose Event Center. Locate the target node and click Authorize in the Operation column. In the displayed dialog box, click OK.

    If the planned event does not meet the requirements listed in Table 2, the Authorize button becomes unavailable.

    After the O&M, Huawei technical support will disable the authorization. No further operations are required.

  • Redeployment

    If the faulty nodes meet the redeployment requirements listed in Table 2, you can authorize Huawei technical support to redeploy the faulty nodes.


    Redeploying nodes can restore them quickly, but local disk data will be lost. Therefore, migrate services and back up data before redeployment.

    1. To redeploy a node, log in to the ModelArts console. In the navigation pane on the left, choose Event Center under Resource Management, locate the node, and click Redeploy in the Operation column.

      If the planned event does not meet the requirements listed in Table 2, the Redeploy button becomes unavailable.

    2. Check whether Forcible redeployment is selected, enter YES in the text box, and click OK.

      Redeployment depends on the node status. If the node is unavailable, redeployment cannot be completed. However, you can select Forcible redeployment to forcibly redeploy the node.

      Forcible redeployment resets the node, deleting all data on both its local and cloud disks. Exercise caution when performing this operation.

FAQ

How do I locate a faulty node in a standard resource pool?

In a standard resource pool, ModelArts will add a taint to a faulty Kubernetes node so that jobs will not be scheduled to the tainted node. You can locate the fault by referring to the isolation code and detection method. For details, see Faulty Nodes in a Standard Resource Pool.
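If you have kubectl access to the cluster behind the resource pool (whether your pool exposes a kubeconfig is an assumption of this sketch; Standard pools may only be observable through the console), you can list the tainted nodes with the official Kubernetes Python client:

```python
# Minimal sketch: list nodes that carry taints, using the official
# Kubernetes Python client (pip install kubernetes).
from kubernetes import client, config

config.load_kube_config()  # reads the kubeconfig, e.g. ~/.kube/config
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    for taint in node.spec.taints or []:
        # On a faulty node isolated by ModelArts, the taint key/value
        # correspond to the isolation code mentioned above.
        print(f"{node.metadata.name}: {taint.key}={taint.value} ({taint.effect})")
```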