Setting and Enabling HA Redundant Nodes in a Lite Cluster Resource Pool

Scenario

If the service continues running with high concurrent traffic, once the node that carries the service is faulty, the service may be interrupted if no backup mechanism is available, causing great loss and impact. In this case, a guarantee mechanism is required to ensure stable service running.

An HA redundant node is a standby node used to ensure high service availability on ModelArts. In this way, when a service node is faulty, the HA redundant node takes over the service and allocates resources dynamically based on service traffic, ensuring service continuity and stability.

HA redundant nodes are provided in ModelArts Standard resource pools. For details, see HA Redundant Node. However, HA redundant nodes cannot be directly configured in a Lite Cluster. This section describes how to manually configure HA redundant nodes in a ModelArts Lite Cluster resource pool. When a node is faulty, the HA redundant nodes can be used to quickly restore services without waiting for the faulty node to recover.

Figure 1 HA redundant node
Click to enlarge

The following figure shows the overall process.

Figure 2 Setting and enabling HA redundant nodes
Click to enlarge

Step 1: Setting HA Redundant Nodes: Set HA redundant nodes by adding specific taints to nodes.

Step 2: Configuring Node Alarm Notifications to Detect Faulty Nodes: Configure node alarm notifications to detect node faults.

Step 3: Replacing a Faulty Node with an HA Redundant Node: Add a fault taint to the faulty node and drain the node to clear the tasks on the faulty node. Meanwhile, specific taints of the HA redundant node are deleted, and the HA redundant node is officially enabled.

Step 4: Converting the Faulty Node to a New HA Redundant Node: The faulty node repaired by Huawei Cloud becomes a new HA redundant node.

Billing

Like common nodes, HA redundant nodes are charged for compute resources. For Lite Cluster resource pools, only the yearly/monthly billing mode is supported. For details, see Table 1.

**Table 1** Billing items
Billing Item		Description	Billing Mode	Billing Formula
Compute resource	Dedicated resource pool	Usage of compute resources. For details, see ModelArts Pricing Details.	Yearly/Monthly	Specification unit price x Number of compute nodes x Purchase duration

Prerequisites

You have created a Lite Cluster resource pool. For details, see Enabling Lite Cluster Resources.

Step 1: Setting HA Redundant Nodes

Before using a Lite Cluster resource pool, add specific taints to nodes to set HA redundant nodes.

Log in to the ModelArts console. In the navigation pane on the left, choose Lite Cluster under Resource Management.
Click the resource pool name. The Basic Information tab is displayed.
Click the CCE cluster. The CCE cluster node management page is displayed.
Figure 3 Basic pool information
Locate the selected idle node and choose More > Manage Taint.
In the displayed dialog box, add taints key=backupNode and effect=NoSchedule to the node and click OK.

Step 2: Configuring Node Alarm Notifications to Detect Faulty Nodes

Configure node alarm notifications to detect node faults.

By default, the node fault metric (nt_npg) is reported to AOM. You can configure notifications such as the SMS or email on AOM.

After a faulty node is detected, log in to the ModelArts console. In the navigation pane on the left, choose Event Center under Resource Management. On the displayed page, view the planned events of the node and authorize Huawei Cloud to repair the node. For details, see Managing Lite Cluster Nodes.

The following steps are performed on AOM 1.0.

Log in to the AOM console.
In the navigation pane, choose Alarm Center > Alarm Rules. Then, click Create Alarm Rule.
Set an alarm rule. (NPU disconnection is used as an example.)
- Rule Type: Select Metric alarm rule.
- Configuration Mode: Select PromQL.
- Default Rule: Select Custom and enter the following information in the command input box:
```
sum(nt_npg{type="NT_NPU_CARD_LOSE"} !=2) by (cluster_name, node_ip,type)
```
- Alarm Rule Details > Duration: Select 1 minute, which triggers a major alarm when the rule's conditions persist continuously for one minute.
- (Optional) Alarm Notification: To get notified of alarms by email or SMS message, configure action rules for the alarm rule. If no action rule is available, you can create one.

Step 3: Replacing a Faulty Node with an HA Redundant Node

Add a fault taint to the faulty node and drain the node to clear the tasks on the faulty node. Meanwhile, specific taints of the HA redundant node are deleted, and the HA redundant node is officially enabled.

Add a fault taint to the faulty node.
Set the taint by referring to Step 1: Setting HA Redundant Nodes (key=faultyNode and effect=NoSchedule).
On the CCE cluster node management page, locate the HA redundant node for which a taint has been configured in Step 1: Setting HA Redundant Nodes, and click Manage Taint.
In the displayed dialog box, locate the taint record whose key is backupNode, click Delete, and then click OK.
On the CCE cluster node management page, locate the faulty node, and choose More > Node Drainage.
On the displayed page, select Forcible Drain. When you set drainage, the system automatically sets the node to unschedulable and marks the node with a taint whose key is node.kubernetes.io/unschedulable.
Drain the tasks on the faulty node tainted in step 1 and reschedule the affected tasks.

After the draining is complete, a message is displayed in the node status on the node management page, indicating that the draining is successful.
After the draining is complete, locate the faulty node and choose More > Enable Scheduling on the right. In the displayed dialog box, click Yes. Cancel the automatic unscheduling settings. The taint whose key is node.kubernetes.io/unschedulable is automatically removed.

Step 4: Converting the Faulty Node to a New HA Redundant Node

The faulty node repaired by Huawei Cloud becomes a new HA redundant node.

To view the repair status of the node, log in to the ModelArts console, and choose Event Center under Resource Management. If the event status is Completed, the repair is complete. For details, see Authorizing O&M on the Event Center Page.

Add taints key=backupNode and effect=NoSchedule to the restored faulty node and remove the taint key=faultyNode. For details, see Step 1: Setting HA Redundant Nodes.

Parent topic: Using Lite Cluster Resources

Previous topic: Mounting an SFS Turbo File System to a Lite Cluster

Next topic: Cross-Region Access from Lite Cluster to Other Services