Updated on 2025-08-18 GMT+08:00

Authorizing the Repair of Lite Server Nodes

Scenario

If hardware maintenance is required for a Lite Server node due to an unrecoverable fault, a scheduled event will be pushed to the event center of the console. In the event center, you can view the event information, type, status, and description. You can also authorize Huawei technical support to perform O&M on the faulty node or redeploy the node.

Table 1 Event operation execution conditions

| Event Type | Event Status | Supported Operations | Applicable Resource Type | Description |
| --- | --- | --- | --- | --- |
| System maintenance | Authorization Pending | Authorization and redeployment | Snt9b | Authorizes Huawei technical support to perform system maintenance on the faulty node. |
| Local disk recovery | Authorization Pending | Authorization and redeployment | Snt9b | Authorizes Huawei technical support to repair the faulty local disk. WARNING: After authorization, recovering the local disk causes the data on the local disk to be lost. Migrate services and back up data before authorization. |
| Supernode maintenance | Authorization Pending | Authorization | Snt9b23 | Authorizes Huawei technical support to recover faulty nodes by manually repairing or replacing components. |
| Supernode redeployment | Authorization Pending | Authorization | Snt9b23 | Authorizes the Huawei O&M system to recover faulty nodes by automatically replacing them. After the recovery, the node name, node ID, and IP address remain unchanged; only the physical device information changes. |
| Supernode local disk recovery | Authorization Pending | Authorization | Snt9b23 | Authorizes Huawei technical support to restore the local disk of the supernode. WARNING: After authorization, recovering the local disk causes the data on the supernode's local disk to be lost. Migrate services and back up data before authorization. |

  • Authorization: Authorize Huawei technical support to repair the hardware of the faulty node one by one, which takes a long time.
  • Redeployment: Authorize Huawei technical support to replace the faulty node with a new one. This is fast, but local disk data will be lost after the redeployment. Exercise caution, and migrate services and back up data before redeployment.
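The conditions in Table 1 can be read as a lookup from event type to the operations the console offers. The following is a minimal sketch of that reading, not a console or SDK API: the names `TABLE_1`, `Rule`, and `supported_operations` are hypothetical and exist only for illustration.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    resource_type: str      # Applicable Resource Type column
    operations: frozenset   # Supported Operations column


# Hypothetical encoding of Table 1 (illustration only).
TABLE_1 = {
    "System maintenance": Rule("Snt9b", frozenset({"Authorization", "Redeployment"})),
    "Local disk recovery": Rule("Snt9b", frozenset({"Authorization", "Redeployment"})),
    "Supernode maintenance": Rule("Snt9b23", frozenset({"Authorization"})),
    "Supernode redeployment": Rule("Snt9b23", frozenset({"Authorization"})),
    "Supernode local disk recovery": Rule("Snt9b23", frozenset({"Authorization"})),
}


def supported_operations(event_type, event_status, resource_type):
    """Return the operations Table 1 allows, or an empty set if none apply."""
    rule = TABLE_1.get(event_type)
    # Every row of Table 1 requires the Authorization Pending status.
    if rule is None or event_status != "Authorization Pending":
        return frozenset()
    if resource_type != rule.resource_type:
        return frozenset()
    return rule.operations
```

For example, `supported_operations("Supernode maintenance", "Authorization Pending", "Snt9b23")` yields only `Authorization`, while an event that has already left the Authorization Pending state yields an empty set.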

Constraints

  • Only Ascend Snt9b and Snt9b23 support hardware maintenance through scheduled events.
  • Redeployment of supernodes must be performed within physical supernodes. If there are 48 supernodes, redeployment is not supported and the Authorize button becomes unavailable.
  • If the scheduled event does not meet the requirements listed in Table 1, the Authorize button becomes unavailable.
  • Before authorizing a supernode redeployment event, you need to stop the server instance on the Lite Servers page. Otherwise, the authorization fails. After the event is executed, restart the server instance.
  • Authorizing a node will affect services running on it. For events of the Supernode Redeployment type, authorization can be performed only when the node is shut down.
  • After a local node disk or supernode disk is restored, the local disk data will be lost. Therefore, migrate services and back up data before authorization. After the local disk is restored, log in to the Lite Server node to partition the local disk.
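The constraints above can be turned into a short checklist to run through before clicking Authorize. The sketch below is an assumption-laden illustration: `authorization_blockers` and its parameters are hypothetical names, and the two flags must be determined by you (for example, by checking the Lite Servers page); nothing here calls a real service.

```python
# Event types whose recovery wipes the local disk, per the constraints above.
DISK_RECOVERY_TYPES = {"local disk recovery", "supernode local disk recovery"}


def authorization_blockers(event_type, node_stopped, data_backed_up):
    """Return reasons to hold off on authorization (an empty list means none found)."""
    blockers = []
    kind = event_type.lower()  # the console capitalizes event types inconsistently
    if kind == "supernode redeployment" and not node_stopped:
        blockers.append("Stop the server instance on the Lite Servers page first.")
    if kind in DISK_RECOVERY_TYPES and not data_backed_up:
        blockers.append("Migrate services and back up data; local disk data will be lost.")
    return blockers
```

An empty result only means that none of the constraints encoded here were violated; the console remains the authority on whether the Authorize button is available.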

Viewing Scheduled Events

Log in to the ModelArts console. In the navigation pane on the left, click Event Center under Resource Management. You can view the event details on the displayed page. By default, events in the Authorization Pending, Authorized, and Executing states are displayed. You can remove the filter criteria to view events in all states.
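The default filter described above can be sketched as follows for readers who keep their own record of events. The dictionary keys `id` and `status` and the function name `visible_events` are hypothetical; the Event Center itself is operated through the console.

```python
# States shown by default in the Event Center, as listed above.
DEFAULT_VISIBLE = ("Authorization Pending", "Authorized", "Executing")


def visible_events(events, statuses=DEFAULT_VISIBLE):
    """Mimic the default filter; pass statuses=None to show events in all states."""
    if statuses is None:
        return list(events)
    return [event for event in events if event["status"] in statuses]
```

Calling `visible_events(events, statuses=None)` corresponds to removing the filter criteria in the console.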

Table 2 Scheduled event description

| Attribute | Description | Example |
| --- | --- | --- |
| Event ID | Unique event ID. | 5ad1df12-e3d2-4f36-b367-xxxxxxxxxxxx |
| Node Name/ID | Name and ID of the server node that initiates the event. | devserver-dd50, 1e0d95ad-5a9f-46e3-9ba6-c5f8fcxxxx |
| Event Type | For details about the event types, see Table 1. | Supernode Redeployment |
| Event Status | Current status of the event. The possible values are listed below this table. | Authorization Pending |
| Event Description | Cause of the event. | Underlying hardware fault. alarmName=XXXX,bmcip=2409:27ff:1003:0103:0011:0000:0000:xxxx,componentName=XXXX is automatically connected through CAR. |
| Obtained At | Time when the event was created. | 2025/02/19 16:05:32 GMT+08:00 |
| Executed | Time when the event entered the scheduling and execution phase. | 2025/03/03 16:23:16 GMT+08:00 |
| Operation | Authorize: Authorizing a node will affect services running on it. For events of the Supernode Redeployment type, authorization can be performed only when the node is shut down. NOTE: Redeployment of supernodes must be performed within physical supernodes. If there are 48 supernodes, redeployment is not supported and the Authorize button becomes unavailable. | -- |

Event Status values:

  • Authorization Pending: The event is waiting to be authorized. After authorization, the status changes to Authorized.
  • Authorized: The O&M task is planned to be executed but has not started. After the task starts, the event enters the Executing state.
  • Executing: The O&M task is being executed.
  • Completed: The O&M task has been executed.
  • Failed: The O&M task failed to be executed.
  • Canceled: The system canceled the O&M task.
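The event states in Table 2 form a simple lifecycle, which the sketch below models as an enumeration with a transition map. The names `EventStatus`, `NEXT`, `TERMINAL`, `next_statuses`, and `is_terminal` are hypothetical. Only the first two transitions are stated explicitly in Table 2; the step from Executing to Completed or Failed is inferred from the state descriptions, and treating Completed, Failed, and Canceled as final is an assumption. The table does not say from which states an event can be canceled, so no transition into Canceled is modeled.

```python
from enum import Enum


class EventStatus(Enum):
    AUTHORIZATION_PENDING = "Authorization Pending"
    AUTHORIZED = "Authorized"
    EXECUTING = "Executing"
    COMPLETED = "Completed"
    FAILED = "Failed"
    CANCELED = "Canceled"


# Forward transitions: the first two are stated in Table 2, the third is inferred.
NEXT = {
    EventStatus.AUTHORIZATION_PENDING: {EventStatus.AUTHORIZED},
    EventStatus.AUTHORIZED: {EventStatus.EXECUTING},
    EventStatus.EXECUTING: {EventStatus.COMPLETED, EventStatus.FAILED},
}

# Assumption: these states are final and the event does not leave them.
TERMINAL = {EventStatus.COMPLETED, EventStatus.FAILED, EventStatus.CANCELED}


def next_statuses(status):
    """Return the states that can follow `status` under the assumptions above."""
    return NEXT.get(status, set())


def is_terminal(status):
    """Return True if no further progress is expected for the event."""
    return status in TERMINAL
```

A status string shown in the console can be converted with `EventStatus("Authorization Pending")` before being passed to either helper.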

Authorization Operations

If the faulty nodes meet the requirements listed in Table 1, you can authorize Huawei technical support to perform O&M on the faulty nodes.

To do so, log in to the ModelArts console. In the navigation pane on the left, choose Event Center. Locate the target node and click Authorize in the Operation column. In the displayed dialog box, click OK. The following steps describe how to authorize Huawei technical support to perform O&M on a supernode.

  1. Log in to the ModelArts console. In the navigation pane on the left, choose Event Center. On the displayed Event Center page, view events whose Event Type is Supernode maintenance and click Authorize.
  2. The supernode maintenance event enters the Authorized state.
  3. After the supernode is repaired, the event status is Completed. In this case, the node is available.

    After the O&M is complete, Huawei technical support will disable the authorization. You do not need to perform any operation.

    For local disk and supernode disk restoration, you need to log in to the Lite Server node to partition the local disk afterwards.

Redeployment Operations

If the faulty node meets the redeployment conditions described in Table 1, log in to the ModelArts console. In the navigation pane on the left, choose Event Center under Resource Management. Locate the target node and click Redeploy in the Operation column. In the displayed dialog box, enter YES and click OK.

After the redeployment, the data on the local disk will be lost. Exercise caution. Migrate services and back up data before redeployment.

If the scheduled event does not meet the requirements listed in Table 1, the Redeploy button becomes unavailable.

After the O&M is complete, Huawei technical support will disable the authorization. You do not need to perform any operation.