Updated on 2025-08-20 GMT+08:00

Upgrading the Lite Cluster Resource Pool Driver

Scenario

If a node in a Lite Cluster resource pool has GPU or Ascend resources and its performance does not meet your requirements, you can upgrade the driver to resolve known issues, improve performance, or support new functions. This helps maintain resource pool performance and compatibility.

ModelArts allows you to upgrade the GPU/Ascend driver of a Lite Cluster resource pool on the ModelArts console as required.

Secure Upgrade and Forcible Upgrade

There are two driver upgrade modes: secure upgrade and forcible upgrade. Table 1 compares them.

Table 1 Secure upgrade and forcible upgrade

| Item | Secure Upgrade | Forcible Upgrade |
| --- | --- | --- |
| Introduction | Upgrade the driver when the node is idle, which does not affect running tasks. The smooth upgrade reduces the impact on services. After the upgrade starts, the nodes will be isolated (new jobs cannot be delivered). Only after the existing jobs on the node are complete will the upgrade be performed. This may take a rather long time as existing jobs must be completed first. | Running tasks on the node will be ignored and the driver will be directly upgraded. This upgrade mode is fast as you do not need to wait until the node is idle. |
| Scenario | Non-urgent and gradual upgrade | Urgent and fast upgrade |
| Precautions | The upgrade period is rather long as nodes must be idle first. Before the upgrade, plan the node idle time to reduce the impact on services. | Running tasks may be interrupted or fail. Exercise caution when using this mode. |
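
The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the behavior described in Table 1, not ModelArts code: the Node class, its fields, and the start_upgrade function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical node model; these fields are not a ModelArts API."""
    name: str
    running_jobs: int
    schedulable: bool = True


def start_upgrade(node: Node, mode: str) -> str:
    """Return the action taken on a node when a driver upgrade starts."""
    if mode == "secure":
        # The node is isolated first, so no new jobs are delivered to it.
        node.schedulable = False
        # The driver is upgraded only after existing jobs are complete.
        if node.running_jobs > 0:
            return "wait_for_jobs"
        return "upgrade_now"
    if mode == "forcible":
        # Running jobs are ignored; they may be interrupted or fail.
        return "upgrade_now"
    raise ValueError(f"unknown upgrade mode: {mode!r}")
```

In this sketch, a busy node returns wait_for_jobs in secure mode and upgrade_now in forcible mode, which is why a secure upgrade can take much longer.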

Notes and Constraints

  • The target Lite Cluster resource pool must be running and contain GPU or Ascend resources.
  • The upgrade restarts the node. Perform it during off-peak hours to avoid affecting running tasks. You can view the node usage on the Node Management page of the resource pool details page.

    Upgrading the driver will restart the node, which may result in the loss of any customized configurations made on the host.
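
If you track resource pool information in your own scripts, the first constraint can be checked before you open the console. The following is a minimal sketch under stated assumptions: the pool dictionary, its status and accelerators keys, and the upgrade_blockers function are hypothetical and do not correspond to any ModelArts API.

```python
from __future__ import annotations


def upgrade_blockers(pool: dict) -> list[str]:
    """List the reasons a pool does not meet the upgrade constraints.

    `pool` is a hypothetical record such as
    {"status": "running", "accelerators": ["gpu"]}.
    """
    blockers = []
    # The resource pool must be running.
    if pool.get("status") != "running":
        blockers.append("resource pool is not running")
    # The resource pool must contain GPU or Ascend resources.
    if not {"gpu", "ascend"} & set(pool.get("accelerators", [])):
        blockers.append("resource pool has no GPU or Ascend resources")
    return blockers
```

An empty list means both constraints are met. The check does not cover the restart: you still need to plan an off-peak window and back up any customized host configurations yourself.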

Upgrading the GPU/Ascend Driver in a Lite Cluster Resource Pool

  1. Log in to the ModelArts console. In the navigation pane on the left, choose Lite Cluster under Resource Management. In the resource pool list, locate the target resource pool, and choose > Upgrade Driver.

    Alternatively, click the resource pool name in the list to access its details page. In the navigation pane on the left, choose Node Pool Management. Locate the target node pool and choose More > Upgrade Driver in the Operation column.

  2. In the displayed dialog box, you can view the driver type, number of instances, current version, target version, upgrade mode, upgrade scope, and rolling switch of the Lite Cluster resource pool. Set the parameters by referring to Table 2.
    Table 2 Parameters

    Parameter

    Description

    Target Version

    Choose the target version from the drop-down list.

    Newly added nodes may run a different driver version from the existing nodes. In that case, select the current driver version for Target Version so that all nodes run the same version after the upgrade.

    Upgrade Mode

    Select Secure upgrade or Forcible upgrade. For details about the differences, see Secure Upgrade and Forcible Upgrade.

    • Secure upgrade: Perform the upgrade when no job is running on the node. The upgrade may take a long time.
    • Forcible upgrade: Ignore the running jobs and perform the upgrade directly. This may cause the running jobs to fail.

    Rolling

    Once this option is enabled, the driver is upgraded in rolling mode.

    Rolling upgrade is a gradual instance replacement method that applies to scenarios where service continuity is required. Instances are upgraded in batches to ensure that some instances are running properly during the upgrade, reducing the downtime.

    Nodes with abnormal drivers will be upgraded during a rolling upgrade, just like other nodes.

    Rolling Mode

    Currently, By node percentage and By instance quantity are supported.

    • By node percentage: The number of instances to be upgraded in each batch is the percentage multiplied by the total number of instances in the resource pool.
    • By instance quantity: The number of instances to be upgraded in each batch is the value of this parameter.

    The policy for selecting the nodes to upgrade depends on the upgrade mode.

    • If Secure upgrade is selected, the instances without services are upgraded.

      To check whether a node has any service, go to the resource pool details page. On the Nodes tab, check whether all GPUs and Ascend chips are available. If yes, the node has no services.

    • If Forcible upgrade is selected, instances are selected at random.

    Node Percentage

    Set this parameter if Rolling Mode is set to By node percentage. Number of instances to be upgraded in each batch = Node Percentage x Total number of instances in the resource pool.

    Instances

    Set this parameter if Rolling Mode is set to By instance quantity. The value is the number of instances to be upgraded in each batch.

    Figure 1 Upgrading a driver

  3. Click OK to start the driver upgrade.

    To verify the result, locate the target resource pool in the resource pool list, and choose > Upgrade Driver. On the displayed page, check whether the current version is the target version. If it is, the driver has been upgraded.
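
To estimate how a rolling upgrade proceeds before you set Rolling Mode, the batch size and node selection described in Table 2 can be sketched as follows. This is an illustration only. The function names and the (name, busy) node tuples are hypothetical, and the rounding rule (round up, at least one instance per batch, at most the pool size) is an assumption, because the rounding behavior is not documented here.

```python
import math
import random


def batch_size(total: int, rolling_mode: str, value: float) -> int:
    """Number of instances upgraded in each batch."""
    if total <= 0:
        return 0
    if rolling_mode == "percentage":
        # Batch size = node percentage x total number of instances.
        # Rounding up is an assumption, not documented behavior.
        size = math.ceil(total * value / 100)
    elif rolling_mode == "quantity":
        size = int(value)
    else:
        raise ValueError(f"unknown rolling mode: {rolling_mode!r}")
    # Assumption: keep the result between 1 and the pool size.
    return max(1, min(size, total))


def pick_batch(nodes, upgrade_mode: str, size: int, rng=None):
    """Choose node names for one batch from (name, busy) tuples."""
    if upgrade_mode == "secure":
        # Only instances without services are upgraded.
        idle = [name for name, busy in nodes if not busy]
        return idle[:size]
    if upgrade_mode == "forcible":
        # Instances are chosen at random, whether busy or not.
        rng = rng or random.Random()
        names = [name for name, _ in nodes]
        return rng.sample(names, min(size, len(names)))
    raise ValueError(f"unknown upgrade mode: {upgrade_mode!r}")
```

For example, under these assumptions a pool of 10 instances with Node Percentage set to 25 gives batch_size(10, "percentage", 25) == 3, so a full rolling upgrade would take four batches.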