Upgrading the Standard Dedicated Resource Pool Driver

Description

If a node in a dedicated resource pool has GPU or NPU resources and its performance does not meet your requirements, you can upgrade the driver to resolve known issues, improve performance, or support new functions. This helps ensure the performance and compatibility of the resource pool.

ModelArts allows you to upgrade the GPU/NPU driver of a dedicated resource pool on the ModelArts console as required.

Secure Upgrade and Forcible Upgrade

The driver can be upgraded in either of two modes: secure upgrade or forcible upgrade. Table 1 compares them.

**Table 1** Secure upgrade and forcible upgrade
Item	Secure Upgrade	Forcible Upgrade
Introduction	The driver is upgraded only when the node is idle, so running tasks are not affected and the impact on services is reduced. After the upgrade starts, the node is isolated so that no new jobs can be delivered to it. The driver is upgraded only after the existing jobs on the node are complete, so this mode may take a long time.	Running tasks on the node are ignored and the driver is upgraded directly. This mode is fast because it does not wait for the node to become idle.
Scenario	Non-urgent and gradual upgrade	Urgent and fast upgrade
Notes	The upgrade period is rather long as nodes must be idle first. Before the upgrade, plan the node idle time to reduce the impact on services.	Running tasks may be interrupted or fail. Exercise caution when using this mode.
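
The difference between the two modes in Table 1 can be modeled with a short sketch. This is only an illustration of the behavior described above, not ModelArts code: the `Node` class, the `next_step` function, and the mode strings `"secure"` and `"forcible"` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a resource pool node."""
    name: str
    running_jobs: int
    isolated: bool = False


def next_step(node: Node, mode: str) -> str:
    """Return what happens to a node once the upgrade has started."""
    if mode == "forcible":
        # Running jobs are ignored, so they may be interrupted or fail.
        return "upgrade"
    if mode == "secure":
        # The node is isolated first: no new jobs are delivered to it.
        node.isolated = True
        # The driver is upgraded only after the existing jobs are complete.
        return "upgrade" if node.running_jobs == 0 else "wait"
    raise ValueError(f"unknown upgrade mode: {mode!r}")


busy = Node("node-1", running_jobs=2)
print(next_step(busy, "secure"))    # wait
print(next_step(busy, "forcible"))  # upgrade
```

In this model, a secure upgrade of a busy node keeps returning `wait` until its job count reaches zero, which is why that mode can take a long time.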

Constraints

The target dedicated resource pool must be running and must contain GPU or NPU resources.
To upgrade the driver of a logical resource pool or logical subpool, you need to enable node binding.
Upgrading the driver of a standard dedicated resource pool does not affect nodes bound to a logical subpool that has upgraded its driver. To upgrade the drivers of these nodes, upgrade the driver of the logical subpool. To upgrade the entire physical pool, disable node binding for the logical pool.
Nodes in the resource pool need to be restarted during a driver upgrade. Perform the upgrade during off-peak hours so that running tasks are not affected. To check the node status, go to the resource pool details page. For details, see Viewing Resource Pool Nodes.

Upgrading the GPU/NPU Driver in a Dedicated Resource Pool

Log in to the ModelArts console. In the navigation pane on the left, choose Standard Cluster under Resource Management.
Locate the target resource pool in the list and choose More > Upgrade Driver in the Operation column.

In the displayed dialog box, you can view the driver type, number of instances, current version, target version, upgrade mode, upgrade scope, and rolling switch of the dedicated resource pool. Set the parameters by referring to Table 2.

**Table 2** Parameters for upgrading a driver
Parameter	Description
Target Version	Select the target version from the drop-down list. Newly added nodes may run a different driver version from the existing nodes. In this case, select the current driver version as the target version so that all nodes run the same version after the upgrade.
Upgrade Mode	Select Secure upgrade or Forcible upgrade. For details about the differences, see Secure Upgrade and Forcible Upgrade. Secure upgrade: Perform the upgrade when no job is running on the node. The upgrade may take a long time. Forcible upgrade: Ignore the running jobs and perform the upgrade directly. This may cause the running jobs to fail.
Rolling	Once enabled, you can upgrade the driver in rolling mode. Rolling upgrade is a gradual instance replacement method that applies to scenarios where service continuity is required. Instances are upgraded in batches to ensure that some instances are running properly during the upgrade, reducing the downtime. During the rolling upgrade, the nodes with abnormal drivers do not affect the upgrade and will also be upgraded.
Rolling Mode	By node percentage and By instance quantity are supported. By node percentage: The number of instances upgraded in each batch is the percentage multiplied by the total number of instances in the resource pool. By instance quantity: The number of instances upgraded in each batch is the value of this parameter. Which instances are upgraded depends on the upgrade mode. If Secure upgrade is selected, instances without services are upgraded. To check whether a node has services, go to the Nodes tab on the resource pool details page and check whether all GPUs and NPUs are available. If they are, the node has no services. If Forcible upgrade is selected, instances are selected at random.
Node Percentage	Set this parameter if Rolling Mode is set to By node percentage. Number of instances upgraded in each batch = Node percentage × Total number of instances in the resource pool.
Instances	Set this parameter if Rolling Mode is set to By instance quantity.
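
The batch size arithmetic for the two rolling modes in Table 2 can be worked through in a small sketch. The function names `batch_size` and `plan_batches` and the mode strings are hypothetical. The source does not say how a fractional result is rounded, so rounding up is an assumption made here, not documented ModelArts behavior.

```python
def batch_size(total_instances: int, rolling_mode: str, value: int) -> int:
    """Instances upgraded per batch for a given rolling mode (illustrative)."""
    if total_instances <= 0:
        raise ValueError("total_instances must be positive")
    if rolling_mode == "percentage":
        if not 0 < value <= 100:
            raise ValueError("percentage must be in the range 1-100")
        # Percentage x total instances, using integer ceiling division.
        # ASSUMPTION: fractions are rounded up; the source does not specify this.
        size = -(-total_instances * value // 100)
    elif rolling_mode == "quantity":
        if value <= 0:
            raise ValueError("instance quantity must be positive")
        size = value
    else:
        raise ValueError(f"unknown rolling mode: {rolling_mode!r}")
    # A batch cannot contain more instances than the pool has.
    return min(size, total_instances)


def plan_batches(total_instances: int, size: int) -> list:
    """Split the pool into consecutive batches of at most `size` instances."""
    full, rest = divmod(total_instances, size)
    return [size] * full + ([rest] if rest else [])


size = batch_size(10, "percentage", 30)
print(size)                    # 3
print(plan_batches(10, size))  # [3, 3, 3, 1]
```

With 10 instances and a node percentage of 30, each batch holds 3 instances under this assumption, so the pool is covered in four batches.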

Figure 1 Upgrading a driver

Click OK to start the driver upgrade.
To verify the upgrade, locate the target resource pool in the resource pool list and choose More > Upgrade Driver. In the displayed dialog box, check whether the current version is the target version. If it is, the driver has been upgraded.

FAQ

How do I locate a faulty node in a standard resource pool?

In a standard resource pool, ModelArts will add a taint to a faulty Kubernetes node so that jobs will not be scheduled to the tainted node. For details, see Faulty Nodes in a Standard Resource Pool.
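
As a rough illustration of how taints appear on Kubernetes nodes, the sketch below lists tainted nodes from a node list in the JSON structure produced by `kubectl get nodes -o json`. It assumes you can obtain such output for your cluster, which the source does not confirm. The source does not name the taint key that ModelArts applies, so the key `example.com/faulty` in the sample data and the function name `tainted_nodes` are hypothetical.

```python
import json


def tainted_nodes(node_list: dict) -> dict:
    """Map each tainted node name to its taints as 'key:effect' strings."""
    result = {}
    for item in node_list.get("items", []):
        # spec.taints is absent on nodes that have no taints.
        taints = item.get("spec", {}).get("taints") or []
        if taints:
            name = item["metadata"]["name"]
            result[name] = [f'{t["key"]}:{t.get("effect", "")}' for t in taints]
    return result


# Sample data in the shape of a Kubernetes NodeList.
# The taint key below is a placeholder, not the key ModelArts uses.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "node-a"}, "spec": {}},
    {"metadata": {"name": "node-b"},
     "spec": {"taints": [{"key": "example.com/faulty", "effect": "NoSchedule"}]}}
  ]
}
""")

print(tainted_nodes(sample))  # {'node-b': ['example.com/faulty:NoSchedule']}
```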
