Updated on 2024-05-08 GMT+08:00

Upgrading a Resource Pool Driver

Description

If a dedicated resource pool uses GPU or Ascend resources, you may need to customize its GPU or Ascend drivers. ModelArts allows you to upgrade the GPU or Ascend drivers of your dedicated resource pools.

There are two driver upgrade modes: secure upgrade and forcible upgrade.

  • Secure upgrade: Running services are not affected. After the upgrade starts, the nodes are isolated so that no new jobs can be delivered to them. The upgrade is performed only after the existing jobs on the nodes are complete, so a secure upgrade may take a long time.
  • Forcible upgrade: The drivers are upgraded immediately, regardless of whether jobs are running on the nodes.
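
The difference between the two modes can be sketched as a small simulation. This is an illustrative model only, not ModelArts code: the Node class, its fields, and the upgrade_step function are hypothetical names introduced here to show the order of operations described above.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical model of a pool node; not a ModelArts data structure.
    name: str
    running_jobs: int = 0
    isolated: bool = False  # an isolated node accepts no new jobs
    upgraded: bool = False


def upgrade_step(node: Node, mode: str) -> str:
    """Advance one node by one step and report what happened."""
    if mode == "forcible":
        # Forcible: upgrade right away, even if jobs are still running.
        node.upgraded = True
        return "upgraded"
    if mode == "secure":
        # Secure: isolate first, then upgrade only once existing jobs finish.
        node.isolated = True
        if node.running_jobs > 0:
            return "waiting"
        node.upgraded = True
        return "upgraded"
    raise ValueError(f"unknown upgrade mode: {mode}")


if __name__ == "__main__":
    busy = Node("node-1", running_jobs=2)
    print(upgrade_step(busy, "secure"))  # node is isolated, jobs still running
    busy.running_jobs = 0                # existing jobs have completed
    print(upgrade_step(busy, "secure"))  # the upgrade can now proceed
```

In this model a secure upgrade keeps returning "waiting" until the job count drops to zero, which is why it can take much longer than a forcible upgrade.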

Constraints

  • The target dedicated resource pool must be running and must contain GPU or Ascend resources.
  • For a logical resource pool, the driver can be upgraded only after node binding is enabled. To enable node binding, submit a service ticket to contact Huawei engineers.
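
The constraints above amount to a simple eligibility check. The sketch below is a hypothetical illustration: the function name, its parameters, and the status string "running" are assumptions made for this example and do not correspond to a documented ModelArts API.

```python
from typing import Optional


def can_upgrade_driver(
    status: str,
    accelerator: Optional[str],
    logical: bool = False,
    node_binding: bool = False,
) -> bool:
    """Return True if a pool meets the constraints listed above.

    All parameter names and values are hypothetical:
    status       -- pool state, assumed to be "running" when the pool is up
    accelerator  -- "gpu", "ascend", or None if the pool has neither
    logical      -- True for a logical resource pool
    node_binding -- True if node binding has been enabled for the pool
    """
    if status != "running":
        return False  # the pool must be running
    if accelerator not in ("gpu", "ascend"):
        return False  # the pool must contain GPU or Ascend resources
    if logical and not node_binding:
        return False  # logical pools need node binding enabled first
    return True


if __name__ == "__main__":
    print(can_upgrade_driver("running", "gpu"))
    print(can_upgrade_driver("running", "ascend", logical=True))
```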

Upgrading the Driver

  1. Log in to the ModelArts management console. In the navigation pane, choose Dedicated Resource Pools > Elastic Cluster.
  2. In the Operation column of the target resource pool, choose More > Upgrade Driver.
  3. The Upgrade Driver dialog box displays the driver type, number of nodes, current version, target version, and upgrade mode of the dedicated resource pool. Configure the following parameters:
    • Target Version: Select a target driver version from the drop-down list.
    • Upgrade Mode: Select Secure upgrade or Forcible upgrade.
    • Rolling Mode: If enabled, the driver is upgraded in batches rather than on all nodes at once. Rolling by node percentage and by node quantity are supported. If By node percentage is selected, the number of nodes upgraded in each batch is the configured percentage multiplied by the total number of nodes in the resource pool. If By node quantity is selected, the number of nodes upgraded in each batch is the quantity you configured.
    Figure 1 Upgrading a driver
  4. Click OK to start the driver upgrade.
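
The batch sizing described under Rolling Mode can be worked through with a short calculation. This is a minimal sketch under stated assumptions: the function names are hypothetical, and rounding a fractional result up to a whole node is an assumption, because the rounding behavior is not specified in this document.

```python
def batch_size(total_nodes: int, by: str, value: int) -> int:
    """Nodes upgraded per batch for the two rolling options.

    by == "percentage": value is a whole-number percentage (for example, 25).
    by == "quantity":   value is the node count configured for each batch.
    """
    if by == "percentage":
        # Percentage of the total node count. Rounding up is an assumption.
        size = (total_nodes * value + 99) // 100
    elif by == "quantity":
        size = value
    else:
        raise ValueError(f"unknown rolling option: {by}")
    # Keep the result between 1 and the pool size (also an assumption).
    return max(1, min(size, total_nodes))


def plan_batches(total_nodes: int, size: int) -> list:
    """Split the pool into consecutive batches of at most `size` nodes."""
    return [min(size, total_nodes - start) for start in range(0, total_nodes, size)]


if __name__ == "__main__":
    size = batch_size(10, "percentage", 25)
    print(size, plan_batches(10, size))
    size = batch_size(10, "quantity", 4)
    print(size, plan_batches(10, size))
```

With these assumptions, a 10-node pool at 25% gives batches of 3, 3, 3, and 1 nodes, while a fixed quantity of 4 gives batches of 4, 4, and 2 nodes.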