Updated on 2024-10-29 GMT+08:00

Upgrading the Standard Dedicated Resource Pool Driver

Description

If a dedicated resource pool uses GPU or Ascend resources, you may need to customize the GPU or Ascend drivers. ModelArts allows you to upgrade the GPU or Ascend drivers of your dedicated resource pools.

There are two driver upgrade modes: secure upgrade and forcible upgrade.

  • Secure upgrade: Running services are not affected. After the upgrade starts, the nodes are isolated so that no new jobs can be delivered to them. The drivers are upgraded only after the existing jobs on the nodes are complete. A secure upgrade may therefore take a long time, because it waits for those jobs to finish.
  • Forcible upgrade: The drivers are directly upgraded, regardless of whether there are running jobs.
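
The difference between the two modes can be illustrated with a minimal sketch. This is a toy model rather than ModelArts code: the Node class, its fields, and the start_upgrade function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Toy model of a resource pool node (hypothetical, for illustration only)."""
    name: str
    running_jobs: int = 0
    accepts_new_jobs: bool = True
    driver_version: str = "old"


def start_upgrade(node: Node, target_version: str, mode: str) -> bool:
    """Apply one upgrade attempt to a node; return True if the driver was upgraded."""
    if mode == "forcible":
        # Forcible upgrade: upgrade directly, regardless of running jobs.
        node.driver_version = target_version
        return True
    if mode == "secure":
        # Secure upgrade: isolate the node first so no new jobs are delivered.
        node.accepts_new_jobs = False
        if node.running_jobs > 0:
            # Existing jobs must complete before the driver is touched.
            return False
        node.driver_version = target_version
        return True
    raise ValueError(f"unknown upgrade mode: {mode!r}")
```

In this model, a secure upgrade on a busy node only isolates it, and the driver changes on a later attempt once running_jobs has dropped to 0. A forcible upgrade changes the driver on the first attempt.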

Constraints

  • The target dedicated resource pool must be running and must contain GPU or Ascend resources.
  • For a logical resource pool, the driver can be upgraded only after node binding is enabled. To enable node binding, submit a service ticket to contact Huawei engineers.
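
These constraints can be expressed as a simple pre-check. The sketch below assumes a resource pool is described by a plain dictionary. The keys (status, has_gpu, has_ascend, type, node_binding_enabled) are hypothetical and do not correspond to any ModelArts API.

```python
def upgrade_blockers(pool: dict) -> list:
    """Return the reasons why a driver upgrade is not allowed (empty list if allowed).

    All dictionary keys are hypothetical names used only for this sketch.
    """
    reasons = []
    if pool.get("status") != "running":
        reasons.append("resource pool is not running")
    if not (pool.get("has_gpu") or pool.get("has_ascend")):
        reasons.append("resource pool has no GPU or Ascend resources")
    if pool.get("type") == "logical" and not pool.get("node_binding_enabled"):
        reasons.append("node binding is not enabled for the logical resource pool")
    return reasons
```

An empty result means that none of the listed constraints is violated.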

Upgrading the Driver

  1. Log in to the ModelArts console. In the navigation pane on the left, choose AI Dedicated Resource Pools > Elastic Clusters.
  2. In the resource pool list, locate the target resource pool, and choose > Upgrade Driver in the Operation column.
  3. The Upgrade Driver dialog box displays the driver type, number of nodes, current version, target version, and upgrade mode of the dedicated resource pool. Modify the following parameters:
    • Target Version: Select a target driver version from the drop-down list. Newly added nodes may run a driver version different from that of the existing nodes. To align them, select the current driver version for Target Version. After the upgrade, all nodes run the same driver version.
    • Upgrade Mode: Select Secure upgrade or Forcible upgrade.
    • Rolling Mode: If enabled, the driver is upgraded in rolling mode. Two options are currently supported: By node percentage and By node quantity.
      • By node percentage: The number of nodes to be upgraded is the percentage multiplied by the total number of nodes in the resource pool.
      • By node quantity: The number of nodes to be upgraded is the value of this parameter.

      The policy for selecting the nodes to upgrade depends on the upgrade mode.

      • If Secure upgrade is selected, the nodes without services are upgraded.
      • If Forcible upgrade is selected, random nodes are upgraded.
      • To check whether a node has any services, go to the resource pool details page. On the Nodes tab, check whether all GPUs and Ascend chips are available. If they are, the node has no services.
        Figure 1 Checking whether a node has services

      • During a rolling upgrade, nodes with abnormal drivers do not affect the upgrade. These nodes are upgraded as well.
    Figure 2 Upgrading a driver

  4. Click OK to start the driver upgrade.
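
The Rolling Mode rules in step 3 can be sketched as follows. This is a minimal illustration under assumptions, not the ModelArts implementation. The function names are hypothetical. Rounding the percentage result up is an assumption, because the rounding rule is not stated here. Picking idle nodes at random in secure mode is also an assumption.

```python
import math
import random


def nodes_to_upgrade(total_nodes, *, percentage=None, quantity=None):
    """Number of nodes to upgrade for By node percentage or By node quantity."""
    if (percentage is None) == (quantity is None):
        raise ValueError("specify exactly one of percentage or quantity")
    if percentage is not None:
        # Percentage multiplied by the total number of nodes.
        # Rounding up is an assumption of this sketch.
        count = math.ceil(total_nodes * percentage / 100)
    else:
        count = quantity
    # Never report more nodes than the pool contains.
    return max(0, min(count, total_nodes))


def pick_nodes(has_services, count, mode, rng=None):
    """Choose node names; has_services maps node name -> True if it runs services."""
    rng = rng or random.Random()
    if mode == "secure":
        # Secure upgrade: only nodes without services are candidates.
        candidates = [name for name, busy in has_services.items() if not busy]
    elif mode == "forcible":
        # Forcible upgrade: any node may be chosen at random.
        candidates = list(has_services)
    else:
        raise ValueError(f"unknown upgrade mode: {mode!r}")
    return rng.sample(candidates, min(count, len(candidates)))
```

For example, with 10 nodes and a percentage of 25, this sketch yields 3 nodes because of the assumed round-up. With By node quantity set to 4, it yields 4 nodes.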