Updated on 2024-12-26 GMT+08:00

Upgrading the Standard Dedicated Resource Pool Driver

Description

If GPUs or Ascend resources are used in a dedicated resource pool, you may need to customize GPU or Ascend drivers. ModelArts allows you to upgrade GPU or Ascend drivers of your dedicated resource pools.

There are two driver upgrade modes: secure upgrade and forcible upgrade.

  • Secure upgrade: Running services are not affected. After the upgrade starts, the nodes are isolated so that no new jobs can be delivered to them. The upgrade is performed once the existing jobs on the nodes are complete, so a secure upgrade may take a long time.
  • Forcible upgrade: The drivers are directly upgraded, regardless of whether there are running jobs.
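
The difference between the two modes can be illustrated with a small, purely local model. This is a minimal sketch and not ModelArts code: the Node class, the start_upgrade function, and the "secure" and "forcible" mode strings are hypothetical names introduced only for illustration. It assumes that a secure upgrade isolates every node and defers busy nodes, while a forcible upgrade proceeds on every node and puts running jobs at risk.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a resource pool node."""
    name: str
    running_jobs: list = field(default_factory=list)
    isolated: bool = False  # True means no new jobs can be delivered


def start_upgrade(nodes, mode):
    """Return (upgrade_now, waiting, jobs_at_risk) for the chosen mode."""
    upgrade_now, waiting, jobs_at_risk = [], [], []
    for node in nodes:
        if mode == "secure":
            # Secure: isolate the node, then wait for existing jobs to finish.
            node.isolated = True
            if node.running_jobs:
                waiting.append(node.name)
            else:
                upgrade_now.append(node.name)
        elif mode == "forcible":
            # Forcible: upgrade directly; running jobs may fail.
            upgrade_now.append(node.name)
            jobs_at_risk.extend(node.running_jobs)
        else:
            raise ValueError(f"unknown upgrade mode: {mode}")
    return upgrade_now, waiting, jobs_at_risk
```

For example, given one node running a job and one idle node, the secure branch upgrades only the idle node and leaves the busy one waiting, whereas the forcible branch upgrades both and reports the running job as at risk.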

Constraints

  • The target dedicated resource pool must be running and must contain GPU or Ascend resources.
  • For a logical resource pool, the driver can be upgraded only after node binding is enabled. To enable node binding, submit a service ticket to contact Huawei engineers.

Upgrading the Driver

  1. Log in to the ModelArts console. In the navigation pane on the left, choose AI Dedicated Resource Pools > Elastic Clusters.
  2. Locate the target resource pool in the list and choose > Upgrade Driver in the Operation column.
  3. In the displayed dialog box, you can view the driver type, number of instances, current version, target version, upgrade mode, upgrade scope, and rolling switch of the dedicated resource pool.
    • Target Version: Select a target driver version from the drop-down list. Newly added nodes may run a different driver version from the existing nodes. To align them, select the current driver version for Target Version. After the upgrade, all nodes run the same version.
    • Upgrade Mode: You can select Secure upgrade or Forcible upgrade.
      • Secure upgrade: Perform the upgrade when no job is running on the node. The upgrade may take a long time.
      • Forcible upgrade: Ignore the running jobs and perform the upgrade directly. This may cause the running jobs to fail.
    • Rolling Mode: Once enabled, you can upgrade the driver in rolling mode. Currently, By node percentage and By instance quantity are supported.
      • By node percentage: The number of instances to be upgraded is the percentage multiplied by the total number of instances in the resource pool.
      • By instance quantity: The number of instances to be upgraded is the value of this parameter.

      Which instances are upgraded depends on the upgrade mode.

      • If Secure upgrade is selected, the instances without services are upgraded.
      • If Forcible upgrade is selected, random instances are upgraded.
      • To check whether a node has any services, go to the resource pool details page. On the Nodes tab, check whether all GPUs and Ascend chips are available. If they are, the node has no services.
      • During the rolling upgrade, the nodes with abnormal drivers do not affect the upgrade and will also be upgraded.
    Figure 1 Upgrading a driver

  4. Click OK to start the driver upgrade.
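
The rolling-mode rules in step 3 can be worked through with a short local calculation. This is a sketch under stated assumptions, not the console's actual logic: batch_size and pick_instances are hypothetical helper names, the "percentage", "quantity", "secure", and "forcible" strings are illustrative labels, and rounding a fractional percentage result up is an assumption because the source does not say how fractions are handled.

```python
import math
import random


def batch_size(total, by, value):
    """Number of instances to upgrade in one rolling batch."""
    if by == "percentage":
        # By node percentage: percentage multiplied by the total instance count.
        # Rounding up is an assumption; the source does not specify rounding.
        count = math.ceil(total * value / 100)
    elif by == "quantity":
        # By instance quantity: the parameter value is used directly.
        count = value
    else:
        raise ValueError(f"unknown rolling mode: {by}")
    return max(0, min(count, total))


def pick_instances(instances, mode, count, rng=None):
    """Choose instances for a batch.

    instances maps an instance name to True if it has services.
    """
    if mode == "secure":
        # Secure upgrade: only instances without services are upgraded.
        idle = [name for name, busy in instances.items() if not busy]
        return idle[:count]
    if mode == "forcible":
        # Forcible upgrade: random instances are upgraded.
        rng = rng or random.Random()
        return rng.sample(list(instances), min(count, len(instances)))
    raise ValueError(f"unknown upgrade mode: {mode}")
```

With 10 instances and a setting of 30 percent, batch_size returns 3. In secure mode the batch can be smaller than that number if fewer instances are free of services.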