Updated on 2024-11-11 GMT+08:00

Upgrading the Lite Cluster Resource Pool Driver

Scenarios

If a dedicated resource pool uses GPU or Ascend resources, you may need to customize the GPU or Ascend driver. ModelArts allows you to upgrade the GPU or Ascend drivers of your dedicated resource pools.

There are two driver upgrade modes: secure upgrade and forcible upgrade.

  • Secure upgrade: Running services are not affected. When the upgrade starts, the nodes are isolated so that no new jobs can be delivered to them. The driver is upgraded only after the existing jobs on the nodes are complete, so a secure upgrade may take a long time.
  • Forcible upgrade: The drivers are upgraded immediately, regardless of whether jobs are running on the nodes.
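
The difference between the two modes can be sketched as a simple check on a single node. This is an illustrative model only, not ModelArts code: `UpgradeMode` and `can_upgrade_now` are hypothetical names, and the number of running jobs is assumed to be known.

```python
from enum import Enum


class UpgradeMode(Enum):
    """Hypothetical labels for the two upgrade modes described above."""
    SECURE = "secure"
    FORCIBLE = "forcible"


def can_upgrade_now(mode: UpgradeMode, running_jobs: int) -> bool:
    """Return True if the driver on a node may be upgraded immediately."""
    if running_jobs < 0:
        raise ValueError("running_jobs must be non-negative")
    if mode is UpgradeMode.FORCIBLE:
        # Forcible upgrade proceeds regardless of running jobs.
        return True
    # Secure upgrade waits until the existing jobs on the isolated node finish.
    return running_jobs == 0
```

Under this model, a node with two running jobs is upgraded at once in forcible mode, but in secure mode only after both jobs are complete.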

Constraints

The target dedicated resource pool must be running and must contain GPU or Ascend resources.

Upgrading the Driver

  1. Log in to the ModelArts console. In the navigation pane, choose Dedicated Resource Pools > Elastic Cluster.
  2. In the resource pool list, locate the resource pool whose driver you want to upgrade and choose More > Upgrade Driver in the Operation column.
  3. The Upgrade Driver dialog box displays the driver type, number of nodes, current version, target version, and upgrade mode of the dedicated resource pool. Modify the following parameters:
    • Target Version: Select a target driver version from the drop-down list.
    • Upgrade Mode: Select Secure upgrade or Forcible upgrade.
    • Rolling Mode: When this option is enabled, the driver is upgraded in rolling mode. Two options are currently supported: By node percentage and By node quantity.
      • By node percentage: The number of nodes to be upgraded is the percentage multiplied by the total number of nodes in the resource pool.
      • By node quantity: The number of nodes to be upgraded is the value of this parameter.

      The policy for selecting the nodes to upgrade depends on the upgrade mode.

      • If Secure upgrade is selected, the nodes without services are upgraded.
      • If Forcible upgrade is selected, random nodes are upgraded.
      • To check whether a node has any services, go to the resource pool details page. On the Nodes tab, check whether all GPUs and Ascend chips are available. If all of them are available, the node has no services.
      • During a rolling upgrade, nodes with abnormal drivers do not block the upgrade and are upgraded as well.
  4. Click OK to start the driver upgrade.
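
The rolling mode arithmetic and the node selection policies in step 3 can be sketched as follows. This is a minimal illustration, not the ModelArts implementation. The names `Node`, `nodes_to_upgrade`, and `select_nodes` are hypothetical. Rounding the percentage result up and capping the count at the pool size are assumptions, because the text above does not specify either behavior.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    total_devices: int      # GPUs or Ascend chips on the node
    available_devices: int  # devices shown as available on the Nodes tab

    @property
    def has_services(self) -> bool:
        # A node has no services when all of its devices are available.
        return self.available_devices < self.total_devices


def nodes_to_upgrade(total_nodes: int,
                     percentage: Optional[float] = None,
                     quantity: Optional[int] = None) -> int:
    """Number of nodes to upgrade for the chosen rolling mode option."""
    if (percentage is None) == (quantity is None):
        raise ValueError("specify exactly one of percentage or quantity")
    if percentage is not None:
        # By node percentage: percentage x total nodes (rounding up is assumed).
        count = math.ceil(total_nodes * percentage / 100)
    else:
        # By node quantity: the parameter value itself.
        count = quantity
    # Assumption: the count cannot exceed the pool size or go below zero.
    return max(0, min(count, total_nodes))


def select_nodes(nodes: List[Node], mode: str, count: int,
                 rng: Optional[random.Random] = None) -> List[Node]:
    """Pick the nodes to upgrade according to the upgrade mode."""
    if mode == "secure":
        # Secure upgrade: only nodes without services are upgraded.
        idle = [n for n in nodes if not n.has_services]
        return idle[:count]
    if mode == "forcible":
        # Forcible upgrade: random nodes are upgraded.
        rng = rng or random.Random()
        return rng.sample(nodes, min(count, len(nodes)))
    raise ValueError(f"unknown upgrade mode: {mode}")
```

Under these assumptions, `nodes_to_upgrade(10, percentage=25)` returns 3, and a secure selection from a pool in which only two nodes have all devices available yields at most those two nodes.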