Upgrading the GPU Driver on a Lite Server
Description
In high-performance computing and deep learning, users often require the latest GPU drivers and related software to optimize computational performance. However, many GPU models currently on the market come pre-installed with outdated driver and software versions, which can lead to compatibility issues when users attempt to use the latest versions of CUDA.
To enhance user experience, Lite Servers provide a one-click software upgrade feature. This supports the automated upgrading of GPU drivers, CUDA, nvidia-fabricmanager, nv_peer_mem, and NCCL. Users can query supported software versions via commands and dispatch upgrade tasks, eliminating the tedious process of manually logging into different servers for software downloads, installation, and verification. Furthermore, the upgrade process automatically handles the deprecation of nv_peer_mem and the enablement of nvidia-peermem, ensuring version consistency across all components while improving system stability and reliability.
| Software | Description | Version |
|---|---|---|
| GPU driver | NVIDIA kernel driver for the GPUs. Each CUDA release requires a minimum driver version, so the driver and CUDA versions must be compatible. | 550.90.07 |
| CUDA | A parallel computing platform and programming model used to develop GPU-accelerated applications. | 12.4 |
| nvidia-fabricmanager | Resource management and scheduling; manages NVLink, GPU, and network resources in multi-GPU and multi-node environments. | 550.90.07 (must match the GPU driver version) |
| nv_peer_mem | Data transfer acceleration; enables GPUDirect RDMA to optimize the data path between GPUs and NICs. | nv_peer_mem was deprecated in CUDA 11.5; its replacement (nvidia_peermem) is integrated into the driver. |
| NCCL | A distributed communication library used to optimize data transfer efficiency in multi-GPU or multi-node environments. | 2.27.6 |
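To see which GPUDirect RDMA module a node currently loads, one option is to inspect the kernel module list. The following is a minimal sketch and not part of the Lite Server tooling. It assumes the modules appear in `lsmod` output as `nv_peer_mem` and `nvidia_peermem`; the helper name `peer_mem_status` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: report which peer-memory kernel module is loaded.
# peer_mem_status (hypothetical helper) reads `lsmod` output on stdin and
# prints one of: nvidia_peermem, nv_peer_mem, both, none.
peer_mem_status() {
    awk '
        $1 == "nvidia_peermem" { new = 1 }
        $1 == "nv_peer_mem"    { old = 1 }
        END {
            if (new && old) print "both"
            else if (new)   print "nvidia_peermem"
            else if (old)   print "nv_peer_mem"
            else            print "none"
        }'
}

# Run against the live module list only when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    echo "Loaded peer-memory module: $(lsmod | peer_mem_status)"
fi
```

After a successful upgrade, the expectation from the description above is that `nvidia_peermem` is reported rather than `nv_peer_mem`.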
Constraints
- Do not reset or power off the host or device during the software upgrade process. Doing so may cause the device to fail to boot or lead to an upgrade failure.
- Before upgrading the software package, ensure that no processes are using the GPUs on the node, including processes in containers that have the GPUs mapped.
- Use the driver and fabricmanager versions from the same software version list to ensure version compatibility.
- Currently, only upgrades to version 550 drivers are supported. Rollbacks are not supported, because the drivers previously installed by users do not share a unified official version to roll back to.
- Supported models: Ant1, Ant8, Hnt02, Lnt002, and Vnt1.
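The "no processes are occupying the node" constraint can be checked from the command line before a task is dispatched. The sketch below is one possible approach, not the plugin's own check: it counts the compute processes that `nvidia-smi` reports, and the helper name `count_gpu_processes` is hypothetical. It may not reveal every container that has a GPU mapped but idle, so treat a result of zero as a hint rather than a guarantee.

```shell
#!/usr/bin/env bash
# Sketch only: count compute processes that still hold a GPU.
# count_gpu_processes (hypothetical helper) reads the output of
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader
# on stdin and prints the number of non-blank lines (one line per process).
count_gpu_processes() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c '[^[:space:]]' || true
}

# Query the live node only when nvidia-smi is installed and working.
if command -v nvidia-smi >/dev/null 2>&1; then
    if apps=$(nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader 2>/dev/null); then
        busy=$(printf '%s\n' "$apps" | count_gpu_processes)
        if [ "$busy" -eq 0 ]; then
            echo "No compute processes found; the node looks idle."
        else
            echo "$busy compute process(es) still running; stop them before upgrading."
        fi
    else
        echo "nvidia-smi failed; cannot determine GPU usage."
    fi
fi
```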
Prerequisites
This operation depends on the Lite Server AI plugin pre-installed on the node. Install the plugin by referring to Installing the AI Plugin for a Lite Server.
Procedure
- Log in to the ModelArts console, open the Lite Servers page using the path for your console version, and click the Task Center tab.
  - New console: In the navigation pane, choose Resource Management > Lite Compute Resources > Lite Servers.
  - Old console: In the navigation pane, choose Resource Management > Lite Servers.
Figure 1 Task center
- Click Create Task in the upper left corner. On the displayed Job Templates page, locate Driver Component Upgrade and click Create Task.
Figure 2 Task template
- On the displayed page, enter the Name and Description, select the Task and Server Model, and click Select Node. After you select the nodes from the node list and click OK, the system dispatches a driver and firmware version query task to those nodes. Retrieving the actual driver and firmware information takes approximately one minute.
Table 2 Parameters for creating a task

| Parameter | Description |
|---|---|
| Name | Modify the auto-generated task name as required. |
| Description | Enter a description for the task to help with quick identification and tracking. |
| Task | Select Driver Upgrade. |
| Server Model | Select Ant1, Ant8, Hnt02, Lnt002, or Vnt1. |
| Select Node | Click Select Node to choose the nodes requiring driver or firmware upgrades from the list. You can batch select nodes or search for specific nodes using keywords. Click OK to confirm. |
| Select Driver and Firmware Versions | Select the target version for the driver components from the drop-down list. Ensure that the target version is compatible with your software to prevent upgrade failures or service interruptions. Rollbacks are not supported for this upgrade. Perform a thorough risk assessment and back up your data before proceeding. |
Verify the driver version:
nvidia-smi
- After selecting the target driver version, click Next to verify the upgrade details. Click OK to dispatch the upgrade task. Once the task is initiated, the entire upgrade process takes approximately one hour for Ant1 models and about 30 minutes for other models.
- During the upgrade, you can return to the Task Center page to monitor the execution status. Click a specific task name to access the task details page, where you can view detailed progress and logs.
- After the process completes, run the following command on the node to verify that the driver is loaded:
nvidia-smi
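Beyond reading the `nvidia-smi` banner by eye, the reported driver version can be compared against the target version on every GPU. The following is a minimal sketch under stated assumptions: the target `550.90.07` is taken from the software table above, and the helper name `check_driver_versions` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: confirm every GPU reports the expected driver version.
# Target version taken from the software table in this document.
TARGET_DRIVER="550.90.07"

# check_driver_versions (hypothetical helper) reads one driver version per
# line on stdin, as printed by
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader
# and succeeds only if at least one line is present and all lines equal $1.
check_driver_versions() {
    awk -v want="$1" '
        {
            gsub(/[[:space:]]/, "")      # strip stray whitespace
            if ($0 == "") next           # ignore blank lines
            n++
            if ($0 != want) bad++
        }
        END {
            if (n > 0 && bad == 0) exit 0
            exit 1
        }'
}

# Query the live node only when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null \
        | check_driver_versions "$TARGET_DRIVER"; then
        echo "All GPUs report driver $TARGET_DRIVER."
    else
        echo "Driver version mismatch or driver not loaded; inspect the nvidia-smi output."
    fi
fi
```

If a different target version was selected in the task, change `TARGET_DRIVER` accordingly.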