How Do I Rectify Failures When the NVIDIA Driver Is Used to Start Containers on GPU Nodes?

Did a Resource Scheduling Failure Event Occur on a Cluster Node?

Symptom

A node is running properly and has GPU resources. However, the following error information is displayed:

0/9 nodes are available: 9 insufficient nvidia.com/gpu
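To see what the scheduler sees, it can help to list the nodes that currently advertise no allocatable nvidia.com/gpu resource. The following is a minimal sketch, assuming kubectl is installed and configured for the cluster. The helper name nodes_without_gpu is hypothetical and not part of any add-on.

```shell
# Print the names of nodes whose allocatable nvidia.com/gpu is missing or zero.
# Reads a two-column table (NAME, GPU) with a header row on standard input.
nodes_without_gpu() {
    awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# Only query the cluster when kubectl is available (assumption: a valid kubeconfig).
if command -v kubectl >/dev/null 2>&1; then
    kubectl get nodes \
        -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
        | nodes_without_gpu
fi
```

A node listed here is not offering GPUs to the scheduler, even if GPU hardware is installed on it.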

Analysis

Check whether the NVIDIA label is attached to the node.
Check whether the NVIDIA driver is running properly.
Log in to the node where the add-on is running and view the driver installation log at the following path:
```
/var/paas/nvidia/nvidia_installer.log
```
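The installation log can be long, so filtering it for likely problem lines is often a quicker first pass. The sketch below assumes that failures are logged with the words "error" or "fail". That keyword list is a guess, not something the add-on documents, and the function name scan_installer_log is hypothetical.

```shell
# Print lines that mention errors or failures, with line numbers, from an installer log.
# The default path is the one given above; the keyword list is an assumption.
scan_installer_log() {
    log_file="${1:-/var/paas/nvidia/nvidia_installer.log}"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 1
    fi
    grep -n -i -E 'error|fail' "$log_file" \
        || echo "no error or failure lines found in $log_file"
}

# Usage on the node: scan_installer_log
# Usage with another file: scan_installer_log /path/to/copy.log
```

If the filter finds nothing, read the full log, because the installer may report problems in other wording.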
View standard output logs of the NVIDIA container.

Filter the container ID by running the following command:
```
docker ps -a | grep nvidia
```
View logs by running the following command:
```
docker logs <container-ID>
```
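The two commands above can be combined so that the logs of every NVIDIA-related container are printed in one pass. This is a minimal sketch, assuming the node uses Docker as its container runtime. The helper name nvidia_container_ids and the 50-line tail limit are illustrative choices.

```shell
# Extract container IDs (first column) from `docker ps -a` output on standard input,
# keeping only lines that mention "nvidia".
nvidia_container_ids() {
    grep nvidia | awk '{ print $1 }'
}

# Only run when docker is available on this machine.
if command -v docker >/dev/null 2>&1; then
    docker ps -a | nvidia_container_ids | while read -r id; do
        echo "=== logs for container $id ==="
        # Show the last 50 lines of standard output and standard error.
        docker logs --tail 50 "$id" 2>&1
    done
fi
```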

What Should I Do If the NVIDIA Version Reported by a Service and the CUDA Version Do Not Match?

Run the following command to check the CUDA version in the container:

cat /usr/local/cuda/version.txt

Check whether the CUDA versions supported by the NVIDIA driver on the node that hosts the container include the CUDA version used in the container.
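The comparison can be scripted as follows. This is a sketch under several assumptions: version.txt contains a line such as "CUDA Version 10.1.243", the nvidia-smi header reports the highest CUDA version that the installed driver supports (recent drivers do, older ones may not), nvidia-smi is reachable from where the script runs, and GNU sort with the -V option is available. The function names are hypothetical.

```shell
# Succeed when version $1 is less than or equal to version $2 (requires GNU sort -V).
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Extract "major.minor" from the first input line that contains "CUDA Version".
cuda_major_minor() {
    sed -n 's/.*CUDA Version[: ]*\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Compare only when both sources of information are present.
if [ -r /usr/local/cuda/version.txt ] && command -v nvidia-smi >/dev/null 2>&1; then
    container_cuda=$(cuda_major_minor < /usr/local/cuda/version.txt)
    driver_cuda=$(nvidia-smi | cuda_major_minor)
    if [ -z "$container_cuda" ] || [ -z "$driver_cuda" ]; then
        echo "could not determine one of the CUDA versions"
    elif version_le "$container_cuda" "$driver_cuda"; then
        echo "OK: container CUDA $container_cuda <= driver-supported CUDA $driver_cuda"
    else
        echo "MISMATCH: container CUDA $container_cuda > driver-supported CUDA $driver_cuda"
    fi
fi
```

A mismatch suggests either upgrading the node driver or using a container image built against an older CUDA version.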

How Do I Upgrade the NVIDIA Driver?

To upgrade the NVIDIA driver to a later version, perform the following steps:

Upgrade the GPU add-on.

Log in to the CCE console. In the navigation pane, choose Add-ons. On the Add-on Instance tab page, click Upgrade under gpu-beta.
(Mandatory) Restart the node.

Restart the node on the ECS console. Log in to the HUAWEI CLOUD management console, select the region where the ECS is located, and choose Service List > Computing > Elastic Cloud Server. In the ECS list, locate the target node, and click More > Restart in the Operation column.
