What Should I Do If an Error Occurs When I Deploy a Service on a GPU Node?

Symptom

The following exceptions occur when services are deployed on GPU nodes in a CCE cluster:

- The GPU memory of containers cannot be obtained.
- Seven GPU services are deployed, but only two of them can be accessed properly. Errors are reported during the startup of the remaining five services.
  - The CUDA versions of the two services that can be accessed properly are 10.1 and 10.0, respectively.
  - The CUDA versions of the failing services are also 10.0 and 10.1.
- Files named core.* are found in the GPU service containers. No such files existed in any of the previous deployments.
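
To confirm the last symptom, you can list the core.* files in a directory inside the container. The following is a minimal sketch, not an official procedure: `list_core_files` is a hypothetical helper, and the directory to inspect is an assumption, because the location of core files depends on the working directory of the service and on the core dump settings of the node.

```shell
#!/bin/sh
# list_core_files DIR: hypothetical helper that prints regular files named
# core.* located directly in DIR (subdirectories are not searched).
list_core_files() {
    find "$1" -maxdepth 1 -type f -name 'core.*'
}
```

For example, `list_core_files /` checks the root directory of the container. If nothing is printed, no core.* files exist in that directory.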

Fault Locating

The possible causes are as follows:
- The driver version used by the CCE AI Suite (NVIDIA GPU) add-on is outdated. After a new driver is downloaded and installed, the fault is rectified.
- The GPU requirements are not specified in the workloads.
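
As an illustration of specifying GPU requirements in a workload, the sketch below writes a minimal pod manifest that requests one GPU. It rests on several assumptions: `write_gpu_pod_manifest` is a hypothetical helper, the pod name `gpu-demo` is a placeholder, and `nvidia.com/gpu` is the resource name commonly exposed by the NVIDIA device plugin. Verify the resource name and supported values against your cluster before use.

```shell
#!/bin/sh
# write_gpu_pod_manifest FILE IMAGE: hypothetical helper that writes a
# minimal pod manifest to FILE, using IMAGE as the container image.
write_gpu_pod_manifest() {
    cat > "$1" <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo            # placeholder name
spec:
  containers:
  - name: gpu-demo
    image: $2
    resources:
      limits:
        nvidia.com/gpu: 1   # assumed GPU resource name; requests one GPU
EOF
}
```

You could then run `write_gpu_pod_manifest gpu-demo.yaml <your-image>` and apply the resulting file with `kubectl apply -f gpu-demo.yaml`.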

Suggested Solution

After the gpu-beta (gpu-device-plugin) add-on is installed on a node, nvidia-smi is installed automatically. If an error is reported during GPU deployment, the cause is typically an NVIDIA driver installation failure. To check whether the NVIDIA driver has been downloaded and installed, run nvidia-smi from the directory that matches your environment:

GPU node:
- If the add-on version is earlier than 2.0.0, run the following command:
```
cd /opt/cloud/cce/nvidia/bin && ./nvidia-smi
```
- If the add-on version is 2.0.0 or later, run the following command:
```
cd /usr/local/nvidia/bin && ./nvidia-smi
```
Pod:
- If the cluster version is v1.27 or earlier, run the following command:
```
cd /usr/local/nvidia/bin && ./nvidia-smi
```
- If the cluster version is v1.28 or later, run the following command:
```
cd /usr/bin && ./nvidia-smi
```

If GPU information is returned, the device is available and the add-on has been installed.
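
If you are unsure which add-on or cluster version applies, the check can be sketched as a small script that tries each directory listed above in turn. This is an illustrative sketch only: `find_nvidia_smi` and `check_gpu_driver` are hypothetical helper names, and the search order is an arbitrary assumption.

```shell
#!/bin/sh
# Candidate directories taken from the commands above; the order is arbitrary.
CANDIDATE_DIRS="/opt/cloud/cce/nvidia/bin /usr/local/nvidia/bin /usr/bin"

# find_nvidia_smi DIR...: print the path of the first executable nvidia-smi
# found in the given directories, or return 1 if none is found.
find_nvidia_smi() {
    for dir in "$@"; do
        if [ -x "$dir/nvidia-smi" ]; then
            printf '%s\n' "$dir/nvidia-smi"
            return 0
        fi
    done
    return 1
}

# check_gpu_driver: run the first nvidia-smi found in CANDIDATE_DIRS.
check_gpu_driver() {
    # CANDIDATE_DIRS is intentionally unquoted so that it splits into words.
    smi=$(find_nvidia_smi $CANDIDATE_DIRS) || {
        echo "nvidia-smi not found; the driver may not be installed" >&2
        return 1
    }
    "$smi"
}
```

Calling `check_gpu_driver` on a GPU node or in a pod prints the nvidia-smi output if the binary exists in one of the candidate directories, and reports an error otherwise.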

If the driver address is incorrect, uninstall the add-on, reinstall it, and configure the correct address.

You are advised to store the NVIDIA driver in an OBS bucket and set the bucket policy to public read.
