What Should I Do If an Error Occurs When Deploying a Service on the GPU Node?

Symptom

The following exceptions occur when services are deployed on the GPU nodes in a HUAWEI CLOUD CCE cluster:

  1. The GPU memory of containers cannot be queried.
  2. Seven GPU services are deployed, but only two of them can be accessed properly. Errors are reported during the startup of the remaining five services.
    • The CUDA versions of the two services that can be accessed properly are 10.1 and 10.0, respectively.
    • The CUDA versions of the failing services are also 10.0 and 10.1.
  3. Files named core.* are found in the GPU service containers. No such files existed in any of the previous deployments.

Fault Locating

  1. The NVIDIA driver installed by the gpu-beta add-on is too old. Downloading and installing a newer driver rectifies the fault.
  2. The workloads do not declare the GPU resources they require.
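
The second cause can be checked mechanically. The following is a minimal sketch, assuming the workload's pod spec is available as a Python dictionary and that GPUs are requested under the extended resource name nvidia.com/gpu (the name used by the NVIDIA device plugin for Kubernetes; confirm the name used by your add-on version). The helper declares_gpu and the sample specs are hypothetical and exist only for illustration.

```python
# Assumption: GPUs are requested through the "nvidia.com/gpu" extended resource.
GPU_RESOURCE = "nvidia.com/gpu"


def declares_gpu(pod_spec: dict) -> bool:
    """Return True if any container in the pod spec sets a nonzero GPU limit."""
    for container in pod_spec.get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        # The count may appear as an integer or as a string in a manifest.
        if str(limits.get(GPU_RESOURCE, "0")) not in ("0", ""):
            return True
    return False


# Hypothetical pod spec fragments, for illustration only.
with_gpu = {"containers": [{"name": "app", "resources": {"limits": {GPU_RESOURCE: 1}}}]}
without_gpu = {"containers": [{"name": "app", "resources": {}}]}

print(declares_gpu(with_gpu))     # True
print(declares_gpu(without_gpu))  # False
```

A workload for which this check returns False does not declare any GPU resources, which matches the second cause listed above.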

Suggested Solution

After the gpu-beta add-on is installed on the node, the nvidia-smi command-line tool is stored in the /var/paas/nvidia/bin directory. If the tool is still unavailable after the add-on is installed, the most likely cause is that the NVIDIA driver failed to install. Check whether the NVIDIA driver was downloaded successfully. (The driver file can be found in the /var/paas/nvidia directory.)
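
The two checks described above can be scripted and run on the GPU node. This is a rough sketch, not an official tool: the function name check_gpu_addon is hypothetical, and it assumes that the downloaded driver is a regular file placed directly under /var/paas/nvidia (the exact file name depends on the driver version).

```python
from pathlib import Path


def check_gpu_addon(base_dir: str = "/var/paas/nvidia") -> dict:
    """Report whether nvidia-smi and any driver files exist under base_dir."""
    base = Path(base_dir)
    smi = base / "bin" / "nvidia-smi"
    # Assumption: the driver downloaded by the add-on is a regular file
    # directly under base_dir; subdirectories such as bin/ are ignored.
    if base.is_dir():
        driver_files = sorted(p.name for p in base.glob("*") if p.is_file())
    else:
        driver_files = []
    return {
        "nvidia_smi_present": smi.is_file(),
        "driver_files": driver_files,
    }


if __name__ == "__main__":
    print(check_gpu_addon())
```

If nvidia_smi_present is False and driver_files is empty, the driver was probably never downloaded, which points to an incorrect driver address.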

If the driver address is incorrect, uninstall the add-on, reinstall it, and configure the correct address.

You are advised to store the NVIDIA driver in an OBS bucket and set the bucket policy to public read, so that the add-on can download the driver without credentials.

Object Storage Service (OBS) is an object-based storage service provided by HUAWEI CLOUD. It provides massive, secure, highly reliable, and low-cost data storage capabilities. For details about how to create a bucket and set the bucket policy, see Creating a Bucket, Uploading a File, and Configuring a Standard Bucket Policy.
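
Before reinstalling the add-on, it can help to confirm that the driver address is readable without credentials. The sketch below assumes that the driver address is a plain HTTP(S) URL and that the server answers HEAD requests; the function name driver_url_is_public is hypothetical. It uses only the Python standard library.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def driver_url_is_public(url: str, timeout: float = 10.0) -> bool:
    """Return True if an anonymous HTTP HEAD request to url succeeds."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return False  # not a usable HTTP(S) download address
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors such as 403 or 404, DNS failures, and timeouts land here.
        return False
```

Call the function with the driver address configured for the add-on. A False result for an address that looks correct usually means that the object is missing or that the bucket policy does not allow public read.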

Submitting a Service Ticket

If the problem persists, submit a service ticket.