Why Does a Driver Installed Using the CCE AI Suite (NVIDIA GPU) Add-on Fail to Execute Encoding and Decoding Tasks?
Symptom
After a driver is installed using certain versions of the CCE AI Suite (NVIDIA GPU) add-on, encoding and decoding tasks may fail with the error message "Cuda Encode error: ......". The affected add-on and cluster versions are as follows:
- In a cluster of v1.27 or earlier, the version of the installed add-on is earlier than 2.2.1.
- In a cluster of v1.28 or later, the version of the installed add-on is earlier than 2.8.1.
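The version rule above can be expressed as a small shell check. This is a minimal sketch, not an official tool: the function names version_lt and is_affected are hypothetical, the versions are passed in by hand as plain dotted numbers without a leading "v", and the comparison assumes a `sort` that supports the `-V` (version sort) option, as GNU coreutils does.

```shell
#!/bin/sh
# Returns 0 (true) if version $1 is strictly earlier than version $2.
# Assumes `sort -V` (version sort) is available.
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# is_affected CLUSTER_VERSION ADDON_VERSION
# Returns 0 (true) if the combination falls in the affected range.
is_affected() {
    cluster="$1"
    addon="$2"
    if version_lt "$cluster" "1.28"; then
        # Cluster v1.27 or earlier: add-on versions earlier than 2.2.1 are affected.
        version_lt "$addon" "2.2.1"
    else
        # Cluster v1.28 or later: add-on versions earlier than 2.8.1 are affected.
        version_lt "$addon" "2.8.1"
    fi
}
```

For example, `is_affected 1.27 2.1.0 && echo "upgrade needed"` reports that an upgrade is needed, while `is_affected 1.28 2.8.1` returns a non-zero status.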
Possible Cause
This problem typically arises due to a missing configuration file (10_nvidia.json) on the GPU node where the pod runs. To locate the fault, perform the following operations:
- Log in to the GPU node.
- Check whether /usr/share/glvnd/egl_vendor.d/10_nvidia.json exists.
ls /usr/share/glvnd/egl_vendor.d/10_nvidia.json
If information similar to the following is displayed, the file does not exist:
ls: cannot access '/usr/share/glvnd/egl_vendor.d/10_nvidia.json': No such file or directory
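If the check needs to be repeated on several GPU nodes, it can be wrapped in a small function. The following is a minimal sketch: the function name check_egl_vendor_file and the EGL_VENDOR_FILE variable are hypothetical, and the default path is the one used in the steps above.

```shell
#!/bin/sh
# Default path taken from the steps above; override EGL_VENDOR_FILE
# or pass a path as the first argument to check a different location.
EGL_VENDOR_FILE="${EGL_VENDOR_FILE:-/usr/share/glvnd/egl_vendor.d/10_nvidia.json}"

# check_egl_vendor_file [PATH]
# Prints whether the file is present and returns 0 if it is, 1 if it is not.
check_egl_vendor_file() {
    file="${1:-$EGL_VENDOR_FILE}"
    if [ -f "$file" ]; then
        echo "present: $file"
        return 0
    fi
    echo "missing: $file"
    return 1
}
```

Running `check_egl_vendor_file` on the node prints one line and sets the exit status, so the result can be used in a script, for example `check_egl_vendor_file || echo "upgrade the add-on and restart the node"`.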
Solution
If the 10_nvidia.json file does not exist on the GPU node, upgrade CCE AI Suite (NVIDIA GPU) to 2.2.1 or later (for clusters of v1.27 or earlier) or to 2.8.1 or later (for clusters of v1.28 or later), and then restart the node to reset the driver. Before restarting the GPU node, drain it to prevent services from being affected.
- Log in to the CCE console and click the cluster name to access the cluster console.
- In the navigation pane, choose Add-ons. In the right pane, find the CCE AI Suite (NVIDIA GPU) add-on and click Upgrade. In the window that slides out from the right, click the version number next to Add-on Versions and select the target version from the drop-down list.
After selecting the version, click OK in the lower right corner. When the add-on status changes from Upgrading to Running, the add-on has been upgraded.
- Restart the GPU node. Before restarting it, drain the pods on the node. For details, see Draining a Node. When draining a GPU node, make sure that enough GPU resources are available on other nodes for the evicted pods to be rescheduled. Otherwise, the pods may fail to be scheduled and services may be interrupted.
- In the navigation pane, choose Nodes. In the right pane, click the Nodes tab and find the GPU node.
- Click the GPU node name. On the page displayed, click Restart in the upper right corner. After the GPU node is restarted, wait for 5 to 10 minutes for the driver to reset.
- Log in to the node and check whether /usr/share/glvnd/egl_vendor.d/10_nvidia.json exists.
ls /usr/share/glvnd/egl_vendor.d/10_nvidia.json
If the file exists, its path is displayed.
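The drain step in the procedure above can also be performed from the command line. The following is a minimal sketch that assumes kubectl is configured for the cluster; it is not the procedure described in Draining a Node. The function names drain_gpu_node and uncordon_gpu_node, the KUBECTL variable, and the node name gpu-node-1 are hypothetical. The `--delete-emptydir-data` flag requires kubectl 1.20 or later. Because `kubectl drain` marks the node unschedulable, the sketch also includes an uncordon helper to run after the restart.

```shell
#!/bin/sh
# KUBECTL can be set to another binary, or to `echo` for a dry run
# that only prints the command that would be executed.

# drain_gpu_node NODE_NAME
# Evicts pods from the node and marks it unschedulable before a restart.
drain_gpu_node() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: drain_gpu_node NODE_NAME" >&2
        return 2
    fi
    # DaemonSet-managed pods cannot be evicted, so they are ignored;
    # pods using emptyDir volumes lose that data when evicted.
    ${KUBECTL:-kubectl} drain "$node" --ignore-daemonsets --delete-emptydir-data
}

# uncordon_gpu_node NODE_NAME
# Makes the node schedulable again after it has restarted.
uncordon_gpu_node() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: uncordon_gpu_node NODE_NAME" >&2
        return 2
    fi
    ${KUBECTL:-kubectl} uncordon "$node"
}
```

For example, `KUBECTL=echo drain_gpu_node gpu-node-1` prints the drain command without running it, which is a convenient way to review the command before executing it against a real node.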