CCE AI Suite (NVIDIA GPU)
Add-on Overview
CCE AI Suite (NVIDIA GPU) is a device management add-on that supports GPUs in containers. To use GPU nodes in a cluster, this add-on must be installed.
Add-on Parameters
Parameter | Mandatory | Type | Description |
---|---|---|---|
basic | Yes | object | Basic add-on configuration parameters |
custom | Yes | Table 3 object | Custom parameters |

Table 2 basic parameters

Parameter | Mandatory | Type | Description |
---|---|---|---|
cluster_version | No | String | CCE cluster version |
device_version | Yes | String | Add-on version |
driver_version | Yes | String | Image tag of the add-on pod where the driver is installed. Generally, the value is the same as that of device_version. |
obs_url | Yes | String | GPU driver address, used when the GPU driver is downloaded from the default driver address |
swr_addr | Yes | String | Image repository address |
swr_user | Yes | String | Tenant path of an image repository |
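
As a rough client-side illustration of the Mandatory column for the basic object, the sketch below lists which required keys are absent from a candidate dictionary. The names `REQUIRED_BASIC` and `missing_basic_fields` are hypothetical helpers introduced here, not part of any CCE SDK, and the check only mirrors the table; the service may apply further validation.

```python
# Mandatory keys of the "basic" object, copied from the parameter table.
# REQUIRED_BASIC and missing_basic_fields are hypothetical names for this sketch.
REQUIRED_BASIC = ("device_version", "driver_version", "obs_url", "swr_addr", "swr_user")


def missing_basic_fields(basic):
    """Return the mandatory "basic" keys that are absent or are not strings."""
    return [key for key in REQUIRED_BASIC if not isinstance(basic.get(key), str)]


# Usage: cluster_version is optional, so only the three absent mandatory keys are reported.
candidate = {"cluster_version": "v1.27", "device_version": "2.0.69", "driver_version": "2.0.69"}
print(missing_basic_fields(candidate))
```

Keys that are not listed in the table are ignored rather than rejected, so additional fields pass through unchanged.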

Table 3 custom parameters

Parameter | Mandatory | Type | Description |
---|---|---|---|
compatible_with_legacy_api | No | Bool | API compatibility switch. If true, the add-on supports the GPU native mode and xGPU virtualization. Default value: false |
component_schedulername | Yes | String | Name of the scheduler used by the add-on. Default value: default-scheduler |
disable_mount_path_v1 | No | Bool | If true, /opt/cloud/cce/nvidia is not mounted to the /usr/lib/nvidia directory of a GPU container. Default value: false |
disable_nvidia_gsp | No | Bool | If true, the GPU GSP firmware is disabled. Default value: true |
driver_mount_paths | No | String | Driver file directories that are automatically mounted to a GPU container. Default value: "bin,lib64" |
enable_fault_isolation | No | Bool | If true, the add-on sets a GPU to unavailable after detecting a hardware fault or driver issue on it. Default value: true |
enable_health_monitoring | No | Bool | If true, the add-on detects GPU hardware faults and driver issues. Default value: true |
enable_metrics_monitoring | No | Bool | If true, the add-on collects GPU metrics and reports them to Prometheus. Default value: true |
enable_simple_lib64_mount | No | Bool | If true, only the libxxx.so.x file is mounted to a container. Default value: true |
enable_xgpu | No | Bool | Whether to enable xGPU virtualization. Default value: false |
gpu_driver_config | No | Map | GPU driver configuration for a single node pool. Default value: {} |
health_check_xids_v2 | No | String | Range of GPU errors (Xids) covered by the add-on health checks. Default value: "74,79" |
inject_ld_Library_path | No | String | Value of the LD_LIBRARY_PATH environment variable that the add-on automatically injects into a GPU container. Default value: "" |
lib64_container_paths | No | String | Mount paths of NVIDIA lib64 in a GPU container. Default value: "/usr/lib64,/usr/lib/x86_64-linux-gnu" |
metrics_delete_interval | No | int | Timeout threshold, in milliseconds, after which a metric that cannot be obtained is deleted. Default value: 30000 |
metrics_monitor_interval | No | int | Interval for obtaining metrics, in milliseconds. Default value: 15000 |
nvidia_driver_download_url | Yes | String | Path for downloading the NVIDIA driver. Default value: "" |

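
The documented defaults for the custom object can be gathered in one place and merged with caller overrides. The following is a minimal sketch under that assumption: `CUSTOM_DEFAULTS` and `build_custom` are hypothetical names introduced for illustration, the default values are copied from the table, and rejecting unknown keys is a design choice of this sketch rather than documented service behavior.

```python
import copy

# Default values copied from the custom parameter table.
# CUSTOM_DEFAULTS and build_custom are hypothetical names for this sketch.
CUSTOM_DEFAULTS = {
    "compatible_with_legacy_api": False,
    "component_schedulername": "default-scheduler",
    "disable_mount_path_v1": False,
    "disable_nvidia_gsp": True,
    "driver_mount_paths": "bin,lib64",
    "enable_fault_isolation": True,
    "enable_health_monitoring": True,
    "enable_metrics_monitoring": True,
    "enable_simple_lib64_mount": True,
    "enable_xgpu": False,
    "gpu_driver_config": {},
    "health_check_xids_v2": "74,79",
    "inject_ld_Library_path": "",
    "lib64_container_paths": "/usr/lib64,/usr/lib/x86_64-linux-gnu",
    "metrics_delete_interval": 30000,
    "metrics_monitor_interval": 15000,
    "nvidia_driver_download_url": "",
}


def build_custom(overrides=None):
    """Return the documented defaults with the caller's overrides applied."""
    overrides = overrides or {}
    # Catch misspelled parameter names early (a choice made by this sketch).
    unknown = sorted(set(overrides) - set(CUSTOM_DEFAULTS))
    if unknown:
        raise KeyError("unknown custom parameter(s): " + ", ".join(unknown))
    custom = copy.deepcopy(CUSTOM_DEFAULTS)  # keep the shared defaults untouched
    custom.update(overrides)
    return custom


# Usage: enable xGPU virtualization and keep every other default.
custom = build_custom({"enable_xgpu": True})
print(custom["enable_xgpu"], custom["metrics_monitor_interval"])
```

The deep copy keeps the nested gpu_driver_config map from being shared between separate results.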
Example Request
    {
        "kind": "Addon",
        "apiVersion": "v3",
        "metadata": {
            "name": "gpu-beta"
        },
        "spec": {
            "clusterID": "80c9e306-***-***-***-0255ac100043",
            "version": "2.0.69",
            "addonTemplateName": "gpu-beta",
            "values": {
                "basic": {
                    "cluster_version": "v1.27",
                    "device_version": "2.0.69",
                    "driver_version": "2.0.69",
                    "obs_url": "***",
                    "region": "***",
                    "swr_addr": "***",
                    "swr_user": "***"
                },
                "custom": {
                    "compatible_with_legacy_api": true,
                    "component_schedulername": "kube-scheduler",
                    "disable_mount_path_v1": false,
                    "disable_nvidia_gsp": true,
                    "driver_mount_paths": "bin,lib64",
                    "enable_fault_isolation": true,
                    "enable_health_monitoring": true,
                    "enable_metrics_monitoring": true,
                    "enable_simple_lib64_mount": true,
                    "enable_xgpu": true,
                    "gpu_driver_config": {},
                    "health_check_xids_v2": "74,79",
                    "inject_ld_Library_path": "",
                    "lib64_container_paths": "/usr/lib64,/usr/lib/x86_64-linux-gnu",
                    "metrics_delete_interval": 30000,
                    "metrics_monitor_interval": 15000,
                    "nvidia_driver_download_url": ""
                }
            }
        }
    }
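
A request body of this shape can also be assembled from native data structures and serialized, which helps avoid hand-written JSON mistakes such as trailing commas that strict parsers reject. The sketch below assumes only the field layout shown in the example. `build_addon_request` is a hypothetical helper, the `"***"` and `<cluster-id>` values are placeholders, and no endpoint or authentication is shown because the source does not specify them.

```python
import json


def build_addon_request(cluster_id, version, basic, custom, template="gpu-beta"):
    """Assemble a request body with the same layout as the example request.

    Hypothetical helper for illustration; it performs no validation.
    """
    return {
        "kind": "Addon",
        "apiVersion": "v3",
        "metadata": {"name": template},
        "spec": {
            "clusterID": cluster_id,
            "version": version,
            "addonTemplateName": template,
            "values": {"basic": basic, "custom": custom},
        },
    }


# Usage with placeholder values; replace them with real ones before sending.
body = build_addon_request(
    cluster_id="<cluster-id>",
    version="2.0.69",
    basic={
        "cluster_version": "v1.27",
        "device_version": "2.0.69",
        "driver_version": "2.0.69",
        "obs_url": "***",
        "swr_addr": "***",
        "swr_user": "***",
    },
    custom={"enable_xgpu": True, "component_schedulername": "default-scheduler"},
)
payload = json.dumps(body, indent=2)  # json.dumps always emits strict JSON
print(payload)
```

Serializing with `json.dumps` also converts Python `True` and `False` to the JSON literals `true` and `false` used in the example.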