
CCE AI Suite (NVIDIA GPU)

Updated on 2024-11-06 GMT+08:00

Add-on Overview

CCE AI Suite (NVIDIA GPU) is a device management add-on that supports GPUs in containers. To use GPU nodes in a cluster, this add-on must be installed.

Add-on Parameters

Table 1 Parameters

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| basic | Yes | object | Basic add-on configuration parameters |
| custom | Yes | object (see Table 3) | Custom parameters |

Table 2 Configuration of basic

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| cluster_version | No | String | CCE cluster version |
| device_version | Yes | String | Add-on version |
| driver_version | Yes | String | Image tag of the add-on pod in which the driver is installed. Generally, the value is the same as that of device_version. |
| obs_url | Yes | String | GPU driver address, used when the GPU driver is downloaded from the default driver address |
| swr_addr | Yes | String | Image repository address |
| swr_user | Yes | String | Tenant path of an image repository |
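
The mandatory fields in Table 2 can be checked before a request is sent. The following is a minimal sketch in Python, not part of the add-on or any Huawei Cloud SDK; the helper names `build_basic` and `missing_basic_keys` are hypothetical, and the sketch assumes that driver_version mirrors device_version, which Table 2 describes as the usual case.

```python
from typing import Optional

# Mandatory keys of "basic", copied from Table 2 (cluster_version is optional).
BASIC_MANDATORY = ("device_version", "driver_version", "obs_url", "swr_addr", "swr_user")


def missing_basic_keys(basic: dict) -> list:
    """Return the mandatory Table 2 keys that are absent from `basic`."""
    return [key for key in BASIC_MANDATORY if key not in basic]


def build_basic(addon_version: str, obs_url: str, swr_addr: str, swr_user: str,
                cluster_version: Optional[str] = None) -> dict:
    """Build a `basic` object (hypothetical helper).

    Assumption: driver_version is set to the same value as device_version.
    """
    basic = {
        "device_version": addon_version,
        "driver_version": addon_version,
        "obs_url": obs_url,
        "swr_addr": swr_addr,
        "swr_user": swr_user,
    }
    if cluster_version is not None:
        basic["cluster_version"] = cluster_version
    return basic


# "***" stands in for masked, environment-specific values.
basic = build_basic("2.0.69", "***", "***", "***", cluster_version="v1.27")
print(missing_basic_keys(basic))
```

An empty list means every mandatory field is present; any names in the list still need to be supplied.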

Table 3 Configuration of custom

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| compatible_with_legacy_api | No | Bool | API compatibility switch. Default value: false. true: The add-on supports the GPU native mode and xGPU virtualization. |
| component_schedulername | Yes | String | Name of the scheduler used by the add-on. Default value: default-scheduler |
| disable_mount_path_v1 | No | Bool | Default value: false. true: /opt/cloud/cce/nvidia is not mounted to the /usr/lib/nvidia directory of a GPU container. |
| disable_nvidia_gsp | No | Bool | Default value: true. true: The GPU GSP firmware is disabled. |
| driver_mount_paths | No | String | Driver file directories that need to be automatically mounted to a GPU container. Default value: "bin,lib64" |
| enable_fault_isolation | No | Bool | Default value: true. true: The add-on detects hardware faults or driver issues of a GPU and then sets the GPU to be unavailable. |
| enable_health_monitoring | No | Bool | Default value: true. true: The add-on detects hardware faults or driver issues of a GPU. |
| enable_metrics_monitoring | No | Bool | Default value: true. true: The add-on collects GPU metrics and reports these metrics to Prometheus. |
| enable_simple_lib64_mount | No | Bool | Default value: true. true: Only the libxxx.so.x file is mounted to a container. |
| enable_xgpu | No | Bool | Whether to enable xGPU virtualization. Default value: false |
| gpu_driver_config | No | Map | Configurations of the GPU driver for a single node pool. Default value: {} |
| health_check_xids_v2 | No | String | GPU error range for the add-on health checks. Default value: "74,79" |
| inject_ld_Library_path | No | String | Value of the LD_LIBRARY_PATH environment variable automatically injected by the add-on into a GPU container. Default value: "" |
| lib64_container_paths | No | String | Mount paths of NVIDIA lib64 in a GPU container. Default value: "/usr/lib64,/usr/lib/x86_64-linux-gnu" |
| metrics_delete_interval | No | int | Timeout threshold, in milliseconds, for deleting a metric when the metric cannot be obtained. Default value: 30000 |
| metrics_monitor_interval | No | int | Interval for obtaining metrics, in milliseconds. Default value: 15000 |
| nvidia_driver_download_url | Yes | String | Path for downloading the NVIDIA driver. Default value: "" |
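
Because most parameters in Table 3 have defaults, a `custom` object can be produced by overriding only the values that differ. The sketch below is one way to do this in Python; the names `CUSTOM_DEFAULTS` and `build_custom` are hypothetical, and the strict type check is an assumption added for illustration rather than documented behavior of the add-on.

```python
import copy

# Default values copied from Table 3.
CUSTOM_DEFAULTS = {
    "compatible_with_legacy_api": False,
    "component_schedulername": "default-scheduler",
    "disable_mount_path_v1": False,
    "disable_nvidia_gsp": True,
    "driver_mount_paths": "bin,lib64",
    "enable_fault_isolation": True,
    "enable_health_monitoring": True,
    "enable_metrics_monitoring": True,
    "enable_simple_lib64_mount": True,
    "enable_xgpu": False,
    "gpu_driver_config": {},
    "health_check_xids_v2": "74,79",
    "inject_ld_Library_path": "",
    "lib64_container_paths": "/usr/lib64,/usr/lib/x86_64-linux-gnu",
    "metrics_delete_interval": 30000,
    "metrics_monitor_interval": 15000,
    "nvidia_driver_download_url": "",
}


def build_custom(**overrides) -> dict:
    """Return the Table 3 defaults with `overrides` applied (hypothetical helper)."""
    unknown = sorted(set(overrides) - set(CUSTOM_DEFAULTS))
    if unknown:
        raise KeyError("unknown custom parameter(s): " + ", ".join(unknown))
    for key, value in overrides.items():
        expected = type(CUSTOM_DEFAULTS[key])
        # Exact type match, so a bool is not accepted where an int is expected.
        if type(value) is not expected:
            raise TypeError(f"{key} expects {expected.__name__}, got {type(value).__name__}")
    custom = copy.deepcopy(CUSTOM_DEFAULTS)  # keep the shared defaults untouched
    custom.update(overrides)
    return custom


custom = build_custom(enable_xgpu=True, compatible_with_legacy_api=True)
print(custom["enable_xgpu"], custom["metrics_monitor_interval"])
```

Rejecting unknown names catches typos such as `enable_gpu` early, before they are silently ignored or rejected by the service.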

Example Request

{
  "kind": "Addon",
  "apiVersion": "v3",
  "metadata": {
    "name": "gpu-beta",
  },
  "spec": {
    "clusterID": "80c9e306-***-***-***-0255ac100043",
    "version": "2.0.69",
    "addonTemplateName": "gpu-beta",
    "values": {
      "basic": {
        "cluster_version": "v1.27",
        "device_version": "2.0.69",
        "driver_version": "2.0.69",
        "obs_url": "***",
        "region": "***",
        "swr_addr": "***",
        "swr_user": "***"
      },
      "custom": {
        "compatible_with_legacy_api": true,
        "component_schedulername": "kube-scheduler",
        "disable_mount_path_v1": false,
        "disable_nvidia_gsp": true,
        "driver_mount_paths": "bin,lib64",
        "enable_fault_isolation": true,
        "enable_health_monitoring": true,
        "enable_metrics_monitoring": true,
        "enable_simple_lib64_mount": true,
        "enable_xgpu": true,
        "gpu_driver_config": {},
        "health_check_xids_v2": "74,79",
        "inject_ld_Library_path": "",
        "lib64_container_paths": "/usr/lib64,/usr/lib/x86_64-linux-gnu",
        "metrics_delete_interval": 30000,
        "metrics_monitor_interval": 15000,
        "nvidia_driver_download_url": ""
      }
    }
  }
}
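
A request body with this layout can also be generated programmatically, which avoids hand-editing JSON. The following Python sketch uses a hypothetical helper, `build_addon_request`, and stops at serialization: the endpoint and authentication are not covered in this section, so they are deliberately left out. The "***" values are placeholders for masked, environment-specific fields.

```python
import json


def build_addon_request(cluster_id: str, version: str, basic: dict, custom: dict) -> dict:
    """Assemble a request body with the same layout as the example request."""
    return {
        "kind": "Addon",
        "apiVersion": "v3",
        "metadata": {"name": "gpu-beta"},
        "spec": {
            "clusterID": cluster_id,
            "version": version,
            "addonTemplateName": "gpu-beta",
            "values": {"basic": basic, "custom": custom},
        },
    }


body = build_addon_request(
    cluster_id="***",
    version="2.0.69",
    basic={
        "device_version": "2.0.69",
        "driver_version": "2.0.69",
        "obs_url": "***",
        "swr_addr": "***",
        "swr_user": "***",
    },
    custom={
        # Only the mandatory custom parameters plus one override are set here.
        "component_schedulername": "default-scheduler",
        "nvidia_driver_download_url": "",
        "enable_xgpu": True,
    },
)

# json.dumps always emits strictly valid JSON (no trailing commas).
payload = json.dumps(body, indent=2)
print(payload)
```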
