AI Inference Gateway Add-on
With the rapid development of large language models (LLMs) and AI inference services, cloud native AI teams face increasingly complex inference traffic management. In addition to path-based HTTP routing, AI inference applications need flexible traffic distribution and grayscale releases based on AI service attributes such as model name, inference priority, and model version to meet diversified service requirements.
To address this, CCE standard and Turbo clusters provide the AI Inference Gateway add-on, which is built on Gateway API Inference Extension, an inference traffic management solution launched by the Kubernetes community on top of Gateway API. The add-on simplifies the interaction between frontend and backend services and improves the performance, reliability, and security of AI services. It has the following advantages:
- Model-aware routing: Traffic can be distributed based on HTTP path matching as well as service attributes such as model names, model versions, and Low-Rank Adaptation (LoRA) adapters.
- Inference priority-based scheduling: You can set the criticality of different models or service scenarios to achieve more flexible task scheduling and higher resource utilization.
- Grayscale releases of models: Traffic can be split by model name and weight for a smooth rollout of new model versions (see the sketch after this list).
- Intelligent load balancing: Real-time metrics of model servers are collected to select endpoints based on load and health, which reduces inference latency and improves resource usage.
- Open, extensible standards: The solution fully complies with the Kubernetes Gateway API ecosystem and supports custom inference routing extensions, facilitating integration with mainstream gateways such as Istio with Gateway API enabled.
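The following InferenceModel sketch shows how these capabilities map to the Inference Extension resources. The model and pool names (food-review, vllm-llama3-8b-instruct) and the version split are illustrative, and the fields follow the upstream v1alpha2 CRDs, so verify them against the CRD version shipped with the add-on.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review            # Model name used for model-aware routing
  criticality: Standard             # Inference priority: Critical, Standard, or Sheddable
  poolRef:
    name: vllm-llama3-8b-instruct   # Backend InferencePool that serves this model
  targetModels:                     # Grayscale release: split traffic across model versions or LoRA adapters
  - name: food-review-v1
    weight: 90
  - name: food-review-v2
    weight: 10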
This add-on is suitable for AI production scenarios that require flexible scheduling, stable inference, and multi-environment collaboration.
- In online inference of LLMs, this add-on can implement intelligent traffic management and precise scheduling based on the model name, version, and priority.
- On the AI inference platform where multiple models and versions coexist, this add-on allows grayscale releases and flexible traffic distribution to ensure high resource utilization and improved inference performance.
- On the cloud native AI platform, this add-on integrates seamlessly with mainstream inference engines (such as vLLM and Triton) to ensure high scalability and flexible deployment of the platform.
Prerequisites
- You have created a CCE standard or Turbo cluster of v1.30 or later.
- You have created nodes of the required type and installed the corresponding add-ons based on your service type, and the node resources are sufficient.
- NPU services: NPU nodes with the CCE AI Suite (Ascend NPU) add-on installed
- GPU services: GPU nodes with the CCE AI Suite (NVIDIA GPU) add-on installed
- You have installed Istio (dev version: 1.26-alpha.80c74f7f43482c226f4f4b10b4dda6261b67a71f) and enabled the Gateway API capabilities. For details, see Getting Started and Kubernetes Gateway API.
- Istio is used to provide the basic routing capability. In AI scenarios, after this add-on is installed, intelligent traffic distribution and custom routing can be implemented at the traffic ingress layer based on service attributes such as model names. This ensures that external requests can be forwarded to the corresponding AI model service instances based on service requirements.
- The Gateway API capabilities define the entry through which external traffic reaches the Kubernetes cluster. You need to specify the listening port, protocol, and bound gateway address. All HTTP traffic (such as calls to the /v1/completions API) flows to the AI inference service through this unified entry (see the Gateway sketch after this list).
- (Optional) If there is a large volume of traffic, you have created a LoadBalancer Service as the entry for production traffic. For details, see LoadBalancer.
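The following Gateway sketch shows a minimal unified entry. It assumes the Istio GatewayClass and the gateway name (inference-gateway) used in the SIG example later in this document; adjust them to your environment.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: istio    # GatewayClass provided by the Istio installation
  listeners:
  - name: http
    protocol: HTTP
    port: 80                 # Listening port for the unified traffic entry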
Precautions
- After the add-on is installed, the native Kubernetes traffic ingress resources such as Services and Ingresses are not affected.
- This add-on is experimental. Its functions evolve rapidly with the community. Pay attention to version compatibility if an upgrade or change is required.
- This add-on is being rolled out by region. For details about the regions where it is available, see the console.
- This add-on is in the open beta testing (OBT) phase. You can experience the latest add-on features. However, the stability of this add-on version has not been fully verified, and the CCE SLA does not cover this version.
Installing the Add-on
Before installing the add-on, ensure that Istio (dev version: 1.26-alpha.80c74f7f43482c226f4f4b10b4dda6261b67a71f) has been installed and the Gateway API capabilities have been enabled. Otherwise, the add-on may fail to be installed.
- Log in to the CCE console and click the cluster name to access the cluster console.
- In the navigation pane, choose Add-ons. On the displayed page, locate AI Inference Gateway and click Install.
- In the lower right corner of the Install Add-on page, click Install. Inference Extension CRDs (excluding the runtime container) will be automatically deployed. If Installed is displayed in the upper right corner of the add-on card, the installation is successful.
You can deploy AI traffic management resources such as HTTPRoute, InferenceModel, and InferencePool using kubectl.
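To confirm that the CRDs are registered, you can list them with kubectl. The CRD names below follow the upstream Inference Extension and may differ slightly by add-on version.
kubectl get crd | grep inference.networking.x-k8s.io
# Expected output (names may vary by version):
# inferencemodels.inference.networking.x-k8s.io
# inferencepools.inference.networking.x-k8s.io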
Uninstalling the Add-on
- Log in to the CCE console and click the cluster name to access the cluster console.
- In the navigation pane, choose Add-ons. On the displayed page, locate AI Inference Gateway and click Uninstall.
- In the Uninstall Add-on dialog box, enter DELETE and click OK. If Not installed is displayed in the upper right corner of the add-on card, the add-on has been uninstalled.
Before uninstalling the add-on, you can clear the related custom resources such as InferenceModel, InferencePool, and HTTPRoute.
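For example, a cleanup sketch (the HTTPRoute name llm-route comes from the use case below; adjust resource names and namespaces to your environment):
kubectl delete inferencemodels --all -A      # Delete all InferenceModel resources in all namespaces
kubectl delete inferencepools --all -A       # Delete all InferencePool resources in all namespaces
kubectl delete httproute llm-route           # Delete the custom HTTPRoute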
Use Case
The following is a typical use case of the add-on. For details, see Getting Started with AI Inference Gateway, a case provided by the Kubernetes Special Interest Groups (SIGs). In this case, vLLM is deployed on GPU-based model servers. External YAML files need to be downloaded, so you need to bind an EIP to a node in the cluster. EIPs are billed. For details, see EIP Price Calculator.
- Install kubectl on an existing ECS and access the cluster using kubectl. For details, see Accessing a Cluster Using kubectl.
- Deploy the vLLM service using gpu-deployment.yaml. This creates the AI inference service instances that the AI Gateway will use for subsequent traffic distribution, model inference, and load balancing.
# Create a Secret with your Hugging Face token, which is needed to pull the model.
kubectl create secret generic hf-token --from-literal=token=<your-hf-token>
# Deploy the vLLM model servers on GPU nodes.
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/vllm/gpu-deployment.yaml
- Deploy the InferenceModel using inferencemodel.yaml and the InferencePool using inferencepool-resources.yaml. These resources define the attributes, versions, priorities, and backend service pools of the model services so that the AI Gateway can orchestrate multiple models or instances in a unified manner and manage traffic. A sketch of the InferencePool follows the commands.
# Deploy the InferenceModel.
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/heads/release-0.3/config/manifests/inferencemodel.yaml
# Deploy the InferencePool and related resources.
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/heads/release-0.3/config/manifests/inferencepool-resources.yaml
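A minimal InferencePool sketch follows. It assumes the pool, selector, and endpoint picker names used in the SIG manifests; the fields follow the upstream v1alpha2 CRD and may differ by version.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: vllm-llama3-8b-instruct
spec:
  targetPortNumber: 8000                # Port on which the vLLM Pods serve requests
  selector:
    app: vllm-llama3-8b-instruct        # Matches the Pods created by gpu-deployment.yaml
  extensionRef:
    name: vllm-llama3-8b-instruct-epp   # Endpoint picker that performs metrics-based load balancing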
- Deploy DestinationRule using destination-rule.yaml and custom HTTPRoute using httproute.yaml.
# Deploy DestinationRule.
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/istio/destination-rule.yaml
# Deploy custom HTTPRoute.
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/main/config/manifests/gateway/istio/httproute.yaml
The YAML files are described as follows:
- destination-rule.yaml: defines the connection policy and load balancing behavior that Istio applies when forwarding inter-service traffic. In this example, the file temporarily disables TLS verification (because the add-on uses a self-signed certificate) to ensure that traffic can be forwarded between the Gateway and the backend inference service without failing TLS verification.
- httproute.yaml: deploys the custom HTTPRoute and defines the detailed forwarding and routing rules applied after traffic reaches the Gateway entry. Smart routing based on AI service attributes such as model names, API paths, and request headers is supported, so requests can be distributed to specific model services, versions, or pools on demand. A minimal sketch of such an HTTPRoute follows.
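The sketch below assumes the gateway and pool names used in this case (inference-gateway, vllm-llama3-8b-instruct); the actual httproute.yaml in the repository may differ.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - name: inference-gateway                 # Gateway that receives external traffic
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /                            # Route all paths, including /v1/completions
    backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool                   # Route to the InferencePool instead of a Service
      name: vllm-llama3-8b-instruct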
- Check the HTTPRoute status. If "Accepted=True" and "ResolvedRefs=True" are displayed, the custom HTTPRoute is successfully deployed and traffic management takes effect. Sample status output follows the command.
kubectl get httproute llm-route -o yaml
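The status conditions should look similar to the following abridged output (exact fields depend on the Gateway API version):
status:
  parents:
  - conditions:
    - type: Accepted
      status: "True"
    - type: ResolvedRefs
      status: "True"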
- Obtain the Gateway IP address and initiate an inference request to verify the inference service.
IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}')
PORT=80
curl -i ${IP}:${PORT}/v1/completions -H 'Content-Type: application/json' -d '{
  "model": "food-review",
  "prompt": "Write as if you were a critic: San Francisco",
  "max_tokens": 100,
  "temperature": 0
}'
Release History
Add-on Version | Supported Cluster Version | New Feature
---|---|---
0.3.0 | v1.30, v1.31 | CCE standard and Turbo clusters support the AI Inference Gateway add-on.