Updated on 2026-03-10 GMT+08:00

Using InferencePool and Envoy Gateway to Build an AI Infrastructure Layer

This solution leverages Envoy Gateway, Envoy AI Gateway, and InferencePool to address the key challenges enterprises face when deploying generative AI services in production environments. These challenges include vendor lock-in, weak security controls, limited cost visibility, and complex O&M. The goal is to build a stable, secure, highly scalable AI infrastructure layer for enterprises.

Background

In the current deployment architecture of large language model (LLM) inference services, enterprises are shifting from traditional microservice architectures to AI-native architectures. As generative AI continues to move to production environments, infrastructure teams are encountering several challenges.

  • Vendor lock-in and fragile connectivity: Enterprises often need to integrate with multiple LLM providers, such as OpenAI, Anthropic, and AWS Bedrock, alongside self-built models. Without a unified abstraction layer, switching providers or handling single-point failures can lead to service interruptions. Cross-provider automatic disaster recovery becomes difficult, impacting service continuity.
  • Lack of enterprise-grade security isolation: There is no unified access control (RBAC) or rate limiting for AI service invocation. In addition, directly exposing API keys to applications can cause sensitive credential leakage, and the absence of unified identity authentication for egress traffic introduces significant security vulnerabilities.
  • "Black-box" cost and performance visibility: Calling LLMs is expensive, and service response latency fluctuates greatly. Traditional monitoring tools cannot deeply analyze token consumption, model usage patterns, or response performance. As a result, enterprises struggle to understand cost structures and performance bottlenecks of generative AI, and cannot effectively optimize and allocate resources.
  • Lack of traffic management standards: As LLM inference scales, there is no standardized approach, such as native Kubernetes APIs, to manage model version switching, weighted traffic routing, or complex header-based routing. O&M complexity grows exponentially with the number of models, reducing O&M efficiency.

Solution

To address these challenges, this solution adopts an integrated architecture combining Envoy Gateway, Envoy AI Gateway, and InferencePool, built on Envoy's mature, production-proven proxy technology. This architecture provides a stable, secure, highly scalable AI infrastructure layer that eliminates generative AI deployment bottlenecks.

  • Standard implementation based on Kubernetes Gateway API
    • Advanced traffic management: It uses HTTPRoute for weighted traffic splitting of inference services and supports progressive deployment policies such as blue-green deployment.
    • In-depth protocol routing: It can identify OpenAI-compatible protocol headers and enables fine-grained traffic routing based on the model field or custom service headers.
    • Native mesh compatibility: It seamlessly integrates with popular service meshes, such as Istio, for end-to-end traffic encryption and centralized governance.
  • Cross-provider scalable connectivity
    • Multi-source abstraction: It can access cloud services, such as OpenAI and AWS Bedrock, as well as enterprise-built InferencePools, providing a unified interface for calling models centrally.
    • Intelligent DR: When a self-built model pool is overloaded or third-party APIs are unavailable, the system automatically switches to the standby model to ensure high service availability and service continuity.
  • Enterprise-grade security and compliance
    • Upstream authentication: It manages providers' API keys at the gateway layer centrally and isolates the application layer from credentials to reduce leakage risks.
    • Fine-grained management and control: It supports policy-based access control and multi-dimensional rate limiting to prevent API abuse and ensure system stability and compliance.
  • Comprehensive observability and scalability
    • Cost and performance analytics: It traces token consumption, model usage distribution, and response latency in real time, provides key performance indicators (KPIs) for enterprises, and supports resource optimization and cost control.
    • Pluggable architecture: It inherits the extension capabilities of Envoy, supports quick development of customized functions, such as request rewriting and custom filters, through plug-ins, and flexibly adapts to the ever-evolving AI technology ecosystem.
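As an illustration of the weighted traffic splitting mentioned above, a standard Gateway API HTTPRoute can split traffic between two inference pools. This is only a sketch: the gateway name, pool names, and 90/10 weights below are assumptions, not part of this solution's manifests.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-blue-green
  namespace: default
spec:
  parentRefs:
    - name: inference-gateway   # hypothetical Gateway
  rules:
    - backendRefs:
        # 90% of requests go to the current model pool...
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: model-v1
          weight: 90
        # ...and 10% go to the new version as a canary.
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: model-v2
          weight: 10
```

Adjusting the weights over successive applies implements a progressive (blue-green or canary) rollout without any client-side changes.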

For more information about Envoy AI Gateway, see Envoy AI Gateway Overview.

Prerequisites

  • A cluster of v1.32 or later is available.
  • The needed images are ready.

    The following images are required. Download them in advance on a PC that can access the Internet.

    1. Download the images.
      docker pull docker.io/envoyproxy/gateway-dev:latest
      docker pull docker.io/envoyproxy/ratelimit:master
      docker pull docker.io/envoyproxy/ai-gateway-extproc
      docker pull docker.io/envoyproxy/ai-gateway-controller
      docker pull ghcr.io/llm-d/llm-d-inference-sim:v0.4.0
      docker pull registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1
      docker pull docker.io/envoyproxy/ai-gateway-testupstream:latest
      docker pull docker.io/envoyproxy/envoy:distroless-dev
    2. Push the downloaded images to the SWR image repository to ensure that all nodes in the Kubernetes cluster can pull them.

      For details about how to push an image, see Pushing an Image.
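The tag-and-push step can be scripted. The sketch below only prints the docker commands so they can be reviewed before running; the SWR endpoint (swr.cn-north-4.myhuaweicloud.com) and organization (my-org) are placeholders that you must replace, and the remaining images follow the same pattern.

```shell
# Placeholder SWR repository path; replace with your own endpoint and organization.
SWR_REPO="swr.cn-north-4.myhuaweicloud.com/my-org"
IMAGES="docker.io/envoyproxy/gateway-dev:latest
docker.io/envoyproxy/ratelimit:master
ghcr.io/llm-d/llm-d-inference-sim:v0.4.0
registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1"
for src in $IMAGES; do
  dst="$SWR_REPO/${src##*/}"   # keep only the final <name>:<tag> component
  echo "docker tag $src $dst"
  echo "docker push $dst"
done
```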

Procedure

  1. Install Envoy Gateway on a node.

    1. Install Helm. Helm 3.19.3 is used as an example.
      curl -O https://get.helm.sh/helm-v3.19.3-linux-amd64.tar.gz
      tar xvf helm-v3.19.3-linux-amd64.tar.gz 
      cp ./linux-amd64/helm /usr/local/bin/ 
      helm version

      If information similar to the following is displayed, Helm has been installed:

      version.BuildInfo{Version:"v3.19.3", GitCommit:"0707f566a3f4ced24009ef14d67fe0ce69db****", GitTreeState:"clean", GoVersion:"go1.24.10"}
    2. Obtain the Envoy Gateway Helm chart.
      helm pull oci://docker.io/envoyproxy/gateway-helm --version v0.0.0-latest
      tar xvf gateway-helm-v0.0.0-latest.tgz
      cd gateway-helm
    3. Modify the image information in the values.yaml file.
      vi values.yaml

      Replace the default image with the one that has been pushed to SWR.

      docker.io/envoyproxy/gateway-dev:latest
      docker.io/envoyproxy/ratelimit:master
    4. Prepare the Envoy Gateway configuration file.
      1. Create a basic configuration file named envoy-gateway-values.yaml.
        # Copyright Envoy AI Gateway Authors
        # SPDX-License-Identifier: Apache-2.0
        # The full text of the Apache license is available in the LICENSE file at
        # the root of the repo.
        
        # This file contains the base Envoy Gateway helm values needed for AI Gateway integration.
        # This is the minimal configuration that all AI Gateway deployments need.
        #
        # Use this file when installing Envoy Gateway with:
        #   helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
        #     --version v0.0.0-latest \
        #     --namespace envoy-gateway-system \
        #     --create-namespace \
        #     -f envoy-gateway-values.yaml
        #
        # For additional features, combine with addon values files:
        #   -f envoy-gateway-values.yaml -f examples/token_ratelimit/envoy-gateway-values-addon.yaml
        #   -f envoy-gateway-values.yaml -f examples/inference-pool/envoy-gateway-values-addon.yaml
        
        config:
          envoyGateway:
            gateway:
              controllerName: gateway.envoyproxy.io/gatewayclass-controller
            logging:
              level:
                default: info
            provider:
              type: Kubernetes
            extensionApis:
              # Not strictly required, but recommended for backward/future compatibility.
              enableEnvoyPatchPolicy: true
              # Required: Enable Backend API for AI service backends.
              enableBackend: true
            # Required: AI Gateway needs to fine-tune xDS resources generated by Envoy Gateway.
            extensionManager:
              hooks:
                xdsTranslator:
                  translation:
                    listener:
                      includeAll: true
                    route:
                      includeAll: true
                    cluster:
                      includeAll: true
                    secret:
                      includeAll: true
                  post:
                    - Translation
                    - Cluster
                    - Route
              service:
                fqdn:
                  # IMPORTANT: Update this to match your AI Gateway controller service
                  # Format: <service-name>.<namespace>.svc.cluster.local
                  # Default if you followed the installation steps above:
                  hostname: ai-gateway-controller.envoy-ai-gateway-system.svc.cluster.local
                  port: 1063
      2. Create a plug-in configuration file named envoy-gateway-values-addon.yaml.
        # Copyright Envoy AI Gateway Authors
        # SPDX-License-Identifier: Apache-2.0
        # The full text of the Apache license is available in the LICENSE file at
        # the root of the repo.
        
        # This addon file adds InferencePool support to Envoy Gateway.
        # Use this in combination with the base envoy-gateway-values.yaml:
        #
        #   helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
        #     --version v0.0.0-latest \
        #     --namespace envoy-gateway-system \
        #     --create-namespace \
        #     -f ../../manifests/envoy-gateway-values.yaml \
        #     -f envoy-gateway-values-addon.yaml
        #
        # You can also combine with rate limiting:
        #   -f ../../manifests/envoy-gateway-values.yaml \
        #   -f ../token_ratelimit/envoy-gateway-values-addon.yaml \
        #   -f envoy-gateway-values-addon.yaml
        
        config:
          envoyGateway:
            extensionManager:
              # Enable InferencePool custom resource support
              backendResources:
                - group: inference.networking.k8s.io
                  kind: InferencePool
                  version: v1
    5. Install Envoy Gateway.
      helm upgrade -i eg . \
        --version v0.0.0-latest \
        --namespace envoy-gateway-system \
        --create-namespace \
        -f envoy-gateway-values.yaml \
        -f envoy-gateway-values-addon.yaml

      If the value of STATUS is deployed in the command output, Envoy Gateway has been installed.

    6. Verify the deployment status on the console.
      1. Access the cluster console. In the navigation pane, choose Workloads. In the right pane, click the Deployments tab and check whether the status of envoy-gateway is Running.
      2. On the Services tab, check whether the Service associated with envoy-gateway has been created properly.

  2. Obtain and install the InferencePool CRD.

    wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.1/manifests.yaml
    kubectl apply -f manifests.yaml

  3. Install Envoy AI Gateway.

    1. Obtain and install the Envoy AI Gateway CRD.
      # Obtain the Helm package of the Envoy AI Gateway CRD.
      helm pull oci://docker.io/envoyproxy/ai-gateway-crds-helm --version v0.0.0-latest
      
      # Extract and install the Envoy AI Gateway CRD.
      tar xvf ai-gateway-crds-helm-v0.0.0-latest.tgz
      cd ai-gateway-crds-helm
      
      helm upgrade -i aieg-crd . \
        --version v0.0.0-latest \
        --namespace envoy-ai-gateway-system \
        --create-namespace

      If the value of STATUS is deployed in the command output, the CRDs have been installed.

    2. Obtain the Helm package of the Envoy AI Gateway controller.
      helm pull oci://docker.io/envoyproxy/ai-gateway-helm --version v0.0.0-latest
      tar xvf ai-gateway-helm-v0.0.0-latest.tgz
      cd ai-gateway-helm
    3. Modify the image information in the values.yaml file.
      vi values.yaml

      Replace the default image with the one that has been pushed to SWR.

      docker.io/envoyproxy/ai-gateway-extproc
      docker.io/envoyproxy/ai-gateway-controller
    4. Install the Envoy AI Gateway controller.
      helm upgrade -i aieg . \
        --version v0.0.0-latest \
        --namespace envoy-ai-gateway-system \
        --create-namespace

      If the value of STATUS is deployed in the command output, the Envoy AI Gateway controller has been installed.

    5. Verify the deployment status on the console.
      1. Access the cluster console. In the navigation pane, choose Workloads. In the right pane, click the Deployments tab and check whether the status of ai-gateway-controller is Running.
      2. On the Services tab, check whether the Service associated with ai-gateway-controller has been created properly.

  4. Deploy the workload and test the Gateway capability.

    1. Obtain and deploy the simulated vLLM model (Llama3-8b).
      1. Obtain the configuration file.
        # vLLM simulation backend
        wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.1/config/manifests/vllm/sim-deployment.yaml
        # InferenceObjective
        wget https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/v1.0.1/config/manifests/inferenceobjective.yaml
        # InferencePool resources
        wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.1/config/manifests/inferencepool-resources.yaml
      2. Modify the image information in the sim-deployment.yaml file.
        vi sim-deployment.yaml

        Replace the default image with the one that has been pushed to SWR.

        ghcr.io/llm-d/llm-d-inference-sim:v0.4.0
      3. Modify the image information in the inferencepool-resources.yaml file.
        vi inferencepool-resources.yaml

        Replace the default image with the one that has been pushed to SWR.

        registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1
    2. Obtain and deploy the simulated Mistral.

      Replace the image name with the image prepared in the prerequisites.

      docker.io/envoyproxy/ai-gateway-testupstream:latest
      registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1

      The following is a code example:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mistral-upstream
        namespace: default
      spec:
        replicas: 3
        selector:
          matchLabels:
            app: mistral-upstream
        template:
          metadata:
            labels:
              app: mistral-upstream
          spec:
            containers:
              - name: testupstream
                image: docker.io/envoyproxy/ai-gateway-testupstream:latest
                imagePullPolicy: IfNotPresent
                ports:
                  - containerPort: 8080
                env:
                  - name: TESTUPSTREAM_ID
                    value: test
                readinessProbe:
                  httpGet:
                    path: /health
                    port: 8080
                  initialDelaySeconds: 1
                  periodSeconds: 1
      ---
      apiVersion: inference.networking.k8s.io/v1
      kind: InferencePool
      metadata:
        name: mistral
        namespace: default
      spec:
        targetPorts:
          - number: 8080
        selector:
          matchLabels:
            app: mistral-upstream
        endpointPickerRef:
          name: mistral-epp
          port:
            number: 9002
      ---
      apiVersion: inference.networking.x-k8s.io/v1alpha2
      kind: InferenceObjective
      metadata:
        name: mistral
        namespace: default
      spec:
        priority: 10
        poolRef:
          # Bind the InferenceObjective to the InferencePool.
          name: mistral
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: mistral-epp
        namespace: default
      spec:
        selector:
          app: mistral-epp
        ports:
          - protocol: TCP
            port: 9002
            targetPort: 9002
            appProtocol: http2
        type: ClusterIP
      ---
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: mistral-epp
        namespace: default
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mistral-epp
        namespace: default
        labels:
          app: mistral-epp
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: mistral-epp
        template:
          metadata:
            labels:
              app: mistral-epp
          spec:
            serviceAccountName: mistral-epp
            # Conservatively, this timeout should mirror the longest grace period of the pods within the pool
            terminationGracePeriodSeconds: 130
            containers:
              - name: epp
                image: registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1
                imagePullPolicy: IfNotPresent
                args:
                  - --pool-name
                  - "mistral"
                  - "--pool-namespace"
                  - "default"
                  - --v
                  - "4"
                  - --zap-encoder
                  - "json"
                  - --grpc-port
                  - "9002"
                  - --grpc-health-port
                  - "9003"
                  - "--config-file"
                  - "/config/default-plugins.yaml"
                ports:
                  - containerPort: 9002
                  - containerPort: 9003
                  - name: metrics
                    containerPort: 9090
                livenessProbe:
                  grpc:
                    port: 9003
                    service: inference-extension
                  initialDelaySeconds: 5
                  periodSeconds: 10
                readinessProbe:
                  grpc:
                    port: 9003
                    service: inference-extension
                  initialDelaySeconds: 5
                  periodSeconds: 10
                volumeMounts:
                  - name: plugins-config-volume
                    mountPath: "/config"
            volumes:
              - name: plugins-config-volume
                configMap:
                  name: plugins-config
      ---
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: plugins-config
        namespace: default
      data:
        default-plugins.yaml: |
          apiVersion: inference.networking.x-k8s.io/v1alpha1
          kind: EndpointPickerConfig
          plugins:
          - type: queue-scorer
          - type: kv-cache-utilization-scorer
          - type: prefix-cache-scorer
          schedulingProfiles:
          - name: default
            plugins:
            - pluginRef: queue-scorer
            - pluginRef: kv-cache-utilization-scorer
            - pluginRef: prefix-cache-scorer
      ---
      kind: Role
      apiVersion: rbac.authorization.k8s.io/v1
      metadata:
        name: pod-read
        namespace: default
      rules:
        - apiGroups: ["inference.networking.x-k8s.io"]
          resources: ["inferenceobjectives", "inferencepools"]
          verbs: ["get", "watch", "list"]
        - apiGroups: ["inference.networking.k8s.io"]
          resources: ["inferencepools"]
          verbs: ["get", "watch", "list"]
        - apiGroups: [""]
          resources: ["pods"]
          verbs: ["get", "watch", "list"]
      ---
      kind: RoleBinding
      apiVersion: rbac.authorization.k8s.io/v1
      metadata:
        name: pod-read-binding
        namespace: default
      subjects:
        - kind: ServiceAccount
          name: mistral-epp
          namespace: default
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: Role
        name: pod-read
      ---
      kind: ClusterRole
      apiVersion: rbac.authorization.k8s.io/v1
      metadata:
        name: auth-reviewer
      rules:
        - apiGroups:
            - authentication.k8s.io
          resources:
            - tokenreviews
          verbs:
            - create
        - apiGroups:
            - authorization.k8s.io
          resources:
            - subjectaccessreviews
          verbs:
            - create
      ---
      kind: ClusterRoleBinding
      apiVersion: rbac.authorization.k8s.io/v1
      metadata:
        name: auth-reviewer-binding
      subjects:
        - kind: ServiceAccount
          name: mistral-epp
          namespace: default
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: auth-reviewer
    3. Obtain and use an AIServiceBackend to deploy a traditional backend.

      Replace the image name with the image prepared in the prerequisites.

      docker.io/envoyproxy/ai-gateway-testupstream:latest

      The following is a code example:

      apiVersion: gateway.envoyproxy.io/v1alpha1
      kind: Backend
      metadata:
        name: envoy-ai-gateway-basic-testupstream
        namespace: default
      spec:
        endpoints:
          - fqdn:
              hostname: envoy-ai-gateway-basic-testupstream.default.svc.cluster.local
              port: 80
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: envoy-ai-gateway-basic-testupstream
        namespace: default
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: envoy-ai-gateway-basic-testupstream
        template:
          metadata:
            labels:
              app: envoy-ai-gateway-basic-testupstream
          spec:
            containers:
              - name: testupstream
                image: docker.io/envoyproxy/ai-gateway-testupstream:latest
                imagePullPolicy: IfNotPresent
                ports:
                  - containerPort: 8080
                env:
                  - name: TESTUPSTREAM_ID
                    value: test
                readinessProbe:
                  httpGet:
                    path: /health
                    port: 8080
                  initialDelaySeconds: 1
                  periodSeconds: 1
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: envoy-ai-gateway-basic-testupstream
        namespace: default
      spec:
        selector:
          app: envoy-ai-gateway-basic-testupstream
        ports:
          - protocol: TCP
            port: 80
            targetPort: 8080
        type: ClusterIP
    4. Deploy the Gateway.

      To quickly verify the Gateway routing capabilities on an intranet, customize the Envoy proxy configuration so that the Gateway is exposed through a NodePort Service.

      Replace the image name with the image prepared in the prerequisites.

      docker.io/envoyproxy/envoy:distroless-dev

      The following is a code example:

      apiVersion: gateway.networking.k8s.io/v1
      kind: GatewayClass
      metadata:
        name: inference-pool-with-aigwroute
      spec:
        controllerName: gateway.envoyproxy.io/gatewayclass-controller
        parametersRef:
          group: gateway.envoyproxy.io
          kind: EnvoyProxy
          name: nodeport-config
          namespace: envoy-gateway-system
      ---
      apiVersion: gateway.networking.k8s.io/v1
      kind: Gateway
      metadata:
        name: inference-pool-with-aigwroute
        namespace: default
      spec:
        gatewayClassName: inference-pool-with-aigwroute
        listeners:
          - name: http
            protocol: HTTP
            port: 80
      ---
      apiVersion: aigateway.envoyproxy.io/v1alpha1
      kind: AIGatewayRoute
      metadata:
        name: inference-pool-with-aigwroute
        namespace: default
      spec:
        parentRefs:
          - name: inference-pool-with-aigwroute
            kind: Gateway
            group: gateway.networking.k8s.io
        rules:
          # Route for vLLM Llama model via InferencePool
          - matches:
              - headers:
                  - type: Exact
                    name: x-ai-eg-model
                    value: meta-llama/Llama-3.1-8B-Instruct
            backendRefs:
              - group: inference.networking.k8s.io
                kind: InferencePool
                name: vllm-llama3-8b-instruct
          # Route for Mistral model via InferencePool
          - matches:
              - headers:
                  - type: Exact
                    name: x-ai-eg-model
                    value: mistral:latest
            backendRefs:
              - group: inference.networking.k8s.io
                kind: InferencePool
                name: mistral
          # Route for traditional backend (non-InferencePool)
          - matches:
              - headers:
                  - type: Exact
                    name: x-ai-eg-model
                    value: some-cool-self-hosted-model
            backendRefs:
              - name: envoy-ai-gateway-basic-testupstream
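      Note that the GatewayClass above references an EnvoyProxy resource named nodeport-config that is not included in the example. A minimal sketch of such a resource, which tells Envoy Gateway to expose the generated Envoy Service as a NodePort (the exact spec fields may vary with the Envoy Gateway version), is:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: nodeport-config
  namespace: envoy-gateway-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        # Expose the Envoy listener through a NodePort Service for intranet access.
        type: NodePort
```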
    5. Verify the deployment status.
      1. Access the cluster console. In the navigation pane, choose Workloads. In the right pane, click the Deployments tab and check whether the statuses of all workloads are Running.

      2. On the Services tab, check whether the needed Services have been created properly.

  5. Test the Gateway's capability of routing requests to different models or backends.

    1. Obtain the access address of the Gateway.
      1. Check the Gateway's endpoint.
        kubectl get gateway inference-pool-with-aigwroute -o jsonpath='{.status.addresses[0].value}'
      2. Combine the obtained endpoint with the node port of the gateway's NodePort Service in the format http://[endpoint]:[NodePort]. Use this complete address in all subsequent tests.
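        For example, the address can be assembled into the GATEWAY_IP variable used by the curl commands below; both values here are placeholders to be replaced with the actual command outputs:

```shell
# Placeholder values; replace with the endpoint and node port obtained above.
GATEWAY_ENDPOINT="192.168.0.10"
NODE_PORT="31080"
GATEWAY_IP="$GATEWAY_ENDPOINT:$NODE_PORT"
echo "Test address: http://$GATEWAY_IP/v1/chat/completions"
```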
    2. Test the routing capability.
      1. Test the Llama-3 model route.

        Verify the route forwarding from the Gateway to Llama-3.

        curl -H "Content-Type: application/json" \
          -d '{
                "model": "meta-llama/Llama-3.1-8B-Instruct",
                "messages": [
                    {
                        "role": "user",
                        "content": "Hi. Say this is a test"
                    }
                ]
            }' \
          http://$GATEWAY_IP/v1/chat/completions
      2. Test the Mistral model route.

        Verify the route forwarding from the Gateway to Mistral.

        curl -H "Content-Type: application/json" \
          -d '{
                "model": "mistral:latest",
                "messages": [
                    {
                        "role": "user",
                        "content": "Hi. Say this is a test"
                    }
                ]
            }' \
          http://$GATEWAY_IP/v1/chat/completions

        If information similar to the following is returned, the route is working:

        {"choices":[{"message":{"content":"This is a test.","role":"assistant"}}]}
      3. Test the common backend load balancer route.

        Verify the route forwarding from the Gateway to the custom backend load balancer.

        curl -H "Content-Type: application/json" \
          -d '{
                "model": "some-cool-self-hosted-model",
                "messages": [
                    {
                        "role": "user",
                        "content": "Hi. Say this is a test"
                    }
                ]
            }' \
          http://$GATEWAY_IP/v1/chat/completions

        If information similar to the following is returned, the route is working:

        {"choices":[{"message":{"role":"assistant","content":"I am the captain of my soul."}}]}