Updated on 2026-03-10 GMT+08:00

Using InferencePool and Envoy Gateway to Build an AI Infrastructure Layer

This solution leverages Envoy Gateway, Envoy AI Gateway, and InferencePool to address the key challenges enterprises face when deploying generative AI services in production environments. These challenges include vendor lock-in, weak security controls, limited cost visibility, and complex O&M. The goal is to build a stable, secure, highly scalable AI infrastructure layer for enterprises.

Background

In the current deployment architecture of large language model (LLM) inference services, enterprises are shifting from traditional microservice architectures to AI-native architectures. As generative AI continues to move to production environments, infrastructure teams are encountering several challenges.

  • Vendor lock-in and fragile connectivity: Enterprises often need to integrate with multiple LLM providers, such as OpenAI, Anthropic, and AWS Bedrock, alongside self-built models. Without a unified abstraction layer, switching providers or handling single-point failures can lead to service interruptions. Cross-provider automatic disaster recovery becomes difficult, impacting service continuity.
  • Lack of enterprise-grade security isolation: There is no unified access control (RBAC) or rate limiting for AI service invocation. In addition, directly exposing API keys to applications can cause sensitive credential leakage, and the absence of unified identity authentication for egress traffic introduces significant security vulnerabilities.
  • "Black-box" cost and performance visibility: Calling LLMs is expensive, and service response latency fluctuates greatly. Traditional monitoring tools cannot deeply analyze token consumption, model usage patterns, or response performance. As a result, enterprises struggle to understand cost structures and performance bottlenecks of generative AI, and cannot effectively optimize and allocate resources.
  • Lack of traffic management standards: As LLM inference scales, there is no standardized approach, such as native Kubernetes APIs, to manage model version switching, weighted traffic routing, or complex header-based routing. O&M complexity grows exponentially with the number of models, reducing O&M efficiency.

Solution

To address these challenges, this solution adopts an integrated architecture combining Envoy Gateway, Envoy AI Gateway, and InferencePool, built on Envoy's mature, production-proven proxy technology. This architecture provides a stable, secure, highly scalable AI infrastructure layer that eliminates generative AI deployment bottlenecks.

  • Standard implementation based on Kubernetes Gateway API
    • Advanced traffic management: It uses HTTPRoute for weighted traffic splitting of inference services and supports progressive deployment policies such as blue-green deployment.
    • In-depth protocol routing: It can identify OpenAI-compatible protocol headers and enables fine-grained traffic routing based on the model field or custom service headers.
    • Native mesh compatibility: It seamlessly integrates with popular service meshes, such as Istio, for end-to-end traffic encryption and centralized governance.
  • Cross-provider scalable connectivity
    • Multi-source abstraction: It can access cloud services, such as OpenAI and AWS Bedrock, as well as enterprise-built InferencePools, providing a unified interface for calling models centrally.
    • Intelligent DR: When a self-built model pool is overloaded or third-party APIs are unavailable, the system automatically switches to the standby model to ensure high service availability and service continuity.
  • Enterprise-grade security and compliance
    • Upstream authentication: It manages providers' API keys at the gateway layer centrally and isolates the application layer from credentials to reduce leakage risks.
    • Fine-grained management and control: It supports policy-based access control and multi-dimensional rate limiting to prevent API abuse and ensure system stability and compliance.
  • Comprehensive observability and scalability
    • Cost and performance analytics: It traces token consumption, model usage distribution, and response latency in real time, provides key performance indicators (KPIs) for enterprises, and supports resource optimization and cost control.
    • Pluggable architecture: It inherits the extension capabilities of Envoy, supports quick development of customized functions, such as request rewriting and custom filters, through plug-ins, and flexibly adapts to the ever-evolving AI technology ecosystem.
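As an illustration of the weighted traffic splitting mentioned above, a standard Gateway API HTTPRoute can split traffic between two inference pools. This is only a sketch: the gateway name, pool names, and 90/10 weights below are assumptions, not part of this solution's manifests.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-blue-green
  namespace: default
spec:
  parentRefs:
    - name: inference-gateway   # hypothetical Gateway
  rules:
    - backendRefs:
        # 90% of requests go to the current model pool...
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: model-v1
          weight: 90
        # ...and 10% go to the new version as a canary.
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: model-v2
          weight: 10
```

Adjusting the weights over successive applies implements a progressive (blue-green or canary) rollout without any client-side changes.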

For more information about Envoy AI Gateway, see Envoy AI Gateway Overview.

Prerequisites

  • A cluster of v1.32 or later is available.
  • The needed images are ready.

    The following images are required. Download them in advance on a PC that can access the Internet.

    1. Download the images.
      docker pull docker.io/envoyproxy/gateway-dev:latest
      docker pull docker.io/envoyproxy/ratelimit:master
      docker pull docker.io/envoyproxy/ai-gateway-extproc
      docker pull docker.io/envoyproxy/ai-gateway-controller
      docker pull ghcr.io/llm-d/llm-d-inference-sim:v0.4.0
      docker pull registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1
      docker pull docker.io/envoyproxy/ai-gateway-testupstream:latest
      docker pull docker.io/envoyproxy/envoy:distroless-dev
    2. Push the downloaded images to the SWR image repository to ensure that all nodes in the Kubernetes cluster can pull them.

      For details about how to push an image, see Pushing an Image.
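The tag-and-push step can be scripted. The sketch below only prints the docker commands so they can be reviewed before running; the SWR endpoint (swr.cn-north-4.myhuaweicloud.com) and organization (my-org) are placeholders that you must replace, and the remaining images follow the same pattern.

```shell
# Placeholder SWR repository path; replace with your own endpoint and organization.
SWR_REPO="swr.cn-north-4.myhuaweicloud.com/my-org"
IMAGES="docker.io/envoyproxy/gateway-dev:latest
docker.io/envoyproxy/ratelimit:master
ghcr.io/llm-d/llm-d-inference-sim:v0.4.0
registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1"
for src in $IMAGES; do
  dst="$SWR_REPO/${src##*/}"   # keep only the final <name>:<tag> component
  echo "docker tag $src $dst"
  echo "docker push $dst"
done
```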

Procedure

  1. Install Envoy Gateway on a node.

    1. Install Helm. Helm 3.19.3 is used as an example.
      curl -O https://get.helm.sh/helm-v3.19.3-linux-amd64.tar.gz
      tar xvf helm-v3.19.3-linux-amd64.tar.gz 
      cp ./linux-amd64/helm /usr/local/bin/ 
      helm version

      If information similar to the following is displayed, Helm has been installed:

      version.BuildInfo{Version:"v3.19.3", GitCommit:"0707f566a3f4ced24009ef14d67fe0ce69db****", GitTreeState:"clean", GoVersion:"go1.24.10"}
    2. Obtain the Envoy Gateway Helm chart.
      helm pull oci://docker.io/envoyproxy/gateway-helm --version v0.0.0-latest
      tar xvf gateway-helm-v0.0.0-latest.tgz
      cd gateway-helm
    3. Modify the image information in the values.yaml file.
      vi values.yaml

      Replace the default image with the one that has been pushed to SWR.

      docker.io/envoyproxy/gateway-dev:latest
      docker.io/envoyproxy/ratelimit:master
    4. Prepare the Envoy Gateway configuration file.
      1. Create a basic configuration file named envoy-gateway-values.yaml.
        # Copyright Envoy AI Gateway Authors
        # SPDX-License-Identifier: Apache-2.0
        # The full text of the Apache license is available in the LICENSE file at
        # the root of the repo.
        
        # This file contains the base Envoy Gateway helm values needed for AI Gateway integration.
        # This is the minimal configuration that all AI Gateway deployments need.
        #
        # Use this file when installing Envoy Gateway with:
        #   helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
        #     --version v0.0.0-latest \
        #     --namespace envoy-gateway-system \
        #     --create-namespace \
        #     -f envoy-gateway-values.yaml
        #
        # For additional features, combine with addon values files:
        #   -f envoy-gateway-values.yaml -f examples/token_ratelimit/envoy-gateway-values-addon.yaml
        #   -f envoy-gateway-values.yaml -f examples/inference-pool/envoy-gateway-values-addon.yaml
        
        config:
          envoyGateway:
            gateway:
              controllerName: gateway.envoyproxy.io/gatewayclass-controller
            logging:
              level:
                default: info
            provider:
              type: Kubernetes
            extensionApis:
              # Not strictly required, but recommended for backward/future compatibility.
              enableEnvoyPatchPolicy: true
              # Required: Enable Backend API for AI service backends.
              enableBackend: true
            # Required: AI Gateway needs to fine-tune xDS resources generated by Envoy Gateway.
            extensionManager:
              hooks:
                xdsTranslator:
                  translation:
                    listener:
                      includeAll: true
                    route:
                      includeAll: true
                    cluster:
                      includeAll: true
                    secret:
                      includeAll: true
                  post:
                    - Translation
                    - Cluster
                    - Route
              service:
                fqdn:
                  # IMPORTANT: Update this to match your AI Gateway controller service
                  # Format: <service-name>.<namespace>.svc.cluster.local
                  # Default if you followed the installation steps above:
                  hostname: ai-gateway-controller.envoy-ai-gateway-system.svc.cluster.local
                  port: 1063
      2. Create a plug-in configuration file named envoy-gateway-values-addon.yaml.
        # Copyright Envoy AI Gateway Authors
        # SPDX-License-Identifier: Apache-2.0
        # The full text of the Apache license is available in the LICENSE file at
        # the root of the repo.
        
        # This addon file adds InferencePool support to Envoy Gateway.
        # Use this in combination with the base envoy-gateway-values.yaml:
        #
        #   helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
        #     --version v0.0.0-latest \
        #     --namespace envoy-gateway-system \
        #     --create-namespace \
        #     -f ../../manifests/envoy-gateway-values.yaml \
        #     -f envoy-gateway-values-addon.yaml
        #
        # You can also combine with rate limiting:
        #   -f ../../manifests/envoy-gateway-values.yaml \
        #   -f ../token_ratelimit/envoy-gateway-values-addon.yaml \
        #   -f envoy-gateway-values-addon.yaml
        
        config:
          envoyGateway:
            extensionManager:
              # Enable InferencePool custom resource support
              backendResources:
                - group: inference.networking.k8s.io
                  kind: InferencePool
                  version: v1
    5. Install Envoy Gateway.
      helm upgrade -i eg . \
        --version v0.0.0-latest \
        --namespace envoy-gateway-system \
        --create-namespace \
        -f envoy-gateway-values.yaml \
        -f envoy-gateway-values-addon.yaml

      If the value of STATUS is deployed in the command output, Envoy Gateway has been installed.

    6. Verify the deployment status on the console.
      1. Access the cluster console. In the navigation pane, choose Workloads. In the right pane, click the Deployments tab and check whether the status of envoy-gateway is Running.
      2. On the Services tab, check whether the Service associated with envoy-gateway has been created properly.

  2. Obtain and install the InferencePool CRD.

    wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.1/manifests.yaml
    kubectl apply -f manifests.yaml

  3. Install Envoy AI Gateway.

    1. Obtain and install the Envoy AI Gateway CRD.
      # Obtain the Helm package of the Envoy AI Gateway CRD.
      helm pull oci://docker.io/envoyproxy/ai-gateway-crds-helm --version v0.0.0-latest
      
      # Extract and install the Envoy AI Gateway CRD.
      tar xvf ai-gateway-crds-helm-v0.0.0-latest.tgz
      cd ai-gateway-crds-helm
      
      helm upgrade -i aieg-crd . \
        --version v0.0.0-latest \
        --namespace envoy-ai-gateway-system \
        --create-namespace

      If the value of STATUS is deployed in the command output, the CRDs have been installed.

    2. Obtain the Helm package of the Envoy AI Gateway controller.
      helm pull oci://docker.io/envoyproxy/ai-gateway-helm --version v0.0.0-latest
      tar xvf ai-gateway-helm-v0.0.0-latest.tgz
      cd ai-gateway-helm
    3. Modify the image information in the values.yaml file.
      vi values.yaml

      Replace the default image with the one that has been pushed to SWR.

      docker.io/envoyproxy/ai-gateway-extproc
      docker.io/envoyproxy/ai-gateway-controller
    4. Install the Envoy AI Gateway controller.
      helm upgrade -i aieg . \
        --version v0.0.0-latest \
        --namespace envoy-ai-gateway-system \
        --create-namespace

      If the value of STATUS is deployed in the command output, the Envoy AI Gateway controller has been installed.

    5. Verify the deployment status on the console.
      1. Access the cluster console. In the navigation pane, choose Workloads. In the right pane, click the Deployments tab and check whether the status of ai-gateway-controller is Running.
      2. On the Services tab, check whether the Service associated with ai-gateway-controller has been created properly.

  4. Deploy the workload and test the Gateway capability.

    1. Obtain and deploy the simulated vLLM model (Llama3-8b).
      1. Obtain the configuration file.
        # vLLM simulation backend
        wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.1/config/manifests/vllm/sim-deployment.yaml
        # InferenceObjective
        wget https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/v1.0.1/config/manifests/inferenceobjective.yaml
        # InferencePool resources
        wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.1/config/manifests/inferencepool-resources.yaml
      2. Modify the image information in the sim-deployment.yaml file.
        vi sim-deployment.yaml

        Replace the default image with the one that has been pushed to SWR.

        ghcr.io/llm-d/llm-d-inference-sim:v0.4.0
      3. Modify the image information in the inferencepool-resources.yaml file.
        vi inferencepool-resources.yaml

        Replace the default image with the one that has been pushed to SWR.

        registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1
    2. Obtain and deploy the simulated Mistral.

      Replace the image name with the image prepared in the prerequisites.

      docker.io/envoyproxy/ai-gateway-testupstream:latest
      registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1

      The following is a code example:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mistral-upstream
        namespace: default
      spec:
        replicas: 3
        selector:
          matchLabels:
            app: mistral-upstream
        template:
          metadata:
            labels:
              app: mistral-upstream
          spec:
            containers:
              - name: testupstream
                image: docker.io/envoyproxy/ai-gateway-testupstream:latest
                imagePullPolicy: IfNotPresent
                ports:
                  - containerPort: 8080
                env:
                  - name: TESTUPSTREAM_ID
                    value: test
                readinessProbe:
                  httpGet:
                    path: /health
                    port: 8080
                  initialDelaySeconds: 1
                  periodSeconds: 1
      ---
      apiVersion: inference.networking.k8s.io/v1
      kind: InferencePool
      metadata:
        name: mistral
        namespace: default
      spec:
        targetPorts:
          - number: 8080
        selector:
          matchLabels:
            app: mistral-upstream
        endpointPickerRef:
          name: mistral-epp
          port:
            number: 9002
      ---
      apiVersion: inference.networking.x-k8s.io/v1alpha2
      kind: InferenceObjective
      metadata:
        name: mistral
        namespace: default
      spec:
        priority: 10
        poolRef:
          # Bind the InferenceObjective to the InferencePool.
          name: mistral
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: mistral-epp
        namespace: default
      spec:
        selector:
          app: mistral-epp
        ports:
          - protocol: TCP
            port: 9002
            targetPort: 9002
            appProtocol: http2
        type: ClusterIP
      ---
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: mistral-epp
        namespace: default
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mistral-epp
        namespace: default
        labels:
          app: mistral-epp
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: mistral-epp
        template:
          metadata:
            labels:
              app: mistral-epp
          spec:
            serviceAccountName: mistral-epp
            # Conservatively, this timeout should mirror the longest grace period of the pods within the pool
            terminationGracePeriodSeconds: 130
            containers:
              - name: epp
                image: registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1
                imagePullPolicy: IfNotPresent
                args:
                  - --pool-name
                  - "mistral"
                  - "--pool-namespace"
                  - "default"
                  - --v
                  - "4"
                  - --zap-encoder
                  - "json"
                  - --grpc-port
                  - "9002"
                  - --grpc-health-port
                  - "9003"
                  - "--config-file"
                  - "/config/default-plugins.yaml"
                ports:
                  - containerPort: 9002
                  - containerPort: 9003
                  - name: metrics
                    containerPort: 9090
                livenessProbe:
                  grpc:
                    port: 9003
                    service: inference-extension
                  initialDelaySeconds: 5
                  periodSeconds: 10
                readinessProbe:
                  grpc:
                    port: 9003
                    service: inference-extension
                  initialDelaySeconds: 5
                  periodSeconds: 10
                volumeMounts:
                  - name: plugins-config-volume
                    mountPath: "/config"
            volumes:
              - name: plugins-config-volume
                configMap:
                  name: plugins-config
      ---
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: plugins-config
        namespace: default
      data:
        default-plugins.yaml: |
          apiVersion: inference.networking.x-k8s.io/v1alpha1
          kind: EndpointPickerConfig
          plugins:
          - type: queue-scorer
          - type: kv-cache-utilization-scorer
          - type: prefix-cache-scorer
          schedulingProfiles:
          - name: default
            plugins:
            - pluginRef: queue-scorer
            - pluginRef: kv-cache-utilization-scorer
            - pluginRef: prefix-cache-scorer
      ---
      kind: Role
      apiVersion: rbac.authorization.k8s.io/v1
      metadata:
        name: pod-read
        namespace: default
      rules:
        - apiGroups: ["inference.networking.x-k8s.io"]
          resources: ["inferenceobjectives", "inferencepools"]
          verbs: ["get", "watch", "list"]
        - apiGroups: ["inference.networking.k8s.io"]
          resources: ["inferencepools"]
          verbs: ["get", "watch", "list"]
        - apiGroups: [""]
          resources: ["pods"]
          verbs: ["get", "watch", "list"]
      ---
      kind: RoleBinding
      apiVersion: rbac.authorization.k8s.io/v1
      metadata:
        name: pod-read-binding
        namespace: default
      subjects:
        - kind: ServiceAccount
          name: mistral-epp
          namespace: default
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: Role
        name: pod-read
      ---
      kind: ClusterRole
      apiVersion: rbac.authorization.k8s.io/v1
      metadata:
        name: auth-reviewer
      rules:
        - apiGroups:
            - authentication.k8s.io
          resources:
            - tokenreviews
          verbs:
            - create
        - apiGroups:
            - authorization.k8s.io
          resources:
            - subjectaccessreviews
          verbs:
            - create
      ---
      kind: ClusterRoleBinding
      apiVersion: rbac.authorization.k8s.io/v1
      metadata:
        name: auth-reviewer-binding
      subjects:
        - kind: ServiceAccount
          name: mistral-epp
          namespace: default
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: auth-reviewer
    3. Obtain and use an AIServiceBackend to deploy a traditional backend.

      Replace the image name with the image prepared in the prerequisites.

      docker.io/envoyproxy/ai-gateway-testupstream:latest

      The following is a code example:

      apiVersion: gateway.envoyproxy.io/v1alpha1
      kind: Backend
      metadata:
        name: envoy-ai-gateway-basic-testupstream
        namespace: default
      spec:
        endpoints:
          - fqdn:
              hostname: envoy-ai-gateway-basic-testupstream.default.svc.cluster.local
              port: 80
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: envoy-ai-gateway-basic-testupstream
        namespace: default
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: envoy-ai-gateway-basic-testupstream
        template:
          metadata:
            labels:
              app: envoy-ai-gateway-basic-testupstream
          spec:
            containers:
              - name: testupstream
                image: docker.io/envoyproxy/ai-gateway-testupstream:latest
                imagePullPolicy: IfNotPresent
                ports:
                  - containerPort: 8080
                env:
                  - name: TESTUPSTREAM_ID
                    value: test
                readinessProbe:
                  httpGet:
                    path: /health
                    port: 8080
                  initialDelaySeconds: 1
                  periodSeconds: 1
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: envoy-ai-gateway-basic-testupstream
        namespace: default
      spec:
        selector:
          app: envoy-ai-gateway-basic-testupstream
        ports:
          - protocol: TCP
            port: 80
            targetPort: 8080
        type: ClusterIP
    4. Deploy the Gateway.

      To quickly verify the Gateway routing capabilities on an intranet, customize the Envoy proxy configuration so that the Gateway is exposed through a NodePort Service.

      Replace the image name with the image prepared in the prerequisites.

      docker.io/envoyproxy/envoy:distroless-dev

      The following is a code example:

      apiVersion: gateway.networking.k8s.io/v1
      kind: GatewayClass
      metadata:
        name: inference-pool-with-aigwroute
      spec:
        controllerName: gateway.envoyproxy.io/gatewayclass-controller
        parametersRef:
          group: gateway.envoyproxy.io
          kind: EnvoyProxy
          name: nodeport-config
          namespace: envoy-gateway-system
      ---
      apiVersion: gateway.networking.k8s.io/v1
      kind: Gateway
      metadata:
        name: inference-pool-with-aigwroute
        namespace: default
      spec:
        gatewayClassName: inference-pool-with-aigwroute
        listeners:
          - name: http
            protocol: HTTP
            port: 80
      ---
      apiVersion: aigateway.envoyproxy.io/v1alpha1
      kind: AIGatewayRoute
      metadata:
        name: inference-pool-with-aigwroute
        namespace: default
      spec:
        parentRefs:
          - name: inference-pool-with-aigwroute
            kind: Gateway
            group: gateway.networking.k8s.io
        rules:
          # Route for vLLM Llama model via InferencePool
          - matches:
              - headers:
                  - type: Exact
                    name: x-ai-eg-model
                    value: meta-llama/Llama-3.1-8B-Instruct
            backendRefs:
              - group: inference.networking.k8s.io
                kind: InferencePool
                name: vllm-llama3-8b-instruct
          # Route for Mistral model via InferencePool
          - matches:
              - headers:
                  - type: Exact
                    name: x-ai-eg-model
                    value: mistral:latest
            backendRefs:
              - group: inference.networking.k8s.io
                kind: InferencePool
                name: mistral
          # Route for traditional backend (non-InferencePool)
          - matches:
              - headers:
                  - type: Exact
                    name: x-ai-eg-model
                    value: some-cool-self-hosted-model
            backendRefs:
              - name: envoy-ai-gateway-basic-testupstream
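      Note that the GatewayClass above references an EnvoyProxy resource named nodeport-config that is not included in the example. A minimal sketch of such a resource, which tells Envoy Gateway to expose the generated Envoy Service as a NodePort (the exact spec fields may vary with the Envoy Gateway version), is:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: nodeport-config
  namespace: envoy-gateway-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        # Expose the Envoy listener through a NodePort Service for intranet access.
        type: NodePort
```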
    5. Verify the deployment status.
      1. Access the cluster console. In the navigation pane, choose Workloads. In the right pane, click the Deployments tab and check whether the statuses of all workloads are Running.

      2. On the Services tab, check whether the needed Services have been created properly.

  5. Test the Gateway's capability of routing requests to different models or backends.

    1. Obtain the access address of the Gateway.
      1. Check the Gateway's endpoint.
        kubectl get gateway inference-pool-with-aigwroute -o jsonpath='{.status.addresses[0].value}'
      2. Combine the obtained endpoint with the node port of the gateway's NodePort Service in the format http://[endpoint]:[NodePort]. Use this complete address in all subsequent tests.
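        For example, the address can be assembled into the GATEWAY_IP variable used by the curl commands below; both values here are placeholders to be replaced with the actual command outputs:

```shell
# Placeholder values; replace with the endpoint and node port obtained above.
GATEWAY_ENDPOINT="192.168.0.10"
NODE_PORT="31080"
GATEWAY_IP="$GATEWAY_ENDPOINT:$NODE_PORT"
echo "Test address: http://$GATEWAY_IP/v1/chat/completions"
```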
    2. Test the routing capability.
      1. Test the Llama-3 model route.

        Verify the route forwarding from the Gateway to Llama-3.

        curl -H "Content-Type: application/json" \
          -d '{
                "model": "meta-llama/Llama-3.1-8B-Instruct",
                "messages": [
                    {
                        "role": "user",
                        "content": "Hi. Say this is a test"
                    }
                ]
            }' \
          http://$GATEWAY_IP/v1/chat/completions
      2. Test the Mistral model route.

        Verify the route forwarding from the Gateway to Mistral.

        curl -H "Content-Type: application/json" \
          -d '{
                "model": "mistral:latest",
                "messages": [
                    {
                        "role": "user",
                        "content": "Hi. Say this is a test"
                    }
                ]
            }' \
          http://$GATEWAY_IP/v1/chat/completions

        If information similar to the following is returned, the route is working:

        {"choices":[{"message":{"content":"This is a test.","role":"assistant"}}]}
      3. Test the common backend load balancer route.

        Verify the route forwarding from the Gateway to the custom backend load balancer.

        curl -H "Content-Type: application/json" \
          -d '{
                "model": "some-cool-self-hosted-model",
                "messages": [
                    {
                        "role": "user",
                        "content": "Hi. Say this is a test"
                    }
                ]
            }' \
          http://$GATEWAY_IP/v1/chat/completions

        If information similar to the following is returned, the route is working:

        {"choices":[{"message":{"role":"assistant","content":"I am the captain of my soul."}}]}