Using InferencePool and Envoy Gateway to Build an AI Infrastructure Layer
This solution leverages Envoy Gateway, Envoy AI Gateway, and InferencePool to address the key challenges enterprises face when deploying generative AI services in production environments. These challenges include vendor lock-in, weak security controls, limited cost visibility, and complex O&M. The goal is to build a stable, secure, highly scalable AI infrastructure layer for enterprises.
Background
In the current deployment architecture of large language model (LLM) inference services, enterprises are shifting from traditional microservice architectures to AI-native architectures. As generative AI continues to move to production environments, infrastructure teams are encountering several challenges.
- Vendor lock-in and fragile connectivity: Enterprises often need to integrate with multiple LLM providers, such as OpenAI, Anthropic, and AWS Bedrock, alongside self-built models. Without a unified abstraction layer, switching providers or handling single-point failures can lead to service interruptions. Cross-provider automatic disaster recovery becomes difficult, impacting service continuity.
- Lack of enterprise-grade security isolation: There is no unified access control (RBAC) or rate limiting for AI service invocation. In addition, directly exposing API keys to applications can cause sensitive credential leakage, and the absence of unified identity authentication for egress traffic introduces significant security vulnerabilities.
- "Black-box" cost and performance visibility: Calling LLMs is expensive, and service response latency fluctuates greatly. Traditional monitoring tools cannot deeply analyze token consumption, model usage patterns, or response performance. As a result, enterprises struggle to understand cost structures and performance bottlenecks of generative AI, and cannot effectively optimize and allocate resources.
- Lack of traffic management standards: As LLM inference scales, there is no standardized approach, such as native Kubernetes APIs, to manage model version switching, weighted traffic routing, or complex header-based routing. O&M complexity grows exponentially with the number of models, reducing O&M efficiency.
Solution
To address these challenges, this solution adopts an integrated architecture combining Envoy Gateway, Envoy AI Gateway, and InferencePool, built on Envoy's mature, production-proven proxy technology. This architecture provides a stable, secure, highly scalable AI infrastructure layer that eliminates generative AI deployment bottlenecks.
- Standard implementation based on Kubernetes Gateway API
- Advanced traffic management: It uses HTTPRoute for weighted traffic splitting of inference services and supports progressive deployment policies such as blue-green deployment.
- In-depth protocol routing: It can identify OpenAI-compatible protocol headers and enables fine-grained traffic routing based on the model field or custom service headers.
- Native mesh compatibility: It seamlessly integrates with popular service meshes, such as Istio, for end-to-end traffic encryption and centralized governance.
- Cross-provider scalable connectivity
- Multi-source abstraction: It can connect to cloud services, such as OpenAI and AWS Bedrock, as well as enterprise-built InferencePools, providing a unified interface for calling models centrally.
- Intelligent DR: When a self-built model pool is overloaded or third-party APIs are unavailable, the system automatically switches to the standby model to ensure high service availability and service continuity.
- Enterprise-grade security and compliance
- Upstream authentication: It manages providers' API keys at the gateway layer centrally and isolates the application layer from credentials to reduce leakage risks.
- Fine-grained management and control: It supports policy-based access control and multi-dimensional rate limiting to prevent API abuse and ensure system stability and compliance.
- Comprehensive observability and scalability
- Cost and performance analytics: It traces token consumption, model usage distribution, and response latency in real time, provides key performance indicators (KPIs) for enterprises, and supports resource optimization and cost control.
- Pluggable architecture: It inherits the extension capabilities of Envoy, supports quick development of customized functions, such as request rewriting and custom filters, through plug-ins, and flexibly adapts to the ever-evolving AI technology ecosystem.
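The weighted traffic splitting described above uses the standard Kubernetes Gateway API. As an illustration only, the following is a minimal HTTPRoute sketch that shifts 10% of traffic to a canary backend. The Gateway and Service names are hypothetical placeholders, not resources defined in this solution:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: model-canary
  namespace: default
spec:
  parentRefs:
    - name: my-gateway        # assumption: an existing Gateway in this namespace
  rules:
    - backendRefs:
        - name: model-v1      # hypothetical stable inference Service
          port: 8080
          weight: 90          # 90% of requests
        - name: model-v2      # hypothetical canary inference Service
          port: 8080
          weight: 10          # 10% of requests
```

Adjusting the weights in place enables progressive blue-green or canary rollout without any client-side changes.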
For more information about Envoy AI Gateway, see Envoy AI Gateway Overview.
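The policy-based rate limiting mentioned above can be expressed with Envoy Gateway's BackendTrafficPolicy. The following is a minimal sketch; the Gateway name, header name, and limits are purely illustrative assumptions (verify the field names against your installed Envoy Gateway version):

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: llm-rate-limit
  namespace: default
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: my-gateway        # assumption: an existing Gateway
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-user-id   # hypothetical per-user header
                  type: Distinct    # a separate counter per distinct header value
          limit:
            requests: 100
            unit: Minute
```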
Prerequisites
- A Kubernetes cluster of v1.32 or later is available.
- The needed images are ready.
The needed images are listed below. Download them in advance on a machine that can access the Internet.
- Download the images.
docker pull docker.io/envoyproxy/gateway-dev:latest
docker pull docker.io/envoyproxy/ratelimit:master
docker pull docker.io/envoyproxy/ai-gateway-extproc
docker pull docker.io/envoyproxy/ai-gateway-controller
docker pull ghcr.io/llm-d/llm-d-inference-sim:v0.4.0
docker pull registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1
docker pull docker.io/envoyproxy/ai-gateway-testupstream:latest
docker pull docker.io/envoyproxy/envoy:distroless-dev
- Push the downloaded images to the SWR image repository to ensure that all nodes in the Kubernetes cluster can pull them.
For details about how to push an image, see Pushing an Image.
Procedure
- Install Envoy Gateway on a node.
- Install Helm. Helm 3.19.3 is used as an example.
curl -O https://get.helm.sh/helm-v3.19.3-linux-amd64.tar.gz
tar xvf helm-v3.19.3-linux-amd64.tar.gz
cp ./linux-amd64/helm /usr/local/bin/
helm version
If information similar to the following is displayed, the tool has been installed:
version.BuildInfo{Version:"v3.19.3", GitCommit:"0707f566a3f4ced24009ef14d67fe0ce69db****", GitTreeState:"clean", GoVersion:"go1.24.10"}
- Obtain the Helm template package.
helm pull oci://docker.io/envoyproxy/gateway-helm --version v0.0.0-latest
tar xvf gateway-helm-v0.0.0-latest.tgz
cd gateway-helm
- Modify the image information in the values.yaml file.
vi values.yaml
Replace the default image with the one that has been pushed to SWR.
docker.io/envoyproxy/gateway-dev:latest
docker.io/envoyproxy/ratelimit:master
- Prepare the Envoy Gateway configuration file.
- Create a basic configuration file named envoy-gateway-values.yaml.
# Copyright Envoy AI Gateway Authors
# SPDX-License-Identifier: Apache-2.0
# The full text of the Apache license is available in the LICENSE file at
# the root of the repo.

# This file contains the base Envoy Gateway helm values needed for AI Gateway integration.
# This is the minimal configuration that all AI Gateway deployments need.
#
# Use this file when installing Envoy Gateway with:
#   helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
#     --version v0.0.0-latest \
#     --namespace envoy-gateway-system \
#     --create-namespace \
#     -f envoy-gateway-values.yaml
#
# For additional features, combine with addon values files:
#   -f envoy-gateway-values.yaml -f examples/token_ratelimit/envoy-gateway-values-addon.yaml
#   -f envoy-gateway-values.yaml -f examples/inference-pool/envoy-gateway-values-addon.yaml

config:
  envoyGateway:
    gateway:
      controllerName: gateway.envoyproxy.io/gatewayclass-controller
    logging:
      level:
        default: info
    provider:
      type: Kubernetes
    extensionApis:
      # Not strictly required, but recommended for backward/future compatibility.
      enableEnvoyPatchPolicy: true
      # Required: Enable Backend API for AI service backends.
      enableBackend: true
    # Required: AI Gateway needs to fine-tune xDS resources generated by Envoy Gateway.
    extensionManager:
      hooks:
        xdsTranslator:
          translation:
            listener:
              includeAll: true
            route:
              includeAll: true
            cluster:
              includeAll: true
            secret:
              includeAll: true
          post:
            - Translation
            - Cluster
            - Route
      service:
        fqdn:
          # IMPORTANT: Update this to match your AI Gateway controller service
          # Format: <service-name>.<namespace>.svc.cluster.local
          # Default if you followed the installation steps above:
          hostname: ai-gateway-controller.envoy-ai-gateway-system.svc.cluster.local
          port: 1063
- Create a plug-in configuration file named envoy-gateway-values-addon.yaml.
# Copyright Envoy AI Gateway Authors
# SPDX-License-Identifier: Apache-2.0
# The full text of the Apache license is available in the LICENSE file at
# the root of the repo.

# This addon file adds InferencePool support to Envoy Gateway.
# Use this in combination with the base envoy-gateway-values.yaml:
#
#   helm upgrade -i eg oci://docker.io/envoyproxy/gateway-helm \
#     --version v0.0.0-latest \
#     --namespace envoy-gateway-system \
#     --create-namespace \
#     -f ../../manifests/envoy-gateway-values.yaml \
#     -f envoy-gateway-values-addon.yaml
#
# You can also combine with rate limiting:
#   -f ../../manifests/envoy-gateway-values.yaml \
#   -f ../token_ratelimit/envoy-gateway-values-addon.yaml \
#   -f envoy-gateway-values-addon.yaml

config:
  envoyGateway:
    extensionManager:
      # Enable InferencePool custom resource support
      backendResources:
        - group: inference.networking.k8s.io
          kind: InferencePool
          version: v1
- Install Envoy Gateway.
helm upgrade -i eg . \
  --version v0.0.0-latest \
  --namespace envoy-gateway-system \
  --create-namespace \
  -f envoy-gateway-values.yaml \
  -f envoy-gateway-values-addon.yaml
If the value of STATUS is deployed in the command output, the tool has been installed.

- Verify the deployment status on the console.
- Access the cluster console. In the navigation pane, choose Workloads. In the right pane, click the Deployments tab and check whether the status of envoy-gateway is Running.
- On the Services tab, check whether the Service associated with envoy-gateway has been created properly.
- Obtain and install the InferencePool CRD.
wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v1.0.1/manifests.yaml
kubectl apply -f manifests.yaml
- Install Envoy AI Gateway.
- Obtain and install the Envoy AI Gateway CRD.
# Obtain the Helm package of the Envoy AI Gateway CRD.
helm pull oci://docker.io/envoyproxy/ai-gateway-crds-helm --version v0.0.0-latest
# Extract the package and install the Envoy AI Gateway CRD.
tar xvf ai-gateway-crds-helm-v0.0.0-latest.tgz
cd ai-gateway-crds-helm
helm upgrade -i aieg-crd . \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system \
  --create-namespace
If the value of STATUS is deployed in the command output, the tool has been installed.

- Obtain the Helm package of the Envoy AI Gateway controller.
helm pull oci://docker.io/envoyproxy/ai-gateway-helm --version v0.0.0-latest
tar xvf ai-gateway-helm-v0.0.0-latest.tgz
cd ai-gateway-helm
- Modify the image information in the values.yaml file.
vi values.yaml
Replace the default image with the one that has been pushed to SWR.
docker.io/envoyproxy/ai-gateway-extproc
docker.io/envoyproxy/ai-gateway-controller
- Install the Envoy AI Gateway controller.
helm upgrade -i aieg . \
  --version v0.0.0-latest \
  --namespace envoy-ai-gateway-system \
  --create-namespace
If the value of STATUS is deployed in the command output, the tool has been installed.

- Verify the deployment status on the console.
- Access the cluster console. In the navigation pane, choose Workloads. In the right pane, click the Deployments tab and check whether the status of ai-gateway-controller is Running.
- On the Services tab, check whether the Service associated with ai-gateway-controller has been created properly.
- Deploy the workload and test the Gateway capability.
- Obtain and deploy the simulated vLLM model (Llama3-8b).
- Obtain the configuration file.
# vLLM simulation backend
wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.1/config/manifests/vllm/sim-deployment.yaml
# InferenceObjective
wget https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/v1.0.1/config/manifests/inferenceobjective.yaml
# InferencePool resources
wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.1/config/manifests/inferencepool-resources.yaml
- Modify the image information in the sim-deployment.yaml file.
vi sim-deployment.yaml
Replace the default image with the one that has been pushed to SWR.
ghcr.io/llm-d/llm-d-inference-sim:v0.4.0
- Modify the image information in the inferencepool-resources.yaml file.
vi inferencepool-resources.yaml
Replace the default image with the one that has been pushed to SWR.
registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1
- Obtain and deploy the simulated Mistral.
Replace the image name with the image prepared in the prerequisites.
docker.io/envoyproxy/ai-gateway-testupstream:latest
registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1
The following is a code example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-upstream
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mistral-upstream
  template:
    metadata:
      labels:
        app: mistral-upstream
    spec:
      containers:
        - name: testupstream
          image: docker.io/envoyproxy/ai-gateway-testupstream:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          env:
            - name: TESTUPSTREAM_ID
              value: test
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 1
            periodSeconds: 1
---
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: mistral
  namespace: default
spec:
  targetPorts:
    - number: 8080
  selector:
    matchLabels:
      app: mistral-upstream
  endpointPickerRef:
    name: mistral-epp
    port:
      number: 9002
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: mistral
  namespace: default
spec:
  priority: 10
  poolRef: # Bind the InferenceObjective to the InferencePool.
    name: mistral
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-epp
  namespace: default
spec:
  selector:
    app: mistral-epp
  ports:
    - protocol: TCP
      port: 9002
      targetPort: 9002
      appProtocol: http2
  type: ClusterIP
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mistral-epp
  namespace: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-epp
  namespace: default
  labels:
    app: mistral-epp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-epp
  template:
    metadata:
      labels:
        app: mistral-epp
    spec:
      serviceAccountName: mistral-epp
      # Conservatively, this timeout should mirror the longest grace period of the pods within the pool
      terminationGracePeriodSeconds: 130
      containers:
        - name: epp
          image: registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1
          imagePullPolicy: IfNotPresent
          args:
            - --pool-name
            - "mistral"
            - "--pool-namespace"
            - "default"
            - --v
            - "4"
            - --zap-encoder
            - "json"
            - --grpc-port
            - "9002"
            - --grpc-health-port
            - "9003"
            - "--config-file"
            - "/config/default-plugins.yaml"
          ports:
            - containerPort: 9002
            - containerPort: 9003
            - name: metrics
              containerPort: 9090
          livenessProbe:
            grpc:
              port: 9003
              service: inference-extension
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            grpc:
              port: 9003
              service: inference-extension
            initialDelaySeconds: 5
            periodSeconds: 10
          volumeMounts:
            - name: plugins-config-volume
              mountPath: "/config"
      volumes:
        - name: plugins-config-volume
          configMap:
            name: plugins-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: plugins-config
  namespace: default
data:
  default-plugins.yaml: |
    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
      - type: queue-scorer
      - type: kv-cache-utilization-scorer
      - type: prefix-cache-scorer
    schedulingProfiles:
      - name: default
        plugins:
          - pluginRef: queue-scorer
          - pluginRef: kv-cache-utilization-scorer
          - pluginRef: prefix-cache-scorer
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-read
  namespace: default
rules:
  - apiGroups: ["inference.networking.x-k8s.io"]
    resources: ["inferenceobjectives", "inferencepools"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["inference.networking.k8s.io"]
    resources: ["inferencepools"]
    verbs: ["get", "watch", "list"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-read-binding
  namespace: default
subjects:
  - kind: ServiceAccount
    name: mistral-epp
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-read
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: auth-reviewer
rules:
  - apiGroups:
      - authentication.k8s.io
    resources:
      - tokenreviews
    verbs:
      - create
  - apiGroups:
      - authorization.k8s.io
    resources:
      - subjectaccessreviews
    verbs:
      - create
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: auth-reviewer-binding
subjects:
  - kind: ServiceAccount
    name: mistral-epp
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: auth-reviewer
- Obtain and use an AIServiceBackend to deploy a traditional backend.
Replace the image name with the image prepared in the prerequisites.
docker.io/envoyproxy/ai-gateway-testupstream:latest
The following is a code example:
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: Backend
metadata:
  name: envoy-ai-gateway-basic-testupstream
  namespace: default
spec:
  endpoints:
    - fqdn:
        hostname: envoy-ai-gateway-basic-testupstream.default.svc.cluster.local
        port: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: envoy-ai-gateway-basic-testupstream
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: envoy-ai-gateway-basic-testupstream
  template:
    metadata:
      labels:
        app: envoy-ai-gateway-basic-testupstream
    spec:
      containers:
        - name: testupstream
          image: docker.io/envoyproxy/ai-gateway-testupstream:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          env:
            - name: TESTUPSTREAM_ID
              value: test
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 1
            periodSeconds: 1
---
apiVersion: v1
kind: Service
metadata:
  name: envoy-ai-gateway-basic-testupstream
  namespace: default
spec:
  selector:
    app: envoy-ai-gateway-basic-testupstream
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: ClusterIP
- Deploy the Gateway.
To ensure the Gateway functions properly on the intranet, customize the Envoy proxy configuration and use a NodePort Service so you can quickly verify the Gateway routing capabilities.
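The GatewayClass used in this step references an EnvoyProxy parameters resource named nodeport-config, which is not shown in this procedure. The following is a minimal sketch of such a resource, assuming the standard Envoy Gateway EnvoyProxy API (verify the field names against your installed version):

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: nodeport-config
  namespace: envoy-gateway-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        # Expose the generated Envoy Service as NodePort so the Gateway
        # is reachable from cluster nodes on the intranet.
        type: NodePort
```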
Replace the image name with the image prepared in the prerequisites.
docker.io/envoyproxy/envoy:distroless-dev
The following is a code example:
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-pool-with-aigwroute
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
  parametersRef:
    group: gateway.envoyproxy.io
    kind: EnvoyProxy
    name: nodeport-config
    namespace: envoy-gateway-system
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-pool-with-aigwroute
  namespace: default
spec:
  gatewayClassName: inference-pool-with-aigwroute
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: inference-pool-with-aigwroute
  namespace: default
spec:
  parentRefs:
    - name: inference-pool-with-aigwroute
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    # Route for vLLM Llama model via InferencePool
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: meta-llama/Llama-3.1-8B-Instruct
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: vllm-llama3-8b-instruct
    # Route for Mistral model via InferencePool
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: mistral:latest
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: mistral
    # Route for traditional backend (non-InferencePool)
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: some-cool-self-hosted-model
      backendRefs:
        - name: envoy-ai-gateway-basic-testupstream
- Verify the deployment status.
- Access the cluster console. In the navigation pane, choose Workloads. In the right pane, click the Deployments tab and check whether the statuses of all workloads are Running.

- On the Services tab, check whether the needed Services have been created properly.

- Test the Gateway's capability of routing requests to different models or backends.
- Obtain the access address of the Gateway.
- Check the Gateway's endpoint.
kubectl get gateway inference-pool-with-aigwroute -o jsonpath='{.status.addresses[0].value}'
- Combine the obtained endpoint with the NodePort Service. The format is http://[endpoint]:[NodePort]. Use this complete address for all subsequent tests.
- Test the routing capability.
- Test the Llama-3 model route.
Verify the route forwarding from the Gateway to Llama-3.
curl -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": "Hi. Say this is a test"
      }
    ]
  }' \
  http://$GATEWAY_IP/v1/chat/completions
- Test the Mistral model route.
Verify the route forwarding from the Gateway to Mistral.
curl -H "Content-Type: application/json" \
  -d '{
    "model": "mistral:latest",
    "messages": [
      {
        "role": "user",
        "content": "Hi. Say this is a test"
      }
    ]
  }' \
  http://$GATEWAY_IP/v1/chat/completions
If information similar to the following is returned, the model is working:
{"choices":[{"message":{"content":"This is a test.","role":"assistant"}}]}
- Test the common backend load balancer route.
Verify the route forwarding from the Gateway to the custom backend load balancer.
curl -H "Content-Type: application/json" \
  -d '{
    "model": "some-cool-self-hosted-model",
    "messages": [
      {
        "role": "user",
        "content": "Hi. Say this is a test"
      }
    ]
  }' \
  http://$GATEWAY_IP/v1/chat/completions
If information similar to the following is returned, the model is working:
{"choices":[{"message":{"role":"assistant","content":"I am the captain of my soul."}}]}