
Updated: 2026-06-05 GMT+08:00

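Routing LLM Inference Traffic with Envoy AI Gateway

Envoy AI Gateway is a cloud-native traffic gateway add-on designed for large language model (LLM) and AI application scenarios. Built on the high-performance Envoy proxy, it serves as the data plane component for AI traffic and provides unified traffic routing, load balancing, and traffic governance for AI inference services in Kubernetes clusters.

With this add-on, you can proxy and manage API requests to heterogeneous LLMs (such as OpenAI, ERNIE Bot, Tongyi Qianwen, and locally deployed open-source LLMs) through a single entry point, which improves the reliability and governance efficiency of your AI application architecture.

This section uses an example to show how to use AI Gateway to route requests based on the model name.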

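How It Works

Envoy AI Gateway separates the control plane from the data plane:

Control plane:
- Consists of the Envoy Gateway Controller and the AI Gateway Controller.
- Watches the AI Gateway resources that users submit (such as AIGatewayRoute) and generates configuration from them.
- Communicates with Envoy Gateway over the Extension Server protocol to fine-tune the xDS configuration before it is delivered to Envoy, injecting AI-specific routing logic.
Data plane:
- The core component is Envoy Proxy.
- It is typically deployed as a sidecar or as a standalone gateway.
- It has the built-in AI Gateway ExtProc (External Processing), which handles AI-specific logic such as KV cache routing and multi-model load balancing.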

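Prerequisites

A cluster of v1.32 or later has been created.
The Envoy Gateway add-on has been installed in the cluster, with the "AI Inference Gateway" option enabled.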

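Procedure

If the nodes cannot access the Internet, prepare the container images yourself and replace the image addresses in the manifests used in this procedure.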
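
As a sketch of the image replacement mentioned in the note above, the following shell function rewrites the registry prefix of the image lines in a manifest. The mirror address registry.example.com/mirror and the function name rewrite_images are hypothetical; substitute the registry that actually hosts your copies of the images. Only the docker.io and registry.k8s.io prefixes are handled here, so images hosted elsewhere would need their own pattern.

```shell
# Hypothetical private registry that holds copies of the required images.
MIRROR="registry.example.com/mirror"

# Print a manifest with docker.io/ and registry.k8s.io/ image prefixes
# replaced by the mirror prefix. The input file itself is not modified.
rewrite_images() {
  sed -E "s#^([[:space:]-]*image:[[:space:]]*)(docker\.io|registry\.k8s\.io)/#\1${MIRROR}/#" "$1"
}
```

For example, running rewrite_images mistral-inference-deploy.yaml > mistral-inference-deploy.local.yaml writes a rewritten copy, which can then be applied in place of the original file.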

Download and deploy the simulated vLLM model (Llama3-8b).

# Run these commands in an empty working directory, because
# "kubectl apply -f ." applies every manifest in the current directory.
# vLLM simulation backend
wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.1/config/manifests/vllm/sim-deployment.yaml
# InferenceObjective
wget https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/v1.0.1/config/manifests/inferenceobjective.yaml
# InferencePool resources
wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.1/config/manifests/inferencepool-resources.yaml
# Apply all downloaded resources
kubectl apply -f .

Create and deploy the simulated Mistral model.

Create a file named mistral-inference-deploy.yaml with the following content.

apiVersion: v1
kind: Service
metadata:
  name: mistral-upstream
  namespace: default
spec:
  selector:
    app: mistral-upstream
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
  # The headless service allows the IP addresses of the pods to be resolved via the Service DNS.
  clusterIP: None
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-upstream
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mistral-upstream
  template:
    metadata:
      labels:
        app: mistral-upstream
    spec:
      containers:
        - name: testupstream
          image: docker.io/envoyproxy/ai-gateway-testupstream:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          env:
            - name: TESTUPSTREAM_ID
              value: test
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 1
            periodSeconds: 1
---
apiVersion: inference.networking.k8s.io/v1
kind: InferencePool
metadata:
  name: mistral
  namespace: default
spec:
  targetPorts:
    - number: 8080
  selector:
    matchLabels:
      app: mistral-upstream
  endpointPickerRef:
    name: mistral-epp
    port:
      number: 9002
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: mistral
  namespace: default
spec:
  priority: 10
  poolRef:
    # Bind the InferenceObjective to the InferencePool.
    name: mistral
---
apiVersion: v1
kind: Service
metadata:
  name: mistral-epp
  namespace: default
spec:
  selector:
    app: mistral-epp
  ports:
    - protocol: TCP
      port: 9002
      targetPort: 9002
      appProtocol: http2
  type: ClusterIP
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: mistral-epp
  namespace: default
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-epp
  namespace: default
  labels:
    app: mistral-epp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-epp
  template:
    metadata:
      labels:
        app: mistral-epp
    spec:
      serviceAccountName: mistral-epp
      # Conservatively, this timeout should mirror the longest grace period of the pods within the pool
      terminationGracePeriodSeconds: 130
      containers:
        - name: epp
          image: registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1
          imagePullPolicy: IfNotPresent
          args:
            - --pool-name
            - "mistral"
            - "--pool-namespace"
            - "default"
            - --v
            - "4"
            - --zap-encoder
            - "json"
            - --grpc-port
            - "9002"
            - --grpc-health-port
            - "9003"
            - "--config-file"
            - "/config/default-plugins.yaml"
          ports:
            - containerPort: 9002
            - containerPort: 9003
            - name: metrics
              containerPort: 9090
          livenessProbe:
            grpc:
              port: 9003
              service: inference-extension
            initialDelaySeconds: 5
            periodSeconds: 10
          readinessProbe:
            grpc:
              port: 9003
              service: inference-extension
            initialDelaySeconds: 5
            periodSeconds: 10
          volumeMounts:
            - name: plugins-config-volume
              mountPath: "/config"
      volumes:
        - name: plugins-config-volume
          configMap:
            name: plugins-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: plugins-config
  namespace: default
data:
  default-plugins.yaml: |
    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
    - type: queue-scorer
    - type: kv-cache-utilization-scorer
    - type: prefix-cache-scorer
    schedulingProfiles:
    - name: default
      plugins:
      - pluginRef: queue-scorer
      - pluginRef: kv-cache-utilization-scorer
      - pluginRef: prefix-cache-scorer
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-read
  namespace: default
rules:
  - apiGroups: ["inference.networking.x-k8s.io"]
    resources: ["inferenceobjectives", "inferencepools"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["inference.networking.k8s.io"]
    resources: ["inferencepools"]
    verbs: ["get", "watch", "list"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "watch", "list"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-read-binding
  namespace: default
subjects:
  - kind: ServiceAccount
    name: mistral-epp
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: pod-read
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: auth-reviewer
rules:
  - apiGroups:
      - authentication.k8s.io
    resources:
      - tokenreviews
    verbs:
      - create
  - apiGroups:
      - authorization.k8s.io
    resources:
      - subjectaccessreviews
    verbs:
      - create
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: auth-reviewer-binding
subjects:
  - kind: ServiceAccount
    name: mistral-epp
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: auth-reviewer

Deploy the simulated Mistral service.

kubectl apply -f mistral-inference-deploy.yaml

Deploy a traditional backend using AIServiceBackend.

Create a file named backend-deployment.yaml with the following content.

apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: envoy-ai-gateway-basic-testupstream
  namespace: default
spec:
  schema:
    name: OpenAI
  backendRef:
    name: envoy-ai-gateway-basic-testupstream
    kind: Backend
    group: gateway.envoyproxy.io
---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: Backend
metadata:
  name: envoy-ai-gateway-basic-testupstream
  namespace: default
spec:
  endpoints:
    - fqdn:
        hostname: envoy-ai-gateway-basic-testupstream.default.svc.cluster.local
        port: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: envoy-ai-gateway-basic-testupstream
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: envoy-ai-gateway-basic-testupstream
  template:
    metadata:
      labels:
        app: envoy-ai-gateway-basic-testupstream
    spec:
      containers:
        - name: testupstream
          image: docker.io/envoyproxy/ai-gateway-testupstream:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
          env:
            - name: TESTUPSTREAM_ID
              value: test
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 1
            periodSeconds: 1
---
apiVersion: v1
kind: Service
metadata:
  name: envoy-ai-gateway-basic-testupstream
  namespace: default
spec:
  selector:
    app: envoy-ai-gateway-basic-testupstream
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
  type: ClusterIP

Deploy the traditional backend service.

kubectl apply -f backend-deployment.yaml

Deploy the Gateway.

Create a file named ai-gateway-config.yaml with the following content.

apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
  name: nodeport-config
  namespace: envoy-gateway-system
spec:
  provider:
    type: Kubernetes
    kubernetes:
      envoyService:
        type: NodePort
      envoyDeployment:
        container:
          image: docker.io/envoyproxy/envoy:distroless-dev
---

apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: inference-pool-with-aigwroute
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller
  parametersRef:
    group: gateway.envoyproxy.io
    kind: EnvoyProxy
    name: nodeport-config
    namespace: envoy-gateway-system
---
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-pool-with-aigwroute
  namespace: default
spec:
  gatewayClassName: inference-pool-with-aigwroute
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: inference-pool-with-aigwroute
  namespace: default
spec:
  parentRefs:
    - name: inference-pool-with-aigwroute
      kind: Gateway
      group: gateway.networking.k8s.io
  rules:
    # Route for vLLM Llama model via InferencePool
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: meta-llama/Llama-3.1-8B-Instruct
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: vllm-llama3-8b-instruct
    # Route for Mistral model via InferencePool
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: mistral:latest
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: mistral
    # Route for traditional backend (non-InferencePool)
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: some-cool-self-hosted-model
      backendRefs:
        - name: envoy-ai-gateway-basic-testupstream

Apply the Gateway configuration.
```
kubectl apply -f ai-gateway-config.yaml
```
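
The routing rules in ai-gateway-config.yaml can be read as a lookup from model name to backend. In upstream Envoy AI Gateway, the data plane reads the model field from the request body and exposes it as the x-ai-eg-model header that the rules match on, which is why the test requests in this section do not set that header themselves. The shell functions below only illustrate that lookup and are not the gateway's implementation: the function names are hypothetical, and the sed expression is a naive stand-in for real JSON parsing.

```shell
# Extract the value of a "model" field from a JSON body on stdin.
# Naive pattern match for illustration only; this is not a JSON parser.
extract_model() {
  sed -n 's/.*"model"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Mirror the three AIGatewayRoute rules: exact match on the model name.
route_for_model() {
  case "$1" in
    "meta-llama/Llama-3.1-8B-Instruct") echo "InferencePool/vllm-llama3-8b-instruct" ;;
    "mistral:latest")                   echo "InferencePool/mistral" ;;
    "some-cool-self-hosted-model")      echo "AIServiceBackend/envoy-ai-gateway-basic-testupstream" ;;
    *)                                  echo "no matching rule" ;;
  esac
}

body='{"model": "mistral:latest", "messages": []}'
route_for_model "$(printf '%s' "$body" | extract_model)"
```

With the sample body, the last command prints InferencePool/mistral, which corresponds to the second rule of the AIGatewayRoute.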

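Verify the deployment status.
1. On the Deployments tab of the Workloads page of the target cluster, confirm that all workloads are in the Running state.
2. On the Services tab of the Services page of the target cluster, confirm that the corresponding Services have been created.
  On this page, obtain and record the access address and NodePort of the corresponding Service, and join them in the format [access address]:[NodePort]. In all subsequent tests, replace `$GATEWAY_IP` in the commands with this value.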
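
A minimal sketch of composing this value in a shell session follows. The address 192.0.2.10 and port 30080 are placeholders rather than values from this procedure; use the access address and NodePort that you recorded on the console.

```shell
# Placeholder values: replace with the address and NodePort you recorded.
ACCESS_ADDRESS="192.0.2.10"
NODE_PORT="30080"

# Join them as [access address]:[NodePort] and export for the test commands.
export GATEWAY_IP="${ACCESS_ADDRESS}:${NODE_PORT}"
echo "http://${GATEWAY_IP}/v1/chat/completions"
```

With the variable exported, the curl commands in the tests can be run as written, because the shell expands $GATEWAY_IP in the URL.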

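Test that the Gateway routes requests to the different models and backends.

Test routing to the Llama 3 model.

Run the following command to verify that the gateway forwards requests to the Llama 3 model.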

curl -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {
                "role": "user",
                "content": "Hi. Say this is a test"
            }
        ]
    }' \
  http://$GATEWAY_IP/v1/chat/completions

If routing works, the response contains a content field, which indicates that the model is running. The content is generated by the simulated backend and may differ from this sample.

{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"The temperature there is twenty-five degrees centigrade. Give a man a fish and you feed him for a day; Teach a man to fish","role":"assistant"}}],"created":1767755896,"do_remote_decode":false,"do_remote_prefill":false,"id":"chatcmp-561ca69e-9716-411f-9656-7a96d9******","model":"meta-llama/llama-3.1-8B-Instruct","object":"chat.completion","remote_block_id":"","remote_engine_id":"","remote_host":"","remote_port":0,"usage":{"completion_tokens":28,"prompt_tokens":7,"total_tokens":35}}

Test routing to the Mistral model.

Run the following command to verify that the gateway forwards requests to the Mistral model.

curl -H "Content-Type: application/json" \
  -d '{
        "model": "mistral:latest",
        "messages": [
            {
                "role": "user",
                "content": "Hi. Say this is a test"
            }
        ]
    }' \
  http://$GATEWAY_IP/v1/chat/completions

If routing works, the response contains a content field, which indicates that the model is running.

{"choices":[{"message":{"content":"This is a test.","role":"assistant"}}]}

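Test routing to the traditional backend.

Run the following command to verify that the gateway forwards requests to the custom backend workload.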

curl -H "Content-Type: application/json" \
  -d '{
        "model": "some-cool-self-hosted-model",
        "messages": [
            {
                "role": "user",
                "content": "Hi. Say this is a test"
            }
        ]
    }' \
  http://$GATEWAY_IP/v1/chat/completions

If routing works, the response contains a content field, which indicates that the backend is running.

{"choices":[{"message":{"role":"assistant","content":"I am the captain of my soul."}}]}
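
The three test requests differ only in the model name, so they can be wrapped in one helper. The sketch below assumes that GATEWAY_IP is already set to the [access address]:[NodePort] value recorded during verification; the function names chat_payload and test_route are hypothetical.

```shell
# Build the chat completion request body for a given model name.
chat_payload() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "Hi. Say this is a test"}]}' "$1"
}

# Send the request through the gateway and print the response body
# followed by the HTTP status code.
test_route() {
  curl -sS -H "Content-Type: application/json" \
    -d "$(chat_payload "$1")" \
    -w '\nHTTP %{http_code}\n' \
    "http://${GATEWAY_IP}/v1/chat/completions"
}
```

Calling test_route "mistral:latest" sends the same request as the Mistral test above. Printing the status code makes it easier to tell a routing failure apart from a backend error.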
