Updated: 2026-06-05 GMT+08:00

Routing LLM Inference Traffic with Envoy AI Gateway

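Envoy AI Gateway is a cloud-native traffic gateway add-on designed for large language model (LLM) and AI application scenarios. Built on the high-performance Envoy proxy, it serves as the data plane component for AI traffic and provides unified traffic routing, load balancing, and traffic governance for AI inference services in Kubernetes clusters.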

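With this add-on, you can proxy and manage API requests to heterogeneous LLMs (such as OpenAI, ERNIE Bot, Tongyi Qianwen, and locally deployed open-source models), which improves the reliability and governance efficiency of your AI application architecture.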

This section uses an example to show how to route requests by model name with AI Gateway.

How It Works

Envoy AI Gateway separates the control plane from the data plane:

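  • Control plane:
    • Consists of the Envoy Gateway Controller and the AI Gateway Controller.
    • Watches the AI Gateway resources that users submit (such as AIGatewayRoute) and generates configuration from them.
    • Communicates with Envoy Gateway through the Extension Server protocol and fine-tunes the xDS configuration before it is delivered to Envoy, injecting AI-specific routing logic.
  • Data plane:
    • The core component is Envoy Proxy.
    • It is typically deployed as a sidecar or as a standalone gateway.
    • It has a built-in AI Gateway ExtProc (External Processing) component, which handles AI-specific logic such as KV cache routing and load balancing across multiple models.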
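
The routing example later in this guide relies on one piece of this data-plane logic: clients name the model in the JSON request body, while the AIGatewayRoute rules match on the x-ai-eg-model header. As described by the upstream Envoy AI Gateway project, the gateway reads the model field from the body and sets that header before route matching. The function below is only a rough local stand-in for that extraction, not the gateway's implementation; extract_model is a hypothetical name, and the sed pattern assumes a simple body with a single string-valued model field.

```shell
# Illustration only: the real extraction happens inside the gateway, not in a shell script.
# Reads a JSON request body on stdin and prints the value of its "model" string field.
extract_model() {
  # Join the body into one line, then capture the quoted value that follows "model":
  tr -d '\n' | sed -n 's/.*"model"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}
```

For example, `printf '%s' '{"model": "mistral:latest", "messages": []}' | extract_model` prints mistral:latest, which is the value that the Mistral route rule in step 4 matches against.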

Prerequisites

  • A cluster of v1.32 or later has been created.
  • The Envoy Gateway add-on has been installed in the cluster, and “AI Inference Gateway” has been enabled.
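
One way to confirm the second prerequisite from the command line is to check that the custom resource types used in this guide are registered. This is a minimal sketch: check_crds is a hypothetical helper, and the CRD names in the usage note are inferred from the kinds and API groups that appear in the manifests below, so they may differ in your add-on version.

```shell
# Report whether each CustomResourceDefinition named in the arguments exists.
# Returns non-zero if at least one of them is missing.
check_crds() {
  missing=0
  for crd in "$@"; do
    if kubectl get crd "$crd" >/dev/null 2>&1; then
      echo "found:   $crd"
    else
      echo "missing: $crd"
      missing=1
    fi
  done
  return "$missing"
}
```

A possible invocation, with the assumed CRD names: `check_crds aigatewayroutes.aigateway.envoyproxy.io aiservicebackends.aigateway.envoyproxy.io inferencepools.inference.networking.k8s.io`. The cluster version itself can be read with `kubectl version`.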

Procedure

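If the nodes cannot access the public network, prepare the container images yourself and replace the image references in the manifests below.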

  1. Download and deploy a simulated vLLM model (Llama3-8b).

    # vLLM simulation backend
    wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.1/config/manifests/vllm/sim-deployment.yaml
    # InferenceObjective
    wget https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/v1.0.1/config/manifests/inferenceobjective.yaml
    # InferencePool resources
    wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.1/config/manifests/inferencepool-resources.yaml
    # Apply all resources
    kubectl apply -f .
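
If the nodes have no public network access, it helps to know which images the downloaded manifests reference before you mirror and replace them. The helper below is a sketch under simple assumptions: list_images is a hypothetical name, and it only recognizes single-line `image:` entries without trailing comments.

```shell
# Print the distinct container images referenced by the manifest files given as arguments.
list_images() {
  # Keep only "image:" lines (optionally prefixed by a list dash), strip the key
  # and any surrounding double quotes, then de-duplicate.
  grep -hE '^[[:space:]]*(-[[:space:]]+)?image:' "$@" \
    | sed -E 's/^[[:space:]]*(-[[:space:]]+)?image:[[:space:]]*//; s/^"//; s/"$//' \
    | sort -u
}
```

For example, `list_images sim-deployment.yaml inferencepool-resources.yaml` lists the images to pull into a registry that the nodes can reach.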

  2. Deploy a simulated Mistral model.

    1. Create a file named mistral-inference-deploy.yaml.
      apiVersion: v1
      kind: Service
      metadata:
        name: mistral-upstream
        namespace: default
      spec:
        selector:
          app: mistral-upstream
        ports:
          - protocol: TCP
            port: 8080
            targetPort: 8080
        # The headless service allows the IP addresses of the pods to be resolved via the Service DNS.
        clusterIP: None
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mistral-upstream
        namespace: default
      spec:
        replicas: 3
        selector:
          matchLabels:
            app: mistral-upstream
        template:
          metadata:
            labels:
              app: mistral-upstream
          spec:
            containers:
              - name: testupstream
                image: docker.io/envoyproxy/ai-gateway-testupstream:latest
                imagePullPolicy: IfNotPresent
                ports:
                  - containerPort: 8080
                env:
                  - name: TESTUPSTREAM_ID
                    value: test
                readinessProbe:
                  httpGet:
                    path: /health
                    port: 8080
                  initialDelaySeconds: 1
                  periodSeconds: 1
      ---
      apiVersion: inference.networking.k8s.io/v1
      kind: InferencePool
      metadata:
        name: mistral
        namespace: default
      spec:
        targetPorts:
          - number: 8080
        selector:
          matchLabels:
            app: mistral-upstream
        endpointPickerRef:
          name: mistral-epp
          port:
            number: 9002
      ---
      apiVersion: inference.networking.x-k8s.io/v1alpha2
      kind: InferenceObjective
      metadata:
        name: mistral
        namespace: default
      spec:
        priority: 10
        poolRef:
          # Bind the InferenceObjective to the InferencePool.
          name: mistral
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: mistral-epp
        namespace: default
      spec:
        selector:
          app: mistral-epp
        ports:
          - protocol: TCP
            port: 9002
            targetPort: 9002
            appProtocol: http2
        type: ClusterIP
      ---
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: mistral-epp
        namespace: default
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mistral-epp
        namespace: default
        labels:
          app: mistral-epp
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: mistral-epp
        template:
          metadata:
            labels:
              app: mistral-epp
          spec:
            serviceAccountName: mistral-epp
            # Conservatively, this timeout should mirror the longest grace period of the pods within the pool
            terminationGracePeriodSeconds: 130
            containers:
              - name: epp
                image: registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1
                imagePullPolicy: IfNotPresent
                args:
                  - --pool-name
                  - "mistral"
                  - "--pool-namespace"
                  - "default"
                  - --v
                  - "4"
                  - --zap-encoder
                  - "json"
                  - --grpc-port
                  - "9002"
                  - --grpc-health-port
                  - "9003"
                  - "--config-file"
                  - "/config/default-plugins.yaml"
                ports:
                  - containerPort: 9002
                  - containerPort: 9003
                  - name: metrics
                    containerPort: 9090
                livenessProbe:
                  grpc:
                    port: 9003
                    service: inference-extension
                  initialDelaySeconds: 5
                  periodSeconds: 10
                readinessProbe:
                  grpc:
                    port: 9003
                    service: inference-extension
                  initialDelaySeconds: 5
                  periodSeconds: 10
                volumeMounts:
                  - name: plugins-config-volume
                    mountPath: "/config"
            volumes:
              - name: plugins-config-volume
                configMap:
                  name: plugins-config
      ---
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: plugins-config
        namespace: default
      data:
        default-plugins.yaml: |
          apiVersion: inference.networking.x-k8s.io/v1alpha1
          kind: EndpointPickerConfig
          plugins:
          - type: queue-scorer
          - type: kv-cache-utilization-scorer
          - type: prefix-cache-scorer
          schedulingProfiles:
          - name: default
            plugins:
            - pluginRef: queue-scorer
            - pluginRef: kv-cache-utilization-scorer
            - pluginRef: prefix-cache-scorer
      ---
      kind: Role
      apiVersion: rbac.authorization.k8s.io/v1
      metadata:
        name: pod-read
        namespace: default
      rules:
        - apiGroups: ["inference.networking.x-k8s.io"]
          resources: ["inferenceobjectives", "inferencepools"]
          verbs: ["get", "watch", "list"]
        - apiGroups: ["inference.networking.k8s.io"]
          resources: ["inferencepools"]
          verbs: ["get", "watch", "list"]
        - apiGroups: [""]
          resources: ["pods"]
          verbs: ["get", "watch", "list"]
      ---
      kind: RoleBinding
      apiVersion: rbac.authorization.k8s.io/v1
      metadata:
        name: pod-read-binding
        namespace: default
      subjects:
        - kind: ServiceAccount
          name: mistral-epp
          namespace: default
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: Role
        name: pod-read
      ---
      kind: ClusterRole
      apiVersion: rbac.authorization.k8s.io/v1
      metadata:
        name: auth-reviewer
      rules:
        - apiGroups:
            - authentication.k8s.io
          resources:
            - tokenreviews
          verbs:
            - create
        - apiGroups:
            - authorization.k8s.io
          resources:
            - subjectaccessreviews
          verbs:
            - create
      ---
      kind: ClusterRoleBinding
      apiVersion: rbac.authorization.k8s.io/v1
      metadata:
        name: auth-reviewer-binding
      subjects:
        - kind: ServiceAccount
          name: mistral-epp
          namespace: default
      roleRef:
        apiGroup: rbac.authorization.k8s.io
        kind: ClusterRole
        name: auth-reviewer
    2. Deploy the simulated Mistral service.
      kubectl apply -f mistral-inference-deploy.yaml
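
Before moving on, you may want to wait until the workloads created by this manifest are ready. The sketch below wraps `kubectl rollout status`; wait_for_rollouts is a hypothetical helper, and the 120-second timeout is an arbitrary choice.

```shell
# Wait for each named Deployment in the given namespace to finish rolling out.
# Usage: wait_for_rollouts <namespace> <deployment>...
wait_for_rollouts() {
  ns="$1"
  shift
  for name in "$@"; do
    # Stop at the first Deployment that does not become ready in time.
    kubectl rollout status "deployment/$name" -n "$ns" --timeout=120s || return 1
  done
}
```

With the names used in mistral-inference-deploy.yaml, this would be `wait_for_rollouts default mistral-upstream mistral-epp`.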

  3. Deploy a traditional backend through AIServiceBackend.

    1. Create a file named backend-deployment.yaml.
      apiVersion: aigateway.envoyproxy.io/v1alpha1
      kind: AIServiceBackend
      metadata:
        name: envoy-ai-gateway-basic-testupstream
        namespace: default
      spec:
        schema:
          name: OpenAI
        backendRef:
          name: envoy-ai-gateway-basic-testupstream
          kind: Backend
          group: gateway.envoyproxy.io
      ---
      apiVersion: gateway.envoyproxy.io/v1alpha1
      kind: Backend
      metadata:
        name: envoy-ai-gateway-basic-testupstream
        namespace: default
      spec:
        endpoints:
          - fqdn:
              hostname: envoy-ai-gateway-basic-testupstream.default.svc.cluster.local
              port: 80
      ---
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: envoy-ai-gateway-basic-testupstream
        namespace: default
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: envoy-ai-gateway-basic-testupstream
        template:
          metadata:
            labels:
              app: envoy-ai-gateway-basic-testupstream
          spec:
            containers:
              - name: testupstream
                image: docker.io/envoyproxy/ai-gateway-testupstream:latest
                imagePullPolicy: IfNotPresent
                ports:
                  - containerPort: 8080
                env:
                  - name: TESTUPSTREAM_ID
                    value: test
                readinessProbe:
                  httpGet:
                    path: /health
                    port: 8080
                  initialDelaySeconds: 1
                  periodSeconds: 1
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: envoy-ai-gateway-basic-testupstream
        namespace: default
      spec:
        selector:
          app: envoy-ai-gateway-basic-testupstream
        ports:
          - protocol: TCP
            port: 80
            targetPort: 8080
        type: ClusterIP
    2. Deploy the traditional backend service.
      kubectl apply -f backend-deployment.yaml

  4. Deploy the Gateway.

    1. Create a file named ai-gateway-config.yaml.
      apiVersion: gateway.envoyproxy.io/v1alpha1
      kind: EnvoyProxy
      metadata:
        name: nodeport-config
        namespace: envoy-gateway-system
      spec:
        provider:
          type: Kubernetes
          kubernetes:
            envoyService:
              type: NodePort
            envoyDeployment:
              container:
                image: docker.io/envoyproxy/envoy:distroless-dev
      ---
      
      apiVersion: gateway.networking.k8s.io/v1
      kind: GatewayClass
      metadata:
        name: inference-pool-with-aigwroute
      spec:
        controllerName: gateway.envoyproxy.io/gatewayclass-controller
        parametersRef:
          group: gateway.envoyproxy.io
          kind: EnvoyProxy
          name: nodeport-config
          namespace: envoy-gateway-system
      ---
      apiVersion: gateway.networking.k8s.io/v1
      kind: Gateway
      metadata:
        name: inference-pool-with-aigwroute
        namespace: default
      spec:
        gatewayClassName: inference-pool-with-aigwroute
        listeners:
          - name: http
            protocol: HTTP
            port: 80
      ---
      apiVersion: aigateway.envoyproxy.io/v1alpha1
      kind: AIGatewayRoute
      metadata:
        name: inference-pool-with-aigwroute
        namespace: default
      spec:
        parentRefs:
          - name: inference-pool-with-aigwroute
            kind: Gateway
            group: gateway.networking.k8s.io
        rules:
          # Route for vLLM Llama model via InferencePool
          - matches:
              - headers:
                  - type: Exact
                    name: x-ai-eg-model
                    value: meta-llama/Llama-3.1-8B-Instruct
            backendRefs:
              - group: inference.networking.k8s.io
                kind: InferencePool
                name: vllm-llama3-8b-instruct
          # Route for Mistral model via InferencePool
          - matches:
              - headers:
                  - type: Exact
                    name: x-ai-eg-model
                    value: mistral:latest
            backendRefs:
              - group: inference.networking.k8s.io
                kind: InferencePool
                name: mistral
          # Route for traditional backend (non-InferencePool)
          - matches:
              - headers:
                  - type: Exact
                    name: x-ai-eg-model
                    value: some-cool-self-hosted-model
            backendRefs:
              - name: envoy-ai-gateway-basic-testupstream
    2. Apply the Gateway configuration.
      kubectl apply -f ai-gateway-config.yaml
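
To check from the command line that the Gateway was accepted, one option is to wait for its standard Gateway API `Programmed` condition and then display the route. This is a sketch: check_gateway is a hypothetical helper, and it assumes the Gateway and the AIGatewayRoute share one name, as they do in ai-gateway-config.yaml.

```shell
# Wait until the Gateway reports the Programmed condition, then show the AIGatewayRoute.
# Usage: check_gateway <namespace> <name>
check_gateway() {
  ns="$1"
  name="$2"
  kubectl wait --for=condition=Programmed "gateway/$name" -n "$ns" --timeout=120s &&
    kubectl get aigatewayroute "$name" -n "$ns"
}
```

For this guide the call would be `check_gateway default inference-pool-with-aigwroute`.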

  5. Verify the deployment status.

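    1. On the “Workloads” page of the target cluster, click the “Deployments” tab and confirm that all workloads are in the “Running” state.
    2. On the “Services” page of the target cluster, click the “Services” tab and confirm that the corresponding Services have been created.

      On that page, record the access address and NodePort of the corresponding Service, and join them in the format [access address]:[NodePort]. In all subsequent tests, replace `$GATEWAY_IP` with this value.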
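
The same value can also be assembled with kubectl. The sketch below makes two assumptions that are not stated in this guide: the Envoy Service generated for the Gateway lives in the envoy-gateway-system namespace, and it carries the owning-gateway labels that Envoy Gateway normally adds. envoy_node_port and gateway_address are hypothetical helper names.

```shell
# Print the NodePort of the first port of the Envoy Service created for a Gateway.
# Usage: envoy_node_port <gateway-namespace> <gateway-name>
envoy_node_port() {
  kubectl get svc -n envoy-gateway-system \
    -l "gateway.envoyproxy.io/owning-gateway-namespace=$1,gateway.envoyproxy.io/owning-gateway-name=$2" \
    -o jsonpath='{.items[0].spec.ports[0].nodePort}'
}

# Join an access address and a port in the [access address]:[NodePort] format.
# Usage: gateway_address <address> <port>
gateway_address() {
  printf '%s:%s\n' "$1" "$2"
}
```

For example, with a node address of your choice in NODE_IP: `export GATEWAY_IP="$(gateway_address "$NODE_IP" "$(envoy_node_port default inference-pool-with-aigwroute)")"`.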

  6. Test that the Gateway routes requests to the different models and backends.

    1. Test routing to the Llama-3 model.

      Run the following command to verify that the gateway forwards requests to the Llama3 model.

      curl -H "Content-Type: application/json" \
        -d '{
              "model": "meta-llama/Llama-3.1-8B-Instruct",
              "messages": [
                  {
                      "role": "user",
                      "content": "Hi. Say this is a test"
                  }
              ]
          }' \
        http://$GATEWAY_IP/v1/chat/completions

      If the request succeeds, the response contains a content field, which indicates that the model is running.

      {"choices":[{"finish_reason":"stop","index":0,"message":{"content":"The temperature there is twenty-five degrees centigrade. Give a man a fish and you feed him for a day; Teach a man to fish","role":"assistant"}}],"created":1767755896,"do_remote_decode":false,"do_remote_prefill":false,"id":"chatcmp-561ca69e-9716-411f-9656-7a96d9******","model":"meta-llama/llama-3.1-8B-Instruct","object":"chat.completion","remote_block_id":"","remote_engine_id":"","remote_host":"","remote_port":0,"usage":{"completion_tokens":28,"prompt_tokens":7,"total_tokens":35}}
    2. Test routing to the Mistral model.

      Run the following command to verify that the gateway forwards requests to Mistral.

      curl -H "Content-Type: application/json" \
        -d '{
              "model": "mistral:latest",
              "messages": [
                  {
                      "role": "user",
                      "content": "Hi. Say this is a test"
                  }
              ]
          }' \
        http://$GATEWAY_IP/v1/chat/completions

      If the request succeeds, the response contains a content field, which indicates that the model is running.

      {"choices":[{"message":{"content":"This is a test.","role":"assistant"}}]}
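    3. Test routing to the traditional backend.

      Run the following command to verify that the gateway forwards requests to the custom backend workload.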

      curl -H "Content-Type: application/json" \
        -d '{
              "model": "some-cool-self-hosted-model",
              "messages": [
                  {
                      "role": "user",
                      "content": "Hi. Say this is a test"
                  }
              ]
          }' \
        http://$GATEWAY_IP/v1/chat/completions

      If the request succeeds, the response contains a content field, which indicates that the backend is running.

      {"choices":[{"message":{"role":"assistant","content":"I am the captain of my soul."}}]}
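
The three requests above differ only in the model name, so they can be run in one pass. The sketch below reuses `$GATEWAY_IP` and the request body from this guide; chat_body, test_route, and test_all_routes are hypothetical helper names, and a model name that contains double quotes or backslashes would need JSON escaping that this sketch does not do.

```shell
# Build an OpenAI-style chat completion request body for the model named in $1.
chat_body() {
  printf '{"model": "%s", "messages": [{"role": "user", "content": "Hi. Say this is a test"}]}' "$1"
}

# Send the request for model $1 through the gateway and print only the HTTP status code.
test_route() {
  curl -s -o /dev/null -w '%{http_code}' \
    -H "Content-Type: application/json" \
    -d "$(chat_body "$1")" \
    "http://$GATEWAY_IP/v1/chat/completions"
}

# Exercise the three route rules defined in ai-gateway-config.yaml.
test_all_routes() {
  for model in "meta-llama/Llama-3.1-8B-Instruct" "mistral:latest" "some-cool-self-hosted-model"; do
    echo "$model -> HTTP $(test_route "$model")"
  done
}
```

After setting `GATEWAY_IP`, run `test_all_routes`. A status of 200 on each line suggests that the matching rule and its backend are working; any other status points to the route or backend to inspect.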
