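Routing LLM Inference Traffic with Envoy AI Gateway
Envoy AI Gateway is a cloud-native traffic gateway add-on designed for large language model (LLM) and AI application scenarios. Built on the high-performance Envoy proxy, it serves as the data-plane component for AI traffic and provides unified traffic routing, load balancing, and traffic governance for AI inference services in Kubernetes clusters.
With this add-on, you can proxy and manage API requests to heterogeneous large models (such as OpenAI, ERNIE Bot, Tongyi Qianwen, and locally deployed open-source models), which improves the reliability and governance efficiency of your AI application architecture.
This article walks through an example that uses the AI Gateway to route requests by model name.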
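How it works
Envoy AI Gateway separates the control plane from the data plane:
- Control plane:
  - Consists of the Envoy Gateway Controller and the AI Gateway Controller.
  - Watches the AI Gateway resources that users submit (such as AIGatewayRoute) and generates configuration from them.
  - Communicates with Envoy Gateway through the Extension Server protocol to fine-tune the xDS configuration before it is delivered to Envoy, injecting AI-specific routing logic.
- Data plane:
  - The core component is Envoy Proxy.
  - It is usually deployed in Sidecar mode or as a standalone gateway.
  - It includes the AI Gateway ExtProc (External Processing), which handles AI-specific logic such as KV cache routing and multi-model load balancing.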
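The data-plane behaviour that makes model-name routing possible can be sketched in a few lines of shell. The curl examples later in this article send the model name only in the JSON body, while the route rules match on the `x-ai-eg-model` header, so the gateway has to derive that header from the body. The snippet below illustrates that idea only; it is not the ExtProc's actual implementation. The function name `model_header` is hypothetical, and the `sed`-based parsing is a deliberately naive stand-in for a real JSON parser.

```shell
#!/bin/sh
# Hypothetical helper: derive the routing header from an OpenAI-style request body.
# Naive parsing for illustration only; assumes a top-level "model" string field.
model_header() {
  body="$1"
  model=$(printf '%s' "$body" \
    | sed -n 's/.*"model"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
    | head -n 1)
  if [ -z "$model" ]; then
    echo "no model field found" >&2
    return 1
  fi
  # This is the header name that the AIGatewayRoute rules in this article match on.
  printf 'x-ai-eg-model: %s\n' "$model"
}
```

For example, `model_header '{"model": "mistral:latest", "messages": []}'` prints the header line that the Mistral route rule would then match.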
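Prerequisites
- A cluster running v1.32 or later has been created.
- The Envoy Gateway add-on is installed in the cluster, and "AI Inference Gateway" is enabled.
Procedure
If the nodes cannot access the public internet, prepare the images yourself and replace the image references in the manifests below.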
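If the images have already been mirrored to a private registry, a small helper can rewrite the `image:` fields in a manifest before it is applied. This is a minimal sketch under stated assumptions: `rewrite_registry` is a hypothetical helper, and `registry.example.com/mirror` is a placeholder for your own registry path.

```shell
#!/bin/sh
# Hypothetical helper: replace a registry prefix on every "image:" line of a manifest.
# Usage: rewrite_registry <file> <old-prefix> <new-prefix>
# Note: <old-prefix> is treated as a sed pattern, so "." matches any character.
rewrite_registry() {
  file="$1"
  old="$2"
  new="$3"
  # Use '|' as the sed delimiter because registry paths contain '/'.
  sed "s|\(image:[[:space:]]*\)$old|\1$new|" "$file" > "$file.tmp" \
    && mv "$file.tmp" "$file"
}
```

For instance, once `mistral-inference-deploy.yaml` has been created in the steps below, `rewrite_registry mistral-inference-deploy.yaml docker.io registry.example.com/mirror` would point the test upstream image at the mirror. Review the result before applying it.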
- Download and deploy the simulated vLLM model (Llama3-8b).
      # vLLM simulation backend
      wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.1/config/manifests/vllm/sim-deployment.yaml
      # InferenceObjective
      wget https://raw.githubusercontent.com/kubernetes-sigs/gateway-api-inference-extension/refs/tags/v1.0.1/config/manifests/inferenceobjective.yaml
      # InferencePool resources
      wget https://github.com/kubernetes-sigs/gateway-api-inference-extension/raw/v1.0.1/config/manifests/inferencepool-resources.yaml
      # Apply all manifests in the current directory
      kubectl apply -f .
- Create and deploy the simulated Mistral model.
  - Create the mistral-inference-deploy.yaml file.
        apiVersion: v1
        kind: Service
        metadata:
          name: mistral-upstream
          namespace: default
        spec:
          selector:
            app: mistral-upstream
          ports:
            - protocol: TCP
              port: 8080
              targetPort: 8080
          # The headless service allows the IP addresses of the pods to be resolved via the Service DNS.
          clusterIP: None
        ---
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: mistral-upstream
          namespace: default
        spec:
          replicas: 3
          selector:
            matchLabels:
              app: mistral-upstream
          template:
            metadata:
              labels:
                app: mistral-upstream
            spec:
              containers:
                - name: testupstream
                  image: docker.io/envoyproxy/ai-gateway-testupstream:latest
                  imagePullPolicy: IfNotPresent
                  ports:
                    - containerPort: 8080
                  env:
                    - name: TESTUPSTREAM_ID
                      value: test
                  readinessProbe:
                    httpGet:
                      path: /health
                      port: 8080
                    initialDelaySeconds: 1
                    periodSeconds: 1
        ---
        apiVersion: inference.networking.k8s.io/v1
        kind: InferencePool
        metadata:
          name: mistral
          namespace: default
        spec:
          targetPorts:
            - number: 8080
          selector:
            matchLabels:
              app: mistral-upstream
          endpointPickerRef:
            name: mistral-epp
            port:
              number: 9002
        ---
        apiVersion: inference.networking.x-k8s.io/v1alpha2
        kind: InferenceObjective
        metadata:
          name: mistral
          namespace: default
        spec:
          priority: 10
          poolRef:
            # Bind the InferenceObjective to the InferencePool.
            name: mistral
        ---
        apiVersion: v1
        kind: Service
        metadata:
          name: mistral-epp
          namespace: default
        spec:
          selector:
            app: mistral-epp
          ports:
            - protocol: TCP
              port: 9002
              targetPort: 9002
              appProtocol: http2
          type: ClusterIP
        ---
        apiVersion: v1
        kind: ServiceAccount
        metadata:
          name: mistral-epp
          namespace: default
        ---
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: mistral-epp
          namespace: default
          labels:
            app: mistral-epp
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: mistral-epp
          template:
            metadata:
              labels:
                app: mistral-epp
            spec:
              serviceAccountName: mistral-epp
              # Conservatively, this timeout should mirror the longest grace period of the pods within the pool
              terminationGracePeriodSeconds: 130
              containers:
                - name: epp
                  image: registry.k8s.io/gateway-api-inference-extension/epp:v1.0.1
                  imagePullPolicy: IfNotPresent
                  args:
                    - --pool-name
                    - "mistral"
                    - "--pool-namespace"
                    - "default"
                    - --v
                    - "4"
                    - --zap-encoder
                    - "json"
                    - --grpc-port
                    - "9002"
                    - --grpc-health-port
                    - "9003"
                    - "--config-file"
                    - "/config/default-plugins.yaml"
                  ports:
                    - containerPort: 9002
                    - containerPort: 9003
                    - name: metrics
                      containerPort: 9090
                  livenessProbe:
                    grpc:
                      port: 9003
                      service: inference-extension
                    initialDelaySeconds: 5
                    periodSeconds: 10
                  readinessProbe:
                    grpc:
                      port: 9003
                      service: inference-extension
                    initialDelaySeconds: 5
                    periodSeconds: 10
                  volumeMounts:
                    - name: plugins-config-volume
                      mountPath: "/config"
              volumes:
                - name: plugins-config-volume
                  configMap:
                    name: plugins-config
        ---
        apiVersion: v1
        kind: ConfigMap
        metadata:
          name: plugins-config
          namespace: default
        data:
          default-plugins.yaml: |
            apiVersion: inference.networking.x-k8s.io/v1alpha1
            kind: EndpointPickerConfig
            plugins:
            - type: queue-scorer
            - type: kv-cache-utilization-scorer
            - type: prefix-cache-scorer
            schedulingProfiles:
            - name: default
              plugins:
              - pluginRef: queue-scorer
              - pluginRef: kv-cache-utilization-scorer
              - pluginRef: prefix-cache-scorer
        ---
        kind: Role
        apiVersion: rbac.authorization.k8s.io/v1
        metadata:
          name: pod-read
          namespace: default
        rules:
          - apiGroups: ["inference.networking.x-k8s.io"]
            resources: ["inferenceobjectives", "inferencepools"]
            verbs: ["get", "watch", "list"]
          - apiGroups: ["inference.networking.k8s.io"]
            resources: ["inferencepools"]
            verbs: ["get", "watch", "list"]
          - apiGroups: [""]
            resources: ["pods"]
            verbs: ["get", "watch", "list"]
        ---
        kind: RoleBinding
        apiVersion: rbac.authorization.k8s.io/v1
        metadata:
          name: pod-read-binding
          namespace: default
        subjects:
          - kind: ServiceAccount
            name: mistral-epp
            namespace: default
        roleRef:
          apiGroup: rbac.authorization.k8s.io
          kind: Role
          name: pod-read
        ---
        kind: ClusterRole
        apiVersion: rbac.authorization.k8s.io/v1
        metadata:
          name: auth-reviewer
        rules:
          - apiGroups:
              - authentication.k8s.io
            resources:
              - tokenreviews
            verbs:
              - create
          - apiGroups:
              - authorization.k8s.io
            resources:
              - subjectaccessreviews
            verbs:
              - create
        ---
        kind: ClusterRoleBinding
        apiVersion: rbac.authorization.k8s.io/v1
        metadata:
          name: auth-reviewer-binding
        subjects:
          - kind: ServiceAccount
            name: mistral-epp
            namespace: default
        roleRef:
          apiGroup: rbac.authorization.k8s.io
          kind: ClusterRole
          name: auth-reviewer
  - Deploy the simulated Mistral service.
        kubectl apply -f mistral-inference-deploy.yaml
- Create and deploy a traditional backend using AIServiceBackend.
  - Create the backend-deployment.yaml file.
        apiVersion: aigateway.envoyproxy.io/v1alpha1
        kind: AIServiceBackend
        metadata:
          name: envoy-ai-gateway-basic-testupstream
          namespace: default
        spec:
          schema:
            name: OpenAI
          backendRef:
            name: envoy-ai-gateway-basic-testupstream
            kind: Backend
            group: gateway.envoyproxy.io
        ---
        apiVersion: gateway.envoyproxy.io/v1alpha1
        kind: Backend
        metadata:
          name: envoy-ai-gateway-basic-testupstream
          namespace: default
        spec:
          endpoints:
            - fqdn:
                hostname: envoy-ai-gateway-basic-testupstream.default.svc.cluster.local
                port: 80
        ---
        apiVersion: apps/v1
        kind: Deployment
        metadata:
          name: envoy-ai-gateway-basic-testupstream
          namespace: default
        spec:
          replicas: 1
          selector:
            matchLabels:
              app: envoy-ai-gateway-basic-testupstream
          template:
            metadata:
              labels:
                app: envoy-ai-gateway-basic-testupstream
            spec:
              containers:
                - name: testupstream
                  image: docker.io/envoyproxy/ai-gateway-testupstream:latest
                  imagePullPolicy: IfNotPresent
                  ports:
                    - containerPort: 8080
                  env:
                    - name: TESTUPSTREAM_ID
                      value: test
                  readinessProbe:
                    httpGet:
                      path: /health
                      port: 8080
                    initialDelaySeconds: 1
                    periodSeconds: 1
        ---
        apiVersion: v1
        kind: Service
        metadata:
          name: envoy-ai-gateway-basic-testupstream
          namespace: default
        spec:
          selector:
            app: envoy-ai-gateway-basic-testupstream
          ports:
            - protocol: TCP
              port: 80
              targetPort: 8080
          type: ClusterIP
  - Deploy the traditional backend service.
        kubectl apply -f backend-deployment.yaml
- Deploy the Gateway.
  - Create the ai-gateway-config.yaml file.
        apiVersion: gateway.envoyproxy.io/v1alpha1
        kind: EnvoyProxy
        metadata:
          name: nodeport-config
          namespace: envoy-gateway-system
        spec:
          provider:
            type: Kubernetes
            kubernetes:
              envoyService:
                type: NodePort
              envoyDeployment:
                container:
                  image: docker.io/envoyproxy/envoy:distroless-dev
        ---
        apiVersion: gateway.networking.k8s.io/v1
        kind: GatewayClass
        metadata:
          name: inference-pool-with-aigwroute
        spec:
          controllerName: gateway.envoyproxy.io/gatewayclass-controller
          parametersRef:
            group: gateway.envoyproxy.io
            kind: EnvoyProxy
            name: nodeport-config
            namespace: envoy-gateway-system
        ---
        apiVersion: gateway.networking.k8s.io/v1
        kind: Gateway
        metadata:
          name: inference-pool-with-aigwroute
          namespace: default
        spec:
          gatewayClassName: inference-pool-with-aigwroute
          listeners:
            - name: http
              protocol: HTTP
              port: 80
        ---
        apiVersion: aigateway.envoyproxy.io/v1alpha1
        kind: AIGatewayRoute
        metadata:
          name: inference-pool-with-aigwroute
          namespace: default
        spec:
          parentRefs:
            - name: inference-pool-with-aigwroute
              kind: Gateway
              group: gateway.networking.k8s.io
          rules:
            # Route for vLLM Llama model via InferencePool
            - matches:
                - headers:
                    - type: Exact
                      name: x-ai-eg-model
                      value: meta-llama/Llama-3.1-8B-Instruct
              backendRefs:
                - group: inference.networking.k8s.io
                  kind: InferencePool
                  name: vllm-llama3-8b-instruct
            # Route for Mistral model via InferencePool
            - matches:
                - headers:
                    - type: Exact
                      name: x-ai-eg-model
                      value: mistral:latest
              backendRefs:
                - group: inference.networking.k8s.io
                  kind: InferencePool
                  name: mistral
            # Route for traditional backend (non-InferencePool)
            - matches:
                - headers:
                    - type: Exact
                      name: x-ai-eg-model
                      value: some-cool-self-hosted-model
              backendRefs:
                - name: envoy-ai-gateway-basic-testupstream
  - Deploy the Gateway.
        kubectl apply -f ai-gateway-config.yaml
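To make the effect of the three `Exact` header matches in the AIGatewayRoute concrete, the rules can be mirrored as a plain lookup. This is an illustration only: `backend_for_model` is a hypothetical name, the real matching happens inside Envoy, and the last backend is labelled as an AIServiceBackend on the assumption that a `backendRefs` entry without a `kind` refers to the AIServiceBackend created earlier.

```shell
#!/bin/sh
# Illustration only: mirrors the three Exact matches on x-ai-eg-model.
# backend_for_model is a hypothetical name, not part of any CLI.
backend_for_model() {
  case "$1" in
    "meta-llama/Llama-3.1-8B-Instruct")
      echo "InferencePool/vllm-llama3-8b-instruct" ;;
    "mistral:latest")
      echo "InferencePool/mistral" ;;
    "some-cool-self-hosted-model")
      echo "AIServiceBackend/envoy-ai-gateway-basic-testupstream" ;;
    *)
      # Exact matching is case-sensitive, so any other spelling falls through.
      echo "no matching rule" >&2
      return 1 ;;
  esac
}
```

Because the match type is `Exact`, a request whose `model` field differs in any way from the configured value (for example `Mistral:latest`) would not match any of these rules.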
- Verify the deployment status.
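The source does not list the verification commands or show how `GATEWAY_IP` is obtained, so the following is only a sketch. It assumes the resources live in the `default` namespace as in the manifests above, that the generated Envoy Service is created in `envoy-gateway-system`, and that it carries the `gateway.envoyproxy.io/owning-gateway-*` labels. The function names `check_deployment` and `gateway_endpoint` are hypothetical.

```shell
#!/bin/sh
# Sketch only: names come from the manifests in this article; the Service
# namespace and labels are assumptions to adjust for your cluster.

# Print the status of the workloads and gateway resources created above.
check_deployment() {
  kubectl get pods -n default
  kubectl get inferencepool -n default
  kubectl get gateway inference-pool-with-aigwroute -n default
  kubectl get aigatewayroute inference-pool-with-aigwroute -n default
}

# Print "<node-ip>:<node-port>" for the NodePort Service that fronts the Gateway.
gateway_endpoint() {
  selector="gateway.envoyproxy.io/owning-gateway-namespace=default,gateway.envoyproxy.io/owning-gateway-name=inference-pool-with-aigwroute"
  node_ip=$(kubectl get nodes \
    -o jsonpath='{.items[0].status.addresses[?(@.type=="InternalIP")].address}')
  node_port=$(kubectl get svc -n envoy-gateway-system -l "$selector" \
    -o jsonpath='{.items[0].spec.ports[0].nodePort}')
  echo "${node_ip}:${node_port}"
}
```

One way to populate the variable used by the curl commands in the next step is `export GATEWAY_IP=$(gateway_endpoint)`. If your cluster exposes the gateway differently, set `GATEWAY_IP` to whatever address is reachable from your client.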
- Test that the Gateway routes to the different models and backends.
  - Test routing to the Llama-3 model.
        curl -H "Content-Type: application/json" \
          -d '{
            "model": "meta-llama/Llama-3.1-8B-Instruct",
            "messages": [
              {
                "role": "user",
                "content": "Hi. Say this is a test"
              }
            ]
          }' \
          http://$GATEWAY_IP/v1/chat/completions
    A normal response contains a content field, which indicates that the model is running.
        {"choices":[{"finish_reason":"stop","index":0,"message":{"content":"The temperature there is twenty-five degrees centigrade. Give a man a fish and you feed him for a day; Teach a man to fish","role":"assistant"}}],"created":1767755896,"do_remote_decode":false,"do_remote_prefill":false,"id":"chatcmp-561ca69e-9716-411f-9656-7a96d9******","model":"meta-llama/llama-3.1-8B-Instruct","object":"chat.completion","remote_block_id":"","remote_engine_id":"","remote_host":"","remote_port":0,"usage":{"completion_tokens":28,"prompt_tokens":7,"total_tokens":35}}
  - Test routing to the Mistral model.
        curl -H "Content-Type: application/json" \
          -d '{
            "model": "mistral:latest",
            "messages": [
              {
                "role": "user",
                "content": "Hi. Say this is a test"
              }
            ]
          }' \
          http://$GATEWAY_IP/v1/chat/completions
    A normal response contains a content field, which indicates that the model is running.
        {"choices":[{"message":{"content":"This is a test.","role":"assistant"}}]}
  - Test routing to the regular backend workload.
        curl -H "Content-Type: application/json" \
          -d '{
            "model": "some-cool-self-hosted-model",
            "messages": [
              {
                "role": "user",
                "content": "Hi. Say this is a test"
              }
            ]
          }' \
          http://$GATEWAY_IP/v1/chat/completions
    A normal response contains a content field, which indicates that the backend is running.
        {"choices":[{"message":{"role":"assistant","content":"I am the captain of my soul."}}]}
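The three curl requests differ only in the model name, so they can be wrapped in a small helper to exercise every route in one pass. This is a convenience sketch, not part of the original procedure: `chat` and `test_all_routes` are hypothetical names, and `GATEWAY_IP` is assumed to be set already to a reachable gateway address.

```shell
#!/bin/sh
# Hypothetical helper: send one chat completion request for the given model.
# Assumes GATEWAY_IP is already exported.
chat() {
  model="$1"
  curl -s -H "Content-Type: application/json" \
    -d "{\"model\": \"$model\", \"messages\": [{\"role\": \"user\", \"content\": \"Hi. Say this is a test\"}]}" \
    "http://$GATEWAY_IP/v1/chat/completions"
}

# Exercise the three routes defined in the AIGatewayRoute, one after another.
test_all_routes() {
  for m in "meta-llama/Llama-3.1-8B-Instruct" "mistral:latest" "some-cool-self-hosted-model"; do
    echo "== $m =="
    chat "$m"
    echo
  done
}
```

Running `test_all_routes` prints one response per model under its own heading, which makes it easy to see whether any route failed to return a content field.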
