文档首页> AI开发平台ModelArts> 最佳实践> 推理部署> 第三方推理框架迁移到推理自定义引擎
更新时间:2024-04-30 GMT+08:00

第三方推理框架迁移到推理自定义引擎

背景说明

ModelArts支持第三方的推理框架在ModelArts上部署,本文以TFServing框架、Triton框架为例,介绍如何迁移到推理自定义引擎。

  • TensorFlow Serving是一个灵活、高性能的机器学习模型部署系统,提供模型版本管理、服务回滚等能力。通过配置模型路径、模型端口、模型名称等参数,原生TFServing镜像可以快速启动提供服务,并支持gRPC和HTTP Restful API的访问方式。
  • Triton是一个高性能推理服务框架,提供HTTP/gRPC等多种服务协议,支持TensorFlow、TensorRT、PyTorch、ONNXRuntime等多种推理引擎后端,并且支持多模型并发、动态batch等功能,能够提高GPU的使用率,改善推理服务的性能。

当从第三方推理框架迁移到使用ModelArts推理的AI应用管理和服务管理时,需要对原生第三方推理框架镜像的构建方式做一定的改造,以使用ModelArts推理平台的模型版本管理能力和动态加载模型的部署能力。本案例将指导用户完成原生第三方推理框架镜像到ModelArts推理自定义引擎的改造。自定义引擎的镜像制作完成后,即可以通过AI应用导入对模型版本进行管理,并基于AI应用进行部署和管理服务。

适配和改造的主要工作项如下:

图1 改造工作项

针对不同框架的镜像,可能还需要做额外的适配工作,具体差异请见对应框架的操作步骤。

TFServing框架迁移操作步骤

  1. 增加用户ma-user。

    基于原生"tensorflow/serving:2.8.0"镜像构建,镜像中100的用户组默认已存在,Dockerfile中执行如下命令增加用户ma-user。

    RUN useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user

  2. 通过增加nginx代理,支持https协议。

    协议转换为https之后,对外暴露的端口从tfserving的8501变为8080。

    1. Dockerfile中执行如下命令完成nginx的安装和配置。
      RUN apt-get update && apt-get -y --no-install-recommends install nginx && apt-get clean
      RUN mkdir /home/mind && \
          mkdir -p /etc/nginx/keys && \
          mkfifo /etc/nginx/keys/fifo && \
          chown -R ma-user:100 /home/mind && \
          rm -rf /etc/nginx/conf.d/default.conf && \
          chown -R ma-user:100 /etc/nginx/ && \
          chown -R ma-user:100 /var/log/nginx && \
          chown -R ma-user:100 /var/lib/nginx && \
          sed -i "s#/var/run/nginx.pid#/home/ma-user/nginx.pid#g" /etc/init.d/nginx
      ADD nginx /etc/nginx
      ADD run.sh /home/mind/
      ENTRYPOINT []
      CMD /bin/bash /home/mind/run.sh
    2. 准备nginx目录如下:
      nginx
      ├──nginx.conf
      └──conf.d
             ├── modelarts-model-server.conf
    3. 准备nginx.conf文件内容如下:
      user ma-user 100;
      worker_processes 2;
      pid /home/ma-user/nginx.pid;
      include /etc/nginx/modules-enabled/*.conf;
      events {
          worker_connections 768;
      }
      http {
          ##
          # Basic Settings
          ##
          sendfile on;
          tcp_nopush on;
          tcp_nodelay on;
          types_hash_max_size 2048;
          fastcgi_hide_header X-Powered-By;
          port_in_redirect off;
          server_tokens off;
          client_body_timeout 65s;
          client_header_timeout 65s;
          keepalive_timeout 65s;
          send_timeout 65s;
          # server_names_hash_bucket_size 64;
          # server_name_in_redirect off;
          include /etc/nginx/mime.types;
          default_type application/octet-stream;
          ##
          # SSL Settings
          ##
          ssl_protocols TLSv1.2;
          ssl_prefer_server_ciphers on;
          ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256;
          ##
          # Logging Settings
          ##
          access_log /var/log/nginx/access.log;
          error_log /var/log/nginx/error.log;
          ##
          # Gzip Settings
          ##
          gzip on;
          ##
          # Virtual Host Configs
          ##
          include /etc/nginx/conf.d/modelarts-model-server.conf;
      }
    4. 准备modelarts-model-server.conf配置文件内容如下:
      server {
          client_max_body_size 15M;
          large_client_header_buffers 4 64k;
          client_header_buffer_size 1k;
          client_body_buffer_size 16k;
          ssl_certificate /etc/nginx/ssl/server/server.crt;
          ssl_password_file /etc/nginx/keys/fifo;
          ssl_certificate_key /etc/nginx/ssl/server/server.key;
          # setting for mutual ssl with client
          ##
          # header Settings
          ##
          add_header X-XSS-Protection "1; mode=block";
          add_header X-Frame-Options SAMEORIGIN;
          add_header X-Content-Type-Options nosniff;
          add_header Strict-Transport-Security "max-age=31536000; includeSubdomains;";
          add_header Content-Security-Policy "default-src 'self'";
          add_header Cache-Control "max-age=0, no-cache, no-store, must-revalidate";
          add_header Pragma "no-cache";
          add_header Expires "-1";
          server_tokens off;
          port_in_redirect off;
          fastcgi_hide_header X-Powered-By;
          ssl_session_timeout 2m;
          ##
          # SSL Settings
          ##
          ssl_protocols TLSv1.2;
          ssl_prefer_server_ciphers on;
          ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256;
          listen    0.0.0.0:8080 ssl;
          error_page 502 503 /503.html;
          location /503.html {
              return 503 '{"error_code": "ModelArts.4503","error_msg": "Failed to connect to backend service, please confirm your service is connectable. "}';
          }
          location / {
      #       limit_req zone=mylimit;
      #       limit_req_status 429;
              proxy_pass http://127.0.0.1:8501;
          }
      }
    5. 准备启动脚本。

      启动前先创建ssl证书,然后启动TFServing的启动脚本。

      启动脚本run.sh示例代码如下:

      #!/bin/bash
      mkdir -p /etc/nginx/ssl/server && cd /etc/nginx/ssl/server
      cipherText=$(openssl rand -base64 32)
      openssl genrsa -aes256 -passout pass:"${cipherText}" -out server.key 2048
      openssl rsa -in server.key -passin pass:"${cipherText}" -pubout -out rsa_public.key
      openssl req -new -key server.key -passin pass:"${cipherText}" -out server.csr -subj "/C=CN/ST=GD/L=SZ/O=Huawei/OU=ops/CN=*.huawei.com"
      openssl genrsa -out ca.key 2048
      openssl req -new -x509 -days 3650 -key ca.key -out ca-crt.pem -subj "/C=CN/ST=GD/L=SZ/O=Huawei/OU=dev/CN=ca"
      openssl x509 -req -days 3650 -in server.csr -CA ca-crt.pem -CAkey ca.key -CAcreateserial -out server.crt
      service nginx start &
      echo ${cipherText} > /etc/nginx/keys/fifo
      unset cipherText
      sh /usr/bin/tf_serving_entrypoint.sh

  3. 修改模型默认路径,支持ModelArts推理模型动态加载。

    Dockerfile中执行如下命令修改默认的模型路径。

    ENV MODEL_BASE_PATH /home/mind
    ENV MODEL_NAME model

完整的Dockerfile参考:

FROM tensorflow/serving:2.8.0
RUN useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user
RUN apt-get update && apt-get -y --no-install-recommends install nginx && apt-get clean
RUN mkdir /home/mind && \
    mkdir -p /etc/nginx/keys && \
    mkfifo /etc/nginx/keys/fifo && \
    chown -R ma-user:100 /home/mind && \
    rm -rf /etc/nginx/conf.d/default.conf && \
    chown -R ma-user:100 /etc/nginx/ && \
    chown -R ma-user:100 /var/log/nginx && \
    chown -R ma-user:100 /var/lib/nginx && \
    sed -i "s#/var/run/nginx.pid#/home/ma-user/nginx.pid#g" /etc/init.d/nginx
ADD nginx /etc/nginx
ADD run.sh /home/mind/
ENV MODEL_BASE_PATH /home/mind
ENV MODEL_NAME model
ENTRYPOINT []
CMD /bin/bash /home/mind/run.sh

Triton框架迁移操作步骤

本教程基于nvidia官方提供的nvcr.io/nvidia/tritonserver:23.03-py3镜像进行适配,使用开源大模型llama7b进行推理任务。

  1. 增加用户ma-user。

    Triton镜像中默认已存在id为1000的triton-server用户,需先修改triton-server用户名id后再增加用户ma-user,Dockerfile中执行如下命令。

    RUN usermod -u 1001 triton-server && useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user

  2. 通过增加nginx代理,支持https协议。

    1. Dockerfile中执行如下命令完成nginx的安装和配置。
      RUN apt-get update && apt-get -y --no-install-recommends install nginx && apt-get clean && \
          mkdir /home/mind && \
          mkdir -p /etc/nginx/keys && \
          mkfifo /etc/nginx/keys/fifo && \
          chown -R ma-user:100 /home/mind && \
          rm -rf /etc/nginx/conf.d/default.conf && \
          chown -R ma-user:100 /etc/nginx/ && \
          chown -R ma-user:100 /var/log/nginx && \
          chown -R ma-user:100 /var/lib/nginx && \
          sed -i "s#/var/run/nginx.pid#/home/ma-user/nginx.pid#g" /etc/init.d/nginx
    2. 准备nginx目录如下:
      nginx
      ├──nginx.conf
      └──conf.d
             ├── modelarts-model-server.conf
    3. 准备nginx.conf文件内容如下:
      user ma-user 100;
      worker_processes 2;
      pid /home/ma-user/nginx.pid;
      include /etc/nginx/modules-enabled/*.conf;
      events {
          worker_connections 768;
      }
      http {
          ##
          # Basic Settings
          ##
          sendfile on;
          tcp_nopush on;
          tcp_nodelay on;
          types_hash_max_size 2048;
          fastcgi_hide_header X-Powered-By;
          port_in_redirect off;
          server_tokens off;
          client_body_timeout 65s;
          client_header_timeout 65s;
          keepalive_timeout 65s;
          send_timeout 65s;
          # server_names_hash_bucket_size 64;
          # server_name_in_redirect off;
          include /etc/nginx/mime.types;
          default_type application/octet-stream;
          ##
          # SSL Settings
          ##
          ssl_protocols TLSv1.2;
          ssl_prefer_server_ciphers on;
          ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256;
          ##
          # Logging Settings
          ##
          access_log /var/log/nginx/access.log;
          error_log /var/log/nginx/error.log;
          ##
          # Gzip Settings
          ##
          gzip on;
          ##
          # Virtual Host Configs
          ##
          include /etc/nginx/conf.d/modelarts-model-server.conf;
      }
    4. 准备modelarts-model-server.conf配置文件内容如下:
      server {
          client_max_body_size 15M;
          large_client_header_buffers 4 64k;
          client_header_buffer_size 1k;
          client_body_buffer_size 16k;
          ssl_certificate /etc/nginx/ssl/server/server.crt;
          ssl_password_file /etc/nginx/keys/fifo;
          ssl_certificate_key /etc/nginx/ssl/server/server.key;
          # setting for mutual ssl with client
          ##
          # header Settings
          ##
          add_header X-XSS-Protection "1; mode=block";
          add_header X-Frame-Options SAMEORIGIN;
          add_header X-Content-Type-Options nosniff;
          add_header Strict-Transport-Security "max-age=31536000; includeSubdomains;";
          add_header Content-Security-Policy "default-src 'self'";
          add_header Cache-Control "max-age=0, no-cache, no-store, must-revalidate";
          add_header Pragma "no-cache";
          add_header Expires "-1";
          server_tokens off;
          port_in_redirect off;
          fastcgi_hide_header X-Powered-By;
          ssl_session_timeout 2m;
          ##
          # SSL Settings
          ##
          ssl_protocols TLSv1.2;
          ssl_prefer_server_ciphers on;
          ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256;
          listen    0.0.0.0:8080 ssl;
          error_page 502 503 /503.html;
          location /503.html {
              return 503 '{"error_code": "ModelArts.4503","error_msg": "Failed to connect to backend service, please confirm your service is connectable. "}';
          }
          location / {
      #       limit_req zone=mylimit;
      #       limit_req_status 429;
              proxy_pass http://127.0.0.1:8000;
          }
      }
    5. 准备启动脚本run.sh。

      启动前先创建ssl证书,然后启动Triton的启动脚本。

      #!/bin/bash
      mkdir -p /etc/nginx/ssl/server && cd /etc/nginx/ssl/server
      cipherText=$(openssl rand -base64 32)
      openssl genrsa -aes256 -passout pass:"${cipherText}" -out server.key 2048
      openssl rsa -in server.key -passin pass:"${cipherText}" -pubout -out rsa_public.key
      openssl req -new -key server.key -passin pass:"${cipherText}" -out server.csr -subj "/C=CN/ST=GD/L=SZ/O=Huawei/OU=ops/CN=*.huawei.com"
      openssl genrsa -out ca.key 2048
      openssl req -new -x509 -days 3650 -key ca.key -out ca-crt.pem -subj "/C=CN/ST=GD/L=SZ/O=Huawei/OU=dev/CN=ca"
      openssl x509 -req -days 3650 -in server.csr -CA ca-crt.pem -CAkey ca.key -CAcreateserial -out server.crt
      service nginx start &
      echo ${cipherText} > /etc/nginx/keys/fifo
      unset cipherText
      
      bash /home/mind/model/triton_serving.sh

  3. 编译安装tensorrtllm_backend。

    1. Dockerfile中执行如下命令获取tensorrtllm_backend源码,安装tensorrt、cmake和pytorch等相关依赖,并进行编译安装。
      # get tensortllm_backend source code
      WORKDIR /opt/tritonserver
      RUN apt-get install -y --no-install-recommends rapidjson-dev python-is-python3 git-lfs && \
          git config --global http.sslVerify false && \
          git config --global http.postBuffer 1048576000 && \
          git clone -b v0.5.0 https://github.com/triton-inference-server/tensorrtllm_backend.git --depth 1 && \
          cd tensorrtllm_backend && git lfs install && \
          git config submodule.tensorrt_llm.url https://github.com/NVIDIA/TensorRT-LLM.git && \
          git submodule update --init --recursive --depth 1 && \
          pip3 install -r requirements.txt
      
      # build tensorrtllm_backend
      WORKDIR /opt/tritonserver/tensorrtllm_backend/tensorrt_llm
      RUN sed -i "s/wget/wget --no-check-certificate/g" docker/common/install_tensorrt.sh && \
          bash docker/common/install_tensorrt.sh && \
          export  LD_LIBRARY_PATH=/usr/local/tensorrt/lib:${LD_LIBRARY_PATH} && \
          sed -i "s/wget/wget --no-check-certificate/g" docker/common/install_cmake.sh && \
          bash docker/common/install_cmake.sh && \
          export PATH=/usr/local/cmake/bin:$PATH && \
          bash docker/common/install_pytorch.sh pypi && \
          python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt && \
          pip install ./build/tensorrt_llm-0.5.0-py3-none-any.whl && \
          rm -f ./build/tensorrt_llm-0.5.0-py3-none-any.whl && \
          cd ../inflight_batcher_llm && bash scripts/build.sh && \
          mkdir /opt/tritonserver/backends/tensorrtllm && \
          cp ./build/libtriton_tensorrtllm.so /opt/tritonserver/backends/tensorrtllm/ && \
          chown -R ma-user:100 /opt/tritonserver
    2. 准备triton serving的启动脚本triton_serving.sh,llama模型的参考样例如下:
      MODEL_NAME=llama_7b
      MODEL_DIR=/home/mind/model/${MODEL_NAME}
      OUTPUT_DIR=/tmp/llama/7B/trt_engines/fp16/1-gpu/
      MAX_BATCH_SIZE=1
      export LD_LIBRARY_PATH=/usr/local/tensorrt/lib:${LD_LIBRARY_PATH}
      
      # build tensorrt_llm engine
      cd /opt/tritonserver/tensorrtllm_backend/tensorrt_llm/examples/llama
      python build.py --model_dir ${MODEL_DIR} \
                      --dtype float16 \
                      --remove_input_padding \
                      --use_gpt_attention_plugin float16 \
                      --enable_context_fmha \
                      --use_weight_only \
                      --use_gemm_plugin float16 \
                      --output_dir ${OUTPUT_DIR} \
                      --paged_kv_cache \
                      --max_batch_size ${MAX_BATCH_SIZE}
      
      # set config parameters
      cd /opt/tritonserver/tensorrtllm_backend
      mkdir triton_model_repo
      cp all_models/inflight_batcher_llm/* triton_model_repo/ -r
      
      python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:${MODEL_DIR},tokenizer_type:llama,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1
      python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:${MODEL_DIR},tokenizer_type:llama,triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1
      python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
      python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${OUTPUT_DIR},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:V1,max_queue_delay_microseconds:600
      
      # launch tritonserver
      python3 scripts/launch_triton_server.py --world_size 1 --model_repo=triton_model_repo/
      while true; do sleep 10000; done

      部分参数说明:

      • MODEL_NAME:HuggingFace格式模型权重文件所在OBS文件夹名称。
      • OUTPUT_DIR:通过TensorRT-LLM转换后的模型文件在容器中的路径。

      完整的Dockerfile如下:

      FROM nvcr.io/nvidia/tritonserver:23.03-py3
      
      # add ma-user and install nginx
      RUN usermod -u 1001 triton-server && useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user && \
          apt-get update && apt-get -y --no-install-recommends install nginx && apt-get clean && \
          mkdir /home/mind && \
          mkdir -p /etc/nginx/keys && \
          mkfifo /etc/nginx/keys/fifo && \
          chown -R ma-user:100 /home/mind && \
          rm -rf /etc/nginx/conf.d/default.conf && \
          chown -R ma-user:100 /etc/nginx/ && \
          chown -R ma-user:100 /var/log/nginx && \
          chown -R ma-user:100 /var/lib/nginx && \
          sed -i "s#/var/run/nginx.pid#/home/ma-user/nginx.pid#g" /etc/init.d/nginx
      
      # get tensortllm_backend source code
      WORKDIR /opt/tritonserver
      RUN apt-get install -y --no-install-recommends rapidjson-dev python-is-python3 git-lfs && \
          git config --global http.sslVerify false && \
          git config --global http.postBuffer 1048576000 && \
          git clone -b v0.5.0 https://github.com/triton-inference-server/tensorrtllm_backend.git --depth 1 && \
          cd tensorrtllm_backend && git lfs install && \
          git config submodule.tensorrt_llm.url https://github.com/NVIDIA/TensorRT-LLM.git && \
          git submodule update --init --recursive --depth 1 && \
          pip3 install -r requirements.txt
      
      # build tensorrtllm_backend
      WORKDIR /opt/tritonserver/tensorrtllm_backend/tensorrt_llm
      RUN sed -i "s/wget/wget --no-check-certificate/g" docker/common/install_tensorrt.sh && \
          bash docker/common/install_tensorrt.sh && \
          export  LD_LIBRARY_PATH=/usr/local/tensorrt/lib:${LD_LIBRARY_PATH} && \
          sed -i "s/wget/wget --no-check-certificate/g" docker/common/install_cmake.sh && \
          bash docker/common/install_cmake.sh && \
          export PATH=/usr/local/cmake/bin:$PATH && \
          bash docker/common/install_pytorch.sh pypi && \
          python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt && \
          pip install ./build/tensorrt_llm-0.5.0-py3-none-any.whl && \
          rm -f ./build/tensorrt_llm-0.5.0-py3-none-any.whl && \
          cd ../inflight_batcher_llm && bash scripts/build.sh && \
          mkdir /opt/tritonserver/backends/tensorrtllm && \
          cp ./build/libtriton_tensorrtllm.so /opt/tritonserver/backends/tensorrtllm/ && \
          chown -R ma-user:100 /opt/tritonserver
      
      ADD nginx /etc/nginx
      ADD run.sh /home/mind/
      CMD /bin/bash /home/mind/run.sh

      完成镜像构建后,将镜像注册至华为云容器镜像服务SWR中,用于后续在ModelArts上部署推理服务。

  4. 使用适配后的镜像在ModelArts部署在线推理服务。

    1. 在obs中创建model目录,并将triton_serving.sh文件和llama_7b文件夹上传至model目录下,如下图所示。
      图2 上传至model目录
    2. 创建AI应用,源模型来源选择“从对象存储服务(OBS)中选择”,元模型选择至model目录,AI引擎选择Custom,引擎包选择步骤3构建的镜像。
      图3 创建AI应用
    3. 将创建的AI应用部署为在线服务,大模型加载启动的时间一般大于普通的模型创建的服务,请配置合理的“部署超时时间”,避免尚未启动完成被认为超时而导致部署失败。
      图4 部署为在线服务
    4. 调用在线服务进行大模型推理,请求路径填写/v2/models/ensemble/infer,调用样例如下:
      {
          "inputs": [
              {
                  "name": "text_input",
                  "shape": [1, 1],
                  "datatype": "BYTES",
                  "data": ["what is machine learning"]
              },
              {
                  "name": "max_tokens",
                  "shape": [1, 1],
                  "datatype": "UINT32",
                  "data": [64]
              },
              {
                  "name": "bad_words",
                  "shape": [1, 1],
                  "datatype": "BYTES",
                  "data": [""]
              },
              {
                  "name": "stop_words",
                  "shape": [1, 1],
                  "datatype": "BYTES",
                  "data": [""]
              },
              {
                  "name": "pad_id",
                  "shape": [1, 1],
                  "datatype": "UINT32",
                  "data": [2]
              },
              {
                  "name": "end_id",
                  "shape": [1, 1],
                  "datatype": "UINT32",
                  "data": [2]
              }
          ],
          "outputs": [
              {
                  "name": "text_output"
              }
          ]
      }
      • "inputs"中"name"为"text_input"的元素代表输入,"data"为具体输入语句,本示例中为"what is machine learning"。
      • "inputs"中"name"为"max_tokens"的元素代表输出最大tokens数,"data"为具体数值,本示例中为64。
      图5 调用在线服务