Migrating a Third-Party Inference Framework to a Custom Inference Engine

Updated on 2024-03-30 GMT+08:00

Context

ModelArts allows the deployment of third-party inference frameworks. This section describes how to migrate TF Serving and Triton to a custom inference engine.

  • TensorFlow Serving (TF Serving) is a flexible, high-performance model serving system for machine learning that provides model version management and service rollback. By configuring parameters such as the model path, port, and name, a native TF Serving image can quickly start serving, and the service can be accessed through gRPC and HTTP RESTful APIs (see the sample request after this list).
  • Triton is a high-performance inference service framework developed by NVIDIA. It supports multiple service protocols, including HTTP and gRPC. Additionally, Triton is compatible with various inference engine backends such as TensorFlow, TensorRT, PyTorch, and ONNX Runtime. Notably, it enables multi-model concurrency and dynamic batching, effectively optimizing GPU utilization and enhancing inference service performance.
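For example, a native TF Serving container exposes its RESTful prediction API on port 8501. The request below is a minimal sketch of such a call; the model name and input tensor are placeholders:

# Hypothetical request against a native TF Serving container (REST port 8501).
curl -X POST http://127.0.0.1:8501/v1/models/model:predict \
     -H "Content-Type: application/json" \
     -d '{"instances": [[1.0, 2.0, 3.0]]}'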

Migrating a third-party framework to the ModelArts inference framework requires reconstructing the native framework image. After the reconstruction, ModelArts model version management and dynamic model loading become available. This section shows how to complete such a reconstruction. Once the custom-engine image is created, you can use it to create an AI application version and then deploy and manage services based on that application.

The following figure shows the reconstruction items.

Figure 1 Reconstruction items

The reconstruction process may differ for images from various frameworks. For details, see the migration procedure specific to the target framework.

Migrating TF Serving

  1. Add user ma-user.

    The image is built based on the native tensorflow/serving:2.8.0 image. The user group 100 exists in the image by default. Run the following command in the Dockerfile to add user ma-user:

    RUN useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user

  2. Set up an Nginx proxy to support HTTPS.

    After the protocol is converted to HTTPS, the exposed port changes from TF Serving's native 8501 to 8080.

    1. Run the following commands in the Dockerfile to install and configure Nginx:
      RUN apt-get update && apt-get -y --no-install-recommends install nginx && apt-get clean
      RUN mkdir /home/mind && \
          mkdir -p /etc/nginx/keys && \
          mkfifo /etc/nginx/keys/fifo && \
          chown -R ma-user:100 /home/mind && \
          rm -rf /etc/nginx/conf.d/default.conf && \
          chown -R ma-user:100 /etc/nginx/ && \
          chown -R ma-user:100 /var/log/nginx && \
          chown -R ma-user:100 /var/lib/nginx && \
          sed -i "s#/var/run/nginx.pid#/home/ma-user/nginx.pid#g" /etc/init.d/nginx
      ADD nginx /etc/nginx
      ADD run.sh /home/mind/
      ENTRYPOINT []
      CMD /bin/bash /home/mind/run.sh
    2. Create the Nginx directory.
      nginx
      ├──nginx.conf
      └──conf.d
             └── modelarts-model-server.conf
    3. Write the nginx.conf file.
      user ma-user 100;
      worker_processes 2;
      pid /home/ma-user/nginx.pid;
      include /etc/nginx/modules-enabled/*.conf;
      events {
          worker_connections 768;
      }
      http {
          ##
          # Basic Settings
          ##
          sendfile on;
          tcp_nopush on;
          tcp_nodelay on;
          types_hash_max_size 2048;
          fastcgi_hide_header X-Powered-By;
          port_in_redirect off;
          server_tokens off;
          client_body_timeout 65s;
          client_header_timeout 65s;
          keepalive_timeout 65s;
          send_timeout 65s;
          # server_names_hash_bucket_size 64;
          # server_name_in_redirect off;
          include /etc/nginx/mime.types;
          default_type application/octet-stream;
          ##
          # SSL Settings
          ##
          ssl_protocols TLSv1.2;
          ssl_prefer_server_ciphers on;
          ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256;
          ##
          # Logging Settings
          ##
          access_log /var/log/nginx/access.log;
          error_log /var/log/nginx/error.log;
          ##
          # Gzip Settings
          ##
          gzip on;
          ##
          # Virtual Host Configs
          ##
          include /etc/nginx/conf.d/modelarts-model-server.conf;
      }
    4. Write the modelarts-model-server.conf configuration file.
      server {
          client_max_body_size 15M;
          large_client_header_buffers 4 64k;
          client_header_buffer_size 1k;
          client_body_buffer_size 16k;
          ssl_certificate /etc/nginx/ssl/server/server.crt;
          ssl_password_file /etc/nginx/keys/fifo;
          ssl_certificate_key /etc/nginx/ssl/server/server.key;
          # setting for mutual ssl with client
          ##
          # header Settings
          ##
          add_header X-XSS-Protection "1; mode=block";
          add_header X-Frame-Options SAMEORIGIN;
          add_header X-Content-Type-Options nosniff;
          add_header Strict-Transport-Security "max-age=31536000; includeSubdomains;";
          add_header Content-Security-Policy "default-src 'self'";
          add_header Cache-Control "max-age=0, no-cache, no-store, must-revalidate";
          add_header Pragma "no-cache";
          add_header Expires "-1";
          server_tokens off;
          port_in_redirect off;
          fastcgi_hide_header X-Powered-By;
          ssl_session_timeout 2m;
          ##
          # SSL Settings
          ##
          ssl_protocols TLSv1.2;
          ssl_prefer_server_ciphers on;
          ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256;
          listen    0.0.0.0:8080 ssl;
          error_page 502 503 /503.html;
          location /503.html {
              return 503 '{"error_code": "ModelArts.4503","error_msg": "Failed to connect to backend service, please confirm your service is connectable. "}';
          }
          location / {
      #       limit_req zone=mylimit;
      #       limit_req_status 429;
              proxy_pass http://127.0.0.1:8501;
          }
      }
    5. Create a startup script.
      NOTE:

      An SSL certificate must be created before the TF Serving entrypoint (tf_serving_entrypoint.sh) is executed; the sample run.sh below generates a self-signed certificate before starting TF Serving.

      The sample code of the startup script run.sh is as follows:

      #!/bin/bash
      mkdir -p /etc/nginx/ssl/server && cd /etc/nginx/ssl/server
      cipherText=$(openssl rand -base64 32)
      openssl genrsa -aes256 -passout pass:"${cipherText}" -out server.key 2048
      openssl rsa -in server.key -passin pass:"${cipherText}" -pubout -out rsa_public.key
      openssl req -new -key server.key -passin pass:"${cipherText}" -out server.csr -subj "/C=CN/ST=GD/L=SZ/O=Huawei/OU=ops/CN=*.huawei.com"
      openssl genrsa -out ca.key 2048
      openssl req -new -x509 -days 3650 -key ca.key -out ca-crt.pem -subj "/C=CN/ST=GD/L=SZ/O=Huawei/OU=dev/CN=ca"
      openssl x509 -req -days 3650 -in server.csr -CA ca-crt.pem -CAkey ca.key -CAcreateserial -out server.crt
      service nginx start &
      echo ${cipherText} > /etc/nginx/keys/fifo
      unset cipherText
      sh /usr/bin/tf_serving_entrypoint.sh

  3. Modify the default model path to support ModelArts model dynamic loading.

    Run the following commands in the Dockerfile to change the default model path:

    ENV MODEL_BASE_PATH /home/mind
    ENV MODEL_NAME model
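
    With these settings, TF Serving serves the model from ${MODEL_BASE_PATH}/${MODEL_NAME}, that is, /home/mind/model, which is where ModelArts is expected to place the model files at service startup. The following sketch shows the expected layout, assuming a SavedModel uploaded as version 1:

    /home/mind/model
    └── 1
        ├── saved_model.pb
        └── variables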

Dockerfile example:

FROM tensorflow/serving:2.8.0
RUN useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user
RUN apt-get update && apt-get -y --no-install-recommends install nginx && apt-get clean
RUN mkdir /home/mind && \
    mkdir -p /etc/nginx/keys && \
    mkfifo /etc/nginx/keys/fifo && \
    chown -R ma-user:100 /home/mind && \
    rm -rf /etc/nginx/conf.d/default.conf && \
    chown -R ma-user:100 /etc/nginx/ && \
    chown -R ma-user:100 /var/log/nginx && \
    chown -R ma-user:100 /var/lib/nginx && \
    sed -i "s#/var/run/nginx.pid#/home/ma-user/nginx.pid#g" /etc/init.d/nginx
ADD nginx /etc/nginx
ADD run.sh /home/mind/
ENV MODEL_BASE_PATH /home/mind
ENV MODEL_NAME model
ENTRYPOINT []
CMD /bin/bash /home/mind/run.sh
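
Before uploading the image, you can build it and check the HTTPS proxy locally. The commands below are a minimal sketch; the image name, tag, and mounted model path are illustrative, and -k is needed because run.sh generates a self-signed certificate.

# Hypothetical local verification; mount a SavedModel directory as /home/mind/model.
docker build -t tf-serving-modelarts:2.8.0 .
docker run -d -p 8080:8080 -v /path/to/model:/home/mind/model tf-serving-modelarts:2.8.0
# Query the TF Serving model status API through the Nginx HTTPS proxy on port 8080.
curl -k https://127.0.0.1:8080/v1/models/model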

Migrating Triton

This section adapts the nvcr.io/nvidia/tritonserver:23.03-py3 image provided by NVIDIA and uses the open-source foundation model LLaMA 7B for inference.

  1. Add user ma-user.

    The Triton image contains the triton-server user (UID 1000) by default. Run the following command in the Dockerfile to change the triton-server UID to 1001 and add user ma-user with UID 1000:

    RUN usermod -u 1001 triton-server && useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user

  2. Set up an Nginx proxy to support HTTPS.

    1. Run the following commands in the Dockerfile to install and configure Nginx:
      RUN apt-get update && apt-get -y --no-install-recommends install nginx && apt-get clean && \
          mkdir /home/mind && \
          mkdir -p /etc/nginx/keys && \
          mkfifo /etc/nginx/keys/fifo && \
          chown -R ma-user:100 /home/mind && \
          rm -rf /etc/nginx/conf.d/default.conf && \
          chown -R ma-user:100 /etc/nginx/ && \
          chown -R ma-user:100 /var/log/nginx && \
          chown -R ma-user:100 /var/lib/nginx && \
          sed -i "s#/var/run/nginx.pid#/home/ma-user/nginx.pid#g" /etc/init.d/nginx
    2. Create the Nginx directory as follows:
      nginx
      ├──nginx.conf
      └──conf.d
             └── modelarts-model-server.conf
    3. Write the nginx.conf file (identical to the one used in the TF Serving migration).
      user ma-user 100;
      worker_processes 2;
      pid /home/ma-user/nginx.pid;
      include /etc/nginx/modules-enabled/*.conf;
      events {
          worker_connections 768;
      }
      http {
          ##
          # Basic Settings
          ##
          sendfile on;
          tcp_nopush on;
          tcp_nodelay on;
          types_hash_max_size 2048;
          fastcgi_hide_header X-Powered-By;
          port_in_redirect off;
          server_tokens off;
          client_body_timeout 65s;
          client_header_timeout 65s;
          keepalive_timeout 65s;
          send_timeout 65s;
          # server_names_hash_bucket_size 64;
          # server_name_in_redirect off;
          include /etc/nginx/mime.types;
          default_type application/octet-stream;
          ##
          # SSL Settings
          ##
          ssl_protocols TLSv1.2;
          ssl_prefer_server_ciphers on;
          ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256;
          ##
          # Logging Settings
          ##
          access_log /var/log/nginx/access.log;
          error_log /var/log/nginx/error.log;
          ##
          # Gzip Settings
          ##
          gzip on;
          ##
          # Virtual Host Configs
          ##
          include /etc/nginx/conf.d/modelarts-model-server.conf;
      }
    4. Write the modelarts-model-server.conf configuration file. It is identical to the TF Serving version except that requests are proxied to Triton's default HTTP port 8000.
      server {
          client_max_body_size 15M;
          large_client_header_buffers 4 64k;
          client_header_buffer_size 1k;
          client_body_buffer_size 16k;
          ssl_certificate /etc/nginx/ssl/server/server.crt;
          ssl_password_file /etc/nginx/keys/fifo;
          ssl_certificate_key /etc/nginx/ssl/server/server.key;
          # setting for mutual ssl with client
          ##
          # header Settings
          ##
          add_header X-XSS-Protection "1; mode=block";
          add_header X-Frame-Options SAMEORIGIN;
          add_header X-Content-Type-Options nosniff;
          add_header Strict-Transport-Security "max-age=31536000; includeSubdomains;";
          add_header Content-Security-Policy "default-src 'self'";
          add_header Cache-Control "max-age=0, no-cache, no-store, must-revalidate";
          add_header Pragma "no-cache";
          add_header Expires "-1";
          server_tokens off;
          port_in_redirect off;
          fastcgi_hide_header X-Powered-By;
          ssl_session_timeout 2m;
          ##
          # SSL Settings
          ##
          ssl_protocols TLSv1.2;
          ssl_prefer_server_ciphers on;
          ssl_ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256;
          listen    0.0.0.0:8080 ssl;
          error_page 502 503 /503.html;
          location /503.html {
              return 503 '{"error_code": "ModelArts.4503","error_msg": "Failed to connect to backend service, please confirm your service is connectable. "}';
          }
          location / {
      #       limit_req zone=mylimit;
      #       limit_req_status 429;
              proxy_pass http://127.0.0.1:8000;
          }
      }
    5. Create a startup script run.sh.
      NOTE:

      An SSL certificate must be created before the Triton startup script (triton_serving.sh) is executed; the sample run.sh below generates a self-signed certificate before starting Triton.

      #!/bin/bash
      mkdir -p /etc/nginx/ssl/server && cd /etc/nginx/ssl/server
      cipherText=$(openssl rand -base64 32)
      openssl genrsa -aes256 -passout pass:"${cipherText}" -out server.key 2048
      openssl rsa -in server.key -passin pass:"${cipherText}" -pubout -out rsa_public.key
      openssl req -new -key server.key -passin pass:"${cipherText}" -out server.csr -subj "/C=CN/ST=GD/L=SZ/O=Huawei/OU=ops/CN=*.huawei.com"
      openssl genrsa -out ca.key 2048
      openssl req -new -x509 -days 3650 -key ca.key -out ca-crt.pem -subj "/C=CN/ST=GD/L=SZ/O=Huawei/OU=dev/CN=ca"
      openssl x509 -req -days 3650 -in server.csr -CA ca-crt.pem -CAkey ca.key -CAcreateserial -out server.crt
      service nginx start &
      echo ${cipherText} > /etc/nginx/keys/fifo
      unset cipherText
      
      bash /home/mind/model/triton_serving.sh
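
    The last line assumes that ModelArts mounts the model directory uploaded to OBS (see the deployment step later in this section) to /home/mind/model in the container, so the startup script and the weights end up in a layout like the following sketch; the files under llama_7b are illustrative Hugging Face weight files:

      /home/mind/model
      ├── triton_serving.sh
      └── llama_7b
             ├── config.json
             ├── tokenizer.model
             └── ...(remaining weight files)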

  3. Set up tensorrtllm_backend.

    1. Obtain the tensorrtllm_backend source code, install its dependencies (TensorRT, CMake, and PyTorch), and then compile and install it.
      # get tensorrtllm_backend source code
      WORKDIR /opt/tritonserver
      RUN apt-get install -y --no-install-recommends rapidjson-dev python-is-python3 git-lfs && \
          git config --global http.sslVerify false && \
          git config --global http.postBuffer 1048576000 && \
          git clone -b v0.5.0 https://github.com/triton-inference-server/tensorrtllm_backend.git --depth 1 && \
          cd tensorrtllm_backend && git lfs install && \
          git config submodule.tensorrt_llm.url https://github.com/NVIDIA/TensorRT-LLM.git && \
          git submodule update --init --recursive --depth 1 && \
          pip3 install -r requirements.txt
      
      # build tensorrtllm_backend
      WORKDIR /opt/tritonserver/tensorrtllm_backend/tensorrt_llm
      RUN sed -i "s/wget/wget --no-check-certificate/g" docker/common/install_tensorrt.sh && \
          bash docker/common/install_tensorrt.sh && \
          export  LD_LIBRARY_PATH=/usr/local/tensorrt/lib:${LD_LIBRARY_PATH} && \
          sed -i "s/wget/wget --no-check-certificate/g" docker/common/install_cmake.sh && \
          bash docker/common/install_cmake.sh && \
          export PATH=/usr/local/cmake/bin:$PATH && \
          bash docker/common/install_pytorch.sh pypi && \
          python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt && \
          pip install ./build/tensorrt_llm-0.5.0-py3-none-any.whl && \
          rm -f ./build/tensorrt_llm-0.5.0-py3-none-any.whl && \
          cd ../inflight_batcher_llm && bash scripts/build.sh && \
          mkdir /opt/tritonserver/backends/tensorrtllm && \
          cp ./build/libtriton_tensorrtllm.so /opt/tritonserver/backends/tensorrtllm/ && \
          chown -R ma-user:100 /opt/tritonserver
    2. Create the Triton serving startup script triton_serving.sh. The following is an example for the LLaMA model:
      MODEL_NAME=llama_7b
      MODEL_DIR=/home/mind/model/${MODEL_NAME}
      OUTPUT_DIR=/tmp/llama/7B/trt_engines/fp16/1-gpu/
      MAX_BATCH_SIZE=1
      export LD_LIBRARY_PATH=/usr/local/tensorrt/lib:${LD_LIBRARY_PATH}
      
      # build tensorrt_llm engine
      cd /opt/tritonserver/tensorrtllm_backend/tensorrt_llm/examples/llama
      python build.py --model_dir ${MODEL_DIR} \
                      --dtype float16 \
                      --remove_input_padding \
                      --use_gpt_attention_plugin float16 \
                      --enable_context_fmha \
                      --use_weight_only \
                      --use_gemm_plugin float16 \
                      --output_dir ${OUTPUT_DIR} \
                      --paged_kv_cache \
                      --max_batch_size ${MAX_BATCH_SIZE}
      
      # set config parameters
      cd /opt/tritonserver/tensorrtllm_backend
      mkdir triton_model_repo
      cp all_models/inflight_batcher_llm/* triton_model_repo/ -r
      
      python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:${MODEL_DIR},tokenizer_type:llama,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1
      python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:${MODEL_DIR},tokenizer_type:llama,triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1
      python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}
      python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,max_beam_width:1,engine_dir:${OUTPUT_DIR},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:V1,max_queue_delay_microseconds:600
      
      # launch tritonserver
      python3 scripts/launch_triton_server.py --world_size 1 --model_repo=triton_model_repo/
      while true; do sleep 10000; done

      Description of some parameters:

      • MODEL_NAME: name of the OBS folder that stores the model weight files in Hugging Face format.
      • OUTPUT_DIR: path inside the container where the model converted by TensorRT-LLM is stored.

      The complete Dockerfile is as follows:

      FROM nvcr.io/nvidia/tritonserver:23.03-py3
      
      # add ma-user and install nginx
      RUN usermod -u 1001 triton-server && useradd -d /home/ma-user -m -u 1000 -g 100 -s /bin/bash ma-user && \
          apt-get update && apt-get -y --no-install-recommends install nginx && apt-get clean && \
          mkdir /home/mind && \
          mkdir -p /etc/nginx/keys && \
          mkfifo /etc/nginx/keys/fifo && \
          chown -R ma-user:100 /home/mind && \
          rm -rf /etc/nginx/conf.d/default.conf && \
          chown -R ma-user:100 /etc/nginx/ && \
          chown -R ma-user:100 /var/log/nginx && \
          chown -R ma-user:100 /var/lib/nginx && \
          sed -i "s#/var/run/nginx.pid#/home/ma-user/nginx.pid#g" /etc/init.d/nginx
      
      # get tensorrtllm_backend source code
      WORKDIR /opt/tritonserver
      RUN apt-get install -y --no-install-recommends rapidjson-dev python-is-python3 git-lfs && \
          git config --global http.sslVerify false && \
          git config --global http.postBuffer 1048576000 && \
          git clone -b v0.5.0 https://github.com/triton-inference-server/tensorrtllm_backend.git --depth 1 && \
          cd tensorrtllm_backend && git lfs install && \
          git config submodule.tensorrt_llm.url https://github.com/NVIDIA/TensorRT-LLM.git && \
          git submodule update --init --recursive --depth 1 && \
          pip3 install -r requirements.txt
      
      # build tensorrtllm_backend
      WORKDIR /opt/tritonserver/tensorrtllm_backend/tensorrt_llm
      RUN sed -i "s/wget/wget --no-check-certificate/g" docker/common/install_tensorrt.sh && \
          bash docker/common/install_tensorrt.sh && \
          export  LD_LIBRARY_PATH=/usr/local/tensorrt/lib:${LD_LIBRARY_PATH} && \
          sed -i "s/wget/wget --no-check-certificate/g" docker/common/install_cmake.sh && \
          bash docker/common/install_cmake.sh && \
          export PATH=/usr/local/cmake/bin:$PATH && \
          bash docker/common/install_pytorch.sh pypi && \
          python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt && \
          pip install ./build/tensorrt_llm-0.5.0-py3-none-any.whl && \
          rm -f ./build/tensorrt_llm-0.5.0-py3-none-any.whl && \
          cd ../inflight_batcher_llm && bash scripts/build.sh && \
          mkdir /opt/tritonserver/backends/tensorrtllm && \
          cp ./build/libtriton_tensorrtllm.so /opt/tritonserver/backends/tensorrtllm/ && \
          chown -R ma-user:100 /opt/tritonserver
      
      ADD nginx /etc/nginx
      ADD run.sh /home/mind/
      CMD /bin/bash /home/mind/run.sh
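
      With the Dockerfile, the nginx directory, run.sh, and triton_serving.sh in place, the image can be built locally. The command below is a sketch; the image name and tag are illustrative.

      # Hypothetical build command; run it in the directory containing the Dockerfile.
      docker build -t tritonserver-modelarts:23.03 .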

      After the image is created, push it to Huawei Cloud SWR so that it can be used to deploy inference services on ModelArts.
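
      The commands below are a hedged sketch of pushing the image to SWR; {region}, {organization}, and the image name are placeholders, and the actual docker login command for your account should be obtained from the SWR console first.

      # Log in to SWR (login command from the SWR console), then tag and push the image.
      docker tag tritonserver-modelarts:23.03 swr.{region}.myhuaweicloud.com/{organization}/tritonserver-modelarts:23.03
      docker push swr.{region}.myhuaweicloud.com/{organization}/tritonserver-modelarts:23.03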

  4. Use the adapted image to deploy a real-time inference service on ModelArts.

    1. Create a model directory in OBS and upload the triton_serving.sh file and llama_7b folder to the model directory.
      Figure 2 Uploading files to the model directory
    2. Create an AI application. Set Meta Model Source to OBS and select the meta model from the model directory. Set AI Engine to Custom and Engine Package to the image created in 3.
      Figure 3 Creating an AI application
    3. Deploy the created AI application as a real-time service. Loading and starting a large model generally takes longer than a common model, so set Timeout to a sufficiently large value; otherwise, the timeout may expire before the model finishes starting and the deployment may fail.
      Figure 4 Deploying a real-time service
    4. Call the real-time service for foundation model inference. Set the request path to /v2/models/ensemble/infer. The following is an example request body (a sample curl invocation is shown after Figure 5):
      {
          "inputs": [
              {
                  "name": "text_input",
                  "shape": [1, 1],
                  "datatype": "BYTES",
                  "data": ["what is machine learning"]
              },
              {
                  "name": "max_tokens",
                  "shape": [1, 1],
                  "datatype": "UINT32",
                  "data": [64]
              },
              {
                  "name": "bad_words",
                  "shape": [1, 1],
                  "datatype": "BYTES",
                  "data": [""]
              },
              {
                  "name": "stop_words",
                  "shape": [1, 1],
                  "datatype": "BYTES",
                  "data": [""]
              },
              {
                  "name": "pad_id",
                  "shape": [1, 1],
                  "datatype": "UINT32",
                  "data": [2]
              },
              {
                  "name": "end_id",
                  "shape": [1, 1],
                  "datatype": "UINT32",
                  "data": [2]
              }
          ],
          "outputs": [
              {
                  "name": "text_output"
              }
          ]
      }
      NOTE:
      • In "inputs", the element with the "name" "text_input" represents the input, and its "data" field specifies a specific input statement. In this example, the input statement is "what is machine learning".
      • The element with the "name" "max_tokens" indicates the maximum number of output tokens. In this case, the value is 64.
      Figure 5 Calling a real-time service
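
      The command below is a hedged sketch of invoking the service with token authentication; {api_url} is a placeholder for the API URL shown on the real-time service details page, ${TOKEN} is an IAM user token, and request.json contains the JSON body shown above.

      # Hypothetical invocation of the deployed real-time service.
      # -k tolerates self-signed certificates and may be unnecessary for the public endpoint.
      curl -k -X POST "{api_url}/v2/models/ensemble/infer" \
           -H "Content-Type: application/json" \
           -H "X-Auth-Token: ${TOKEN}" \
           -d @request.json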
