Starting a Multimodal Model-powered Inference Service
What Is Multimodality?
Multimodality refers to the methods and techniques for integrating and processing two or more different types of information or data. Specifically, in the fields of machine learning and AI, multimodal data typically includes, but is not limited to, text, images, video, audio, and sensor data.
The primary goal of multimodality is to leverage information from multiple modalities to enhance task performance, deliver richer user experiences, or achieve more comprehensive data analysis. For example, in real-world applications, combining image and text data can lead to improved object recognition or sentiment analysis.
Furthermore, multimodality can be subdivided into the following areas:
- Multimodal understanding: How computers extract useful information from various types of data sources and synthesize it into meaningful knowledge.
- Vision models: These models are specifically designed for images and other visual data, helping computers better understand and interpret the visual world.
- Multimodal retrieval: Techniques that use multiple data modalities (such as text, images, video, and audio) for information retrieval, aiming to provide more accurate results by integrating different forms of data.
In summary, multimodality is not merely about simple feature fusion. It encompasses a broad theoretical foundation and practical applications. In this context, multimodality refers specifically to multimodal understanding.
Constraints
For details about models and their PUs, see Minimum Number of PUs and Maximum Sequence Length Supported by Each Model.
Setting Common Inference Environment Variables
This topic describes how to start a multimodal model-powered inference service, covering both offline inference and online inference.
Set the following common framework environment variables for both offline and online inference.
Common environment variables
export VLLM_IMAGE_FETCH_TIMEOUT=100
export VLLM_ENGINE_ITERATION_TIMEOUT_S=600
# Prioritize setting PYTORCH_NPU_ALLOC_CONF to expandable_segments:True.
# If an error related to virtual GPU memory is reported, set it to expandable_segments:False.
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export VLLM_PLUGINS=ascend_vllm
# When the model startup needs to support ultra-large input, VLLM_ALLOW_LONG_MAX_MODEL_LEN needs to be set.
# For example, qwen2.5-vl-72B and qwen2.5-vl-32B support 128K long sequences.
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
# This parameter needs to be set for offline inference.
export VLLM_WORKER_MULTIPROC_METHOD=spawn
Starting Offline Inference
Use the AscendCloud-LLM/llm_inference/ascend_vllm/vllm-gpu-0.9.0/examples/offline_inference/vision_language.py script to perform multimodal offline inference.
- Find the entry function of the corresponding model series (for example, qwen2.5-vl) and change the model weight location to your own path, as highlighted in Figure 1 below.
Figure 1 Modifying the model weight location
- Modify the model parameters passed to LLM in the entry function of the corresponding model series; a construction sketch follows the parameter list below.
Parameters:
- model: model address in Hugging Face format.
- max_num_seqs: maximum number of requests that can be processed concurrently.
- max_model_len: maximum number of input and output tokens during inference.
- max_num_batched_tokens: maximum number of tokens that can be used in the prefill phase. The value must be greater than or equal to the value of max_model_len. The value 4096 or 8192 is recommended.
- dtype: data type for model inference.
- tensor_parallel_size: number of PUs you want to use.
- block_size: block size of PagedAttention. The recommended value is 128.
- gpu_memory_utilization: fraction of device memory used by the NPU. The parameter name of the original vLLM (GPU) is reused. The default value is 0.9.
- trust_remote_code: Indicates whether to trust remote code.
- distributed_executor_backend="ray": Uses Ray for communication.
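The following is a minimal construction sketch, assuming the vLLM Python API; the weight path and parameter values are placeholders and must be replaced with your own.
from vllm import LLM

llm = LLM(
    model="/path/to/qwen2.5-vl-weights",   # placeholder: model weights in Hugging Face format
    max_num_seqs=4,                        # maximum number of concurrent requests
    max_model_len=4096,                    # maximum number of input and output tokens
    max_num_batched_tokens=4096,           # prefill limit; must be >= max_model_len
    dtype="bfloat16",                      # inference data type
    tensor_parallel_size=1,                # number of PUs
    block_size=128,                        # PagedAttention block size
    gpu_memory_utilization=0.9,            # fraction of device memory to use
    trust_remote_code=True,
    distributed_executor_backend="ray",    # use Ray for communication
)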
- Modify the sampling parameters in SamplingParams. For the available parameters, see Table 1; a construction sketch follows the table.
Figure 2 Setting parameters in SamplingParams
Table 1 Parameters

| Parameter | Mandatory | Default Value | Type | Description |
|---|---|---|---|---|
| max_tokens | No | 16 | Int | Maximum number of tokens to be generated for each output sequence. |
| top_k | No | -1 | Int | Determines how many of the highest-ranking tokens are considered. The value -1 indicates that all tokens are considered. Decreasing the value can reduce the sampling time. |
| top_p | No | 1.0 | Float | A floating-point number that controls the cumulative probability of the first several tokens to be considered. The value must be in the range (0, 1]. The value 1 indicates that all tokens are considered. |
| temperature | No | 1.0 | Float | A floating-point number that controls the randomness of sampling. Lower values make the model more deterministic, while higher values make it more random. The value 0 indicates greedy sampling. |
| stream | No | False | Bool | Controls whether to enable streaming inference. The default value is False, indicating that streaming inference is disabled. |
| ignore_eos | No | False | Bool | Indicates whether to ignore EOS and continue to generate tokens. |
| repetition_penalty | No | 1.0 | Float | Reduces the likelihood of generating repetitive text. |
| stop_token_ids | No | None | List of Int | List of token IDs that stop generation. This parameter must be passed for InternVL 2.5. For details, see stop_token_ids in the offline inference script AscendCloud-LLM/llm_inference/ascend_vllm/vllm-gpu-0.9.0/examples/offline_inference/vision_language.py. |
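The following is a minimal sketch of building SamplingParams with the fields from Table 1; the values are illustrative only, and stream is a request-level option rather than a SamplingParams field.
from vllm import SamplingParams

sampling_params = SamplingParams(
    max_tokens=512,            # maximum number of tokens generated per output sequence
    top_k=-1,                  # -1 means all tokens are considered
    top_p=1.0,                 # cumulative probability cutoff in (0, 1]
    temperature=0.7,           # 0 means greedy sampling
    ignore_eos=False,          # stop at EOS
    repetition_penalty=1.0,    # values > 1.0 penalize repetition
    stop_token_ids=None,       # required for InternVL 2.5; see the offline inference script
)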
- Specify the image path. To use a local image, add from PIL import Image at the beginning of the file and, in the get_multi_modal_input function, load the image with image = Image.open({image_path}).convert('RGB'), as shown in the sketch after Figure 3.
Figure 3 Specifying the image path
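A minimal sketch of the local-image change described in the step above; the image path is a placeholder.
from PIL import Image

# Inside get_multi_modal_input: load a local image and convert it to RGB.
image = Image.open("/path/to/local/image.jpg").convert("RGB")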
- Specify the input text.
Figure 4 Specifying the input text
- Start the inference script.
Commands for starting the Qwen2.5-VL series models
python vision_language.py --model-type qwen2_5_vl
Script parameters:
--model-type: model type. Currently, the options are internvl_chat and qwen2_5_vl.
--num-prompts: number of prompts entered at a time. The default value is 4.
--modality: input type. The options are image and video. The default value is image.
--num-frames: number of frames extracted from the video. The default value is 16.
Starting Online Inference
python -m vllm.entrypoints.openai.api_server --model ${container_model_path} \
--max-num-seqs=256 \
--max-model-len=4096 \
--max-num-batched-tokens=4096 \
--tensor-parallel-size=1 \
--block-size=128 \
--dtype ${dtype} \
--host=${docker_ip} \
--port=${port} \
--gpu-memory-utilization=0.9 \
--quantization=${quantization} \
--trust-remote-code \
--compilation_config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64]}' \
--enforce-eager
The following describes the parameters in the template for starting a multimodal inference service. For details about how to set other parameters, see basic parameters in Starting an LLM-powered Inference Service.
- VLLM_IMAGE_FETCH_TIMEOUT: timeout, in seconds, for downloading the input image.
- VLLM_ENGINE_ITERATION_TIMEOUT_S: maximum duration, in seconds, of a single engine iteration. Exceeding this duration results in a timeout error.
- PYTORCH_NPU_ALLOC_CONF=expandable_segments:True: Allows the allocator to initially create a segment and later expand its size when more memory is needed. Enabling this may improve model performance. Disable it if errors occur.
- VLLM_USE_V1=1: Indicates whether to enable the v1 framework of vLLM.
- VLLM_PLUGINS=ascend_vllm: Activates the ascend_vllm optimization plug-in through VLLM_PLUGINS.
- --model ${container_model_path}: model address in Hugging Face format, which stores the model's weight files. If quantization is used, the weights converted in Quantization are used. If the address for converting the trained model into the Hugging Face format is used, the original tokenizer file is required.
- --max-num-seqs: maximum number of concurrent requests that can be processed. Requests exceeding this limit will wait in the queue.
- --max-model-len: maximum number of input and output tokens during inference. The value of max-model-len must be less than that of seq_length in the config.json file. Otherwise, an error will be reported during inference and prediction. The config.json file is stored in the path corresponding to the model, for example, ${container_model_path}/chatglm3-6b/config.json. The maximum length varies among different models. For details, see Table 1.
- --max-num-batched-tokens: maximum number of tokens that can be used in the prefill phase. The value must be greater than or equal to the value of --max-model-len. The value 4096 or 8192 is recommended.
- --dtype: data type for model inference, which can be FP16 or BF16. float16 indicates FP16, and bfloat16 indicates BF16. If unspecified, the data type is auto-matched based on input data. Using different dtype may affect model precision. When using open-source weights, you are advised not to specify dtype and instead use the default dtype of the open-source weights.
- --tensor-parallel-size: number of PUs you want to use. The product of model parallelism and pipeline parallelism must be the same as the number of started NPUs. For details, see Table 1. The value 1 indicates that the service is started using a single PU.
- --block-size: block size of kv-cache. The recommended value is 128.
- --host=${docker_ip}: IP address for service deployment. Replace ${docker_ip} with the actual IP address of the host machine. The default value is None. For example, the parameter can be set to 0.0.0.0.
- --port: port where the service is deployed.
- --gpu-memory-utilization: ratio of the GPU memory used by the NPU. The input parameter name of the original vLLM is reused. The default value is 0.9.
- --trust-remote-code: Specifies whether to trust the remote code.
- --chat-template: (optional) chat building template.
- --quantization: quantization option. This parameter is optional. Currently, only awq is supported. If this parameter is not passed, the default value None is used, indicating that quantization is disabled.
- --compilation_config: compilation options of the model. This parameter is optional. For example, {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64]} is used to enable graph capture in different batch sizes.
- --enforce-eager: Starts the service in eager mode instead of ACL graph mode. This parameter is optional.
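Before sending multimodal requests, you can optionally confirm that the OpenAI-compatible service is reachable. The sketch below assumes the service was started with --host=0.0.0.0 and --port=8080; /v1/models is the standard model-listing endpoint of the OpenAI-compatible API.
import requests

# Placeholder host and port: use the values passed to --host and --port.
resp = requests.get("http://0.0.0.0:8080/v1/models")
print(resp.status_code, resp.json())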
Multimodal Inference Request
Send a request using online_serving.py (single-image, single-turn dialog).
Because multimodal inference involves image encoding and decoding, the service API is invoked through a script. The parameters to configure in the script are described in Table 2.
import base64
import requests
import argparse
import json
from typing import List

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def get_stop_token_ids(model_path):
    with open(f"{model_path}/config.json") as file:
        data = json.load(file)
    if data.get('architectures')[0] == "InternVLChatModel":
        return [0, 92543, 92542]
    return None

def post_img(args):
    # Path to your image
    image_path = args.image_path
    # Getting the base64 string
    image_base64 = encode_image(image_path)
    stop_token_ids = args.stop_token_ids if args.stop_token_ids is not None else get_stop_token_ids(args.model_path)
    headers = {
        "Content-Type": "application/json"
    }
    payload = {
        "model": args.model_path,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": args.text
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_base64}"
                        }
                    }
                ]
            }
        ],
        "max_tokens": args.max_tokens,
        "temperature": args.temperature,
        "ignore_eos": args.ignore_eos,
        "stream": args.stream,
        "top_k": args.top_k,
        "top_p": args.top_p,
        "stop_token_ids": stop_token_ids,
        "repetition_penalty": args.repetition_penalty,
    }
    response = requests.post(f"http://{args.docker_ip}:{args.served_port}/v1/chat/completions", headers=headers, json=payload)
    print(response.json())

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Mandatory
    parser.add_argument("--model-path", type=str, required=True)
    parser.add_argument("--image-path", type=str, required=True)
    parser.add_argument("--docker-ip", type=str, required=True)
    parser.add_argument("--served-port", type=str, required=True)
    parser.add_argument("--text", type=str, required=True)
    # Optional
    parser.add_argument("--temperature", type=float, default=0)  # Randomness of the output result. This parameter is optional.
    parser.add_argument("--ignore-eos", type=bool, default=False)  # Specifies whether to ignore the end symbol and continue to generate tokens after an EOS token is generated. This parameter is optional.
    parser.add_argument("--top-k", type=int, default=-1)  # Controls result diversity. A lower value makes text more unique but less coherent; a higher value improves coherence but reduces diversity. This parameter is optional.
    parser.add_argument("--top-p", type=float, default=1.0)  # The value ranges between 0 and 1. A smaller value produces more unexpected output but may sacrifice coherence; a larger value is more coherent but less novel. This parameter is optional.
    parser.add_argument("--stream", type=int, default=False)  # Controls whether to enable streaming inference. The default value is False, indicating that streaming inference is disabled. This parameter is optional.
    parser.add_argument("--max-tokens", type=int, default=16)  # Maximum length of the generated sequence. This parameter is optional.
    parser.add_argument("--repetition-penalty", type=float, default=1.0)  # Reduces the likelihood of generating repetitive text. This parameter is optional.
    parser.add_argument("--stop-token-ids", nargs='+', type=int, default=None)  # List of token IDs that stop generation. This parameter is optional.
    args = parser.parse_args()
    post_img(args)
python online_serving.py --model-path ${container_model_path} --image-path ${image_path} --docker-ip ${docker_ip} --served-port ${port} --text "What is the image content?"
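The script prints the raw JSON response. If you only need the generated text, the non-streaming response follows the OpenAI chat completions schema, so a helper such as the hypothetical extract_answer below (not part of the shipped script) can be added to online_serving.py:
def extract_answer(result: dict) -> str:
    # `result` is the parsed JSON returned by the service, i.e. response.json() in post_img().
    # Field names follow the OpenAI chat completions schema for non-streaming requests.
    return result["choices"][0]["message"]["content"]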
For details about request parameters, see Multimodal Request Parameters.
Multimodal Request Parameters
Table 2 Script parameters

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| container_model_path | Yes | str | Model weight file path. |
| image_path | Yes | str | Path of the image passed to the model for inference. |
| docker_ip | Yes | str | IP address of the host where the multimodal OpenAI service is started. |
| served_port | Yes | str | Port number for starting the multimodal OpenAI service. |
| text | Yes | str | Prompt passed to the model for inference. |
Table 3 Request parameters

| Parameter | Mandatory | Default Value | Type | Description |
|---|---|---|---|---|
| model | Yes | None | Str | Mandatory for an inference request when the service is started through the OpenAI service API. The value must be the same as the ${container_model_path} value passed to --model when the inference service was started. |
| messages | Yes | - | Dict | Input question and image of the request. role: the message sender, which can only be user here. content: the message content, whose type is a list. For a single-image, single-turn dialog, content must contain two elements: the first element has its type field set to text and carries the input question string in the text field; the second element has its type field set to image_url and carries the Base64 encoding of the input image in the image_url field. |
| max_tokens | No | 16 | Int | Maximum number of tokens to be generated for each output sequence. |
| top_k | No | -1 | Int | Determines how many of the highest-ranking tokens are considered. The value -1 indicates that all tokens are considered. Decreasing the value can reduce the sampling time. |
| top_p | No | 1.0 | Float | A floating-point number that controls the cumulative probability of the first several tokens to be considered. The value must be in the range (0, 1]. The value 1 indicates that all tokens are considered. |
| temperature | No | 1.0 | Float | A floating-point number that controls the randomness of sampling. Lower values make the model more deterministic, while higher values make it more random. The value 0 indicates greedy sampling. |
| stream | No | False | Bool | Controls whether to enable streaming inference. The default value is False, indicating that streaming inference is disabled. |
| ignore_eos | No | False | Bool | Indicates whether to ignore EOS and continue to generate tokens. |
| repetition_penalty | No | 1.0 | Float | Reduces the likelihood of generating repetitive text. |
| stop_token_ids | No | None | List of Int | List of token IDs that stop generation. This parameter must be passed for InternVL2 and MiniCPM. For details, see stop_token_ids in the offline inference script examples/offline_inference_vision_language.py. |