Starting a Multimodal Model-powered Inference Service
What Is Multimodality?
Multimodality refers to the methods and techniques for integrating and processing two or more different types of information or data. Specifically, in the fields of machine learning and AI, multimodal data typically includes, but is not limited to, text, images, video, audio, and sensor data.
The primary goal of multimodality is to leverage information from multiple modalities to enhance task performance, deliver richer user experiences, or achieve more comprehensive data analysis. For example, in real-world applications, combining image and text data can lead to improved object recognition or sentiment analysis.
Furthermore, multimodality can be subdivided into the following areas:
- Multimodal understanding: How computers extract useful information from various types of data sources and synthesize it into meaningful knowledge.
- Vision models: These models are specifically designed for images and other visual data, helping computers better understand and interpret the visual world.
- Multimodal retrieval: Techniques that use multiple data modalities (such as text, images, video, and audio) for information retrieval, aiming to provide more accurate results by integrating different forms of data.
In summary, multimodality is not merely about simple feature fusion. It encompasses a broad theoretical foundation and practical applications. In this context, multimodality refers specifically to multimodal understanding.
Constraints
For details about models and their PUs, see Minimum Number of PUs and Maximum Sequence Length Supported by Each Model.
Setting Common Inference Environment Variables
This topic describes how to start a multimodal model-powered inference service, covering both offline inference and online inference.
Set the following common framework environment variables for both offline and online inference.
Common environment variables
export VLLM_IMAGE_FETCH_TIMEOUT=100
export VLLM_ENGINE_ITERATION_TIMEOUT_S=600
# Prioritize setting PYTORCH_NPU_ALLOC_CONF to expandable_segments:True.
# If an error related to virtual GPU memory is reported, set it to expandable_segments:False.
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export VLLM_USE_V1=1
export VLLM_PLUGINS=ascend_vllm
# When the model startup needs to support ultra-large input, VLLM_ALLOW_LONG_MAX_MODEL_LEN needs to be set.
# For example, qwen2.5-vl-72B and qwen2.5-vl-32B support 128K long sequences.
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
# This parameter needs to be set for offline inference.
export VLLM_WORKER_MULTIPROC_METHOD=spawn
Starting Offline Inference
Use the AscendCloud-LLM/llm_inference/ascend_vllm/vllm-gpu-0.9.0/examples/offline_inference/vision_language.py script to perform multimodal offline inference.
- Find the entry function of the corresponding model series (for example, qwen2.5-vl) and change the model weight location to your own path, as highlighted in Figure 1 below.
Figure 1 Modifying the model weight location
- Modify the model parameters passed to LLM in the entry function of the corresponding model series; a construction sketch follows the parameter list below.
Parameters:
- model: model address in Hugging Face format.
- max_num_seqs: maximum number of requests that can be processed concurrently.
- max_model_len: maximum number of input and output tokens during inference.
- max_num_batched_tokens: maximum number of tokens that can be used in the prefill phase. The value must be greater than or equal to the value of max_model_len. The value 4096 or 8192 is recommended.
- dtype: data type for model inference.
- tensor_parallel_size: number of PUs you want to use.
- block_size: block size of PagedAttention. The recommended value is 128.
- gpu_memory_utilization: fraction of device memory used by the NPU. The parameter name of the original vLLM (GPU) is reused. The default value is 0.9.
- trust_remote_code: Indicates whether to trust remote code.
- distributed_executor_backend="ray": Uses Ray for communication.
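The following is a minimal construction sketch, assuming the vLLM Python API; the weight path and parameter values are placeholders and must be replaced with your own.
from vllm import LLM

llm = LLM(
    model="/path/to/qwen2.5-vl-weights",   # placeholder: model weights in Hugging Face format
    max_num_seqs=4,                        # maximum number of concurrent requests
    max_model_len=4096,                    # maximum number of input and output tokens
    max_num_batched_tokens=4096,           # prefill limit; must be >= max_model_len
    dtype="bfloat16",                      # inference data type
    tensor_parallel_size=1,                # number of PUs
    block_size=128,                        # PagedAttention block size
    gpu_memory_utilization=0.9,            # fraction of device memory to use
    trust_remote_code=True,
    distributed_executor_backend="ray",    # use Ray for communication
)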
- Modify the sampling parameters in SamplingParams. For the available parameters, see Table 1; a construction sketch follows the table.
Figure 2 Setting parameters in SamplingParams
Table 1 Parameters

| Parameter | Mandatory | Default Value | Type | Description |
|---|---|---|---|---|
| max_tokens | No | 16 | Int | Maximum number of tokens to be generated for each output sequence. |
| top_k | No | -1 | Int | Determines how many of the highest-ranking tokens are considered. The value -1 indicates that all tokens are considered. Decreasing the value can reduce the sampling time. |
| top_p | No | 1.0 | Float | A floating-point number that controls the cumulative probability of the first several tokens to be considered. The value must be in the range (0, 1]. The value 1 indicates that all tokens are considered. |
| temperature | No | 1.0 | Float | A floating-point number that controls the randomness of sampling. Lower values make the model more deterministic, while higher values make it more random. The value 0 indicates greedy sampling. |
| stream | No | False | Bool | Controls whether to enable streaming inference. The default value is False, indicating that streaming inference is disabled. |
| ignore_eos | No | False | Bool | Indicates whether to ignore EOS and continue to generate tokens. |
| repetition_penalty | No | 1.0 | Float | Reduces the likelihood of generating repetitive text. |
| stop_token_ids | No | None | List of Int | List of token IDs that stop generation. This parameter must be passed for InternVL 2.5. For details, see stop_token_ids in the offline inference script AscendCloud-LLM/llm_inference/ascend_vllm/vllm-gpu-0.9.0/examples/offline_inference/vision_language.py. |
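The following is a minimal sketch of building SamplingParams with the fields from Table 1; the values are illustrative only, and stream is a request-level option rather than a SamplingParams field.
from vllm import SamplingParams

sampling_params = SamplingParams(
    max_tokens=512,            # maximum number of tokens generated per output sequence
    top_k=-1,                  # -1 means all tokens are considered
    top_p=1.0,                 # cumulative probability cutoff in (0, 1]
    temperature=0.7,           # 0 means greedy sampling
    ignore_eos=False,          # stop at EOS
    repetition_penalty=1.0,    # values > 1.0 penalize repetition
    stop_token_ids=None,       # required for InternVL 2.5; see the offline inference script
)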
- Specify the image path. To use a local image, add from PIL import Image at the beginning of the file and, in the get_multi_modal_input function, load the image with image = Image.open({image_path}).convert('RGB'), as shown in the sketch after Figure 3.
Figure 3 Specifying the image path
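A minimal sketch of the local-image change described in the step above; the image path is a placeholder.
from PIL import Image

# Inside get_multi_modal_input: load a local image and convert it to RGB.
image = Image.open("/path/to/local/image.jpg").convert("RGB")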
- Specify the input text.
Figure 4 Specifying the input text
- Start the inference script.
Commands for starting the Qwen2.5-VL series models
python vision_language.py --model-type qwen2_5_vl
Script parameters:
--model-type: model type. Currently, the options are internvl_chat and qwen2_5_vl.
--num-prompts: number of prompts entered at a time. The default value is 4.
--modality: input type. The options are image and video. The default value is image.
--num-frames: number of frames extracted from the video. The default value is 16.
Starting Online Inference
python -m vllm.entrypoints.openai.api_server --model ${container_model_path} \
--max-num-seqs=256 \
--max-model-len=4096 \
--max-num-batched-tokens=4096 \
--tensor-parallel-size=1 \
--block-size=128 \
--dtype ${dtype} \
--host=${docker_ip} \
--port=${port} \
--gpu-memory-utilization=0.9 \
--quantization=${quantization} \
--trust-remote-code \
--compilation_config '{"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64]}' \
--enforce-eager
The following describes the parameters in the template for starting a multimodal inference service. For details about how to set other parameters, see basic parameters in Starting an LLM-powered Inference Service.
- VLLM_IMAGE_FETCH_TIMEOUT: timeout, in seconds, for downloading the input image.
- VLLM_ENGINE_ITERATION_TIMEOUT_S: maximum duration, in seconds, of a single engine iteration. Exceeding this duration results in a timeout error.
- PYTORCH_NPU_ALLOC_CONF=expandable_segments:True: Allows the allocator to initially create a segment and later expand its size when more memory is needed. Enabling this may improve model performance. Disable it if errors occur.
- VLLM_USE_V1=1: Indicates whether to enable the v1 framework of vLLM.
- VLLM_PLUGINS=ascend_vllm: Activates the ascend_vllm optimization plug-in through VLLM_PLUGINS.
- --model ${container_model_path}: model address in Hugging Face format, which stores the model's weight files. If quantization is used, the weights converted in Quantization are used. If the address for converting the trained model into the Hugging Face format is used, the original tokenizer file is required.
- --max-num-seqs: maximum number of concurrent requests that can be processed. Requests exceeding this limit will wait in the queue.
- --max-model-len: maximum number of input and output tokens during inference. The value of max-model-len must be less than that of seq_length in the config.json file. Otherwise, an error will be reported during inference and prediction. The config.json file is stored in the path corresponding to the model, for example, ${container_model_path}/chatglm3-6b/config.json. The maximum length varies among different models. For details, see Table 1.
- --max-num-batched-tokens: maximum number of tokens that can be used in the prefill phase. The value must be greater than or equal to the value of --max-model-len. The value 4096 or 8192 is recommended.
- --dtype: data type for model inference, which can be FP16 or BF16. float16 indicates FP16, and bfloat16 indicates BF16. If unspecified, the data type is auto-matched based on input data. Using different dtype may affect model precision. When using open-source weights, you are advised not to specify dtype and instead use the default dtype of the open-source weights.
- --tensor-parallel-size: number of PUs you want to use. The product of model parallelism and pipeline parallelism must be the same as the number of started NPUs. For details, see Table 1. The value 1 indicates that the service is started using a single PU.
- --block-size: block size of kv-cache. The recommended value is 128.
- --host=${docker_ip}: IP address for service deployment. Replace ${docker_ip} with the actual IP address of the host machine. The default value is None. For example, the parameter can be set to 0.0.0.0.
- --port: port where the service is deployed.
- --gpu-memory-utilization: ratio of the GPU memory used by the NPU. The input parameter name of the original vLLM is reused. The default value is 0.9.
- --trust-remote-code: Specifies whether to trust the remote code.
- --chat-template: (optional) chat building template.
- --quantization: quantization option. This parameter is optional. Currently, only awq is supported. If this parameter is not passed, the default value None is used, indicating that quantization is disabled.
- --compilation_config: compilation options of the model. This parameter is optional. For example, {"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 64]} is used to enable graph capture in different batch sizes.
- --enforce-eager: Starts the service in eager mode instead of ACL graph mode. This parameter is optional.
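Before sending multimodal requests, you can optionally confirm that the OpenAI-compatible service is reachable. The sketch below assumes the service was started with --host=0.0.0.0 and --port=8080; /v1/models is the standard model-listing endpoint of the OpenAI-compatible API.
import requests

# Placeholder host and port: use the values passed to --host and --port.
resp = requests.get("http://0.0.0.0:8080/v1/models")
print(resp.status_code, resp.json())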
Multimodal Inference Request
Send a request using online_serving.py (single-image, single-turn dialog).
Because multimodal inference involves image encoding and decoding, the service API is invoked through a script. The parameters to configure in the script are described in Table 2.
import base64
import requests
import argparse
import json
from typing import List

# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

def get_stop_token_ids(model_path):
    with open(f"{model_path}/config.json") as file:
        data = json.load(file)
    if data.get('architectures')[0] == "InternVLChatModel":
        return [0, 92543, 92542]
    return None

def post_img(args):
    # Path to your image
    image_path = args.image_path
    # Getting the base64 string
    image_base64 = encode_image(image_path)
    stop_token_ids = args.stop_token_ids if args.stop_token_ids is not None else get_stop_token_ids(args.model_path)
    headers = {
        "Content-Type": "application/json"
    }
    payload = {
        "model": args.model_path,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": args.text
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{image_base64}"
                        }
                    }
                ]
            }
        ],
        "max_tokens": args.max_tokens,
        "temperature": args.temperature,
        "ignore_eos": args.ignore_eos,
        "stream": args.stream,
        "top_k": args.top_k,
        "top_p": args.top_p,
        "stop_token_ids": stop_token_ids,
        "repetition_penalty": args.repetition_penalty,
    }
    response = requests.post(f"http://{args.docker_ip}:{args.served_port}/v1/chat/completions", headers=headers, json=payload)
    print(response.json())

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Mandatory
    parser.add_argument("--model-path", type=str, required=True)
    parser.add_argument("--image-path", type=str, required=True)
    parser.add_argument("--docker-ip", type=str, required=True)
    parser.add_argument("--served-port", type=str, required=True)
    parser.add_argument("--text", type=str, required=True)
    # Optional
    parser.add_argument("--temperature", type=float, default=0)  # Randomness of the output result. This parameter is optional.
    parser.add_argument("--ignore-eos", type=bool, default=False)  # Specifies whether to ignore the end symbol and continue to generate tokens after an EOS token is generated. This parameter is optional.
    parser.add_argument("--top-k", type=int, default=-1)  # Controls result diversity. A lower value makes text more unique but less coherent; a higher value improves coherence but reduces diversity. This parameter is optional.
    parser.add_argument("--top-p", type=float, default=1.0)  # The value ranges between 0 and 1. A smaller value produces more unexpected output but may sacrifice coherence; a larger value is more coherent but less novel. This parameter is optional.
    parser.add_argument("--stream", type=int, default=False)  # Controls whether to enable streaming inference. The default value is False, indicating that streaming inference is disabled. This parameter is optional.
    parser.add_argument("--max-tokens", type=int, default=16)  # Maximum length of the generated sequence. This parameter is optional.
    parser.add_argument("--repetition-penalty", type=float, default=1.0)  # Reduces the likelihood of generating repetitive text. This parameter is optional.
    parser.add_argument("--stop-token-ids", nargs='+', type=int, default=None)  # List of token IDs that stop generation. This parameter is optional.
    args = parser.parse_args()
    post_img(args)
python online_serving.py --model-path ${container_model_path} --image-path ${image_path} --docker-ip ${docker_ip} --served-port ${port} --text "What is the image content?"
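The script prints the raw JSON response. If you only need the generated text, the non-streaming response follows the OpenAI chat completions schema, so a helper such as the hypothetical extract_answer below (not part of the shipped script) can be added to online_serving.py:
def extract_answer(result: dict) -> str:
    # `result` is the parsed JSON returned by the service, i.e. response.json() in post_img().
    # Field names follow the OpenAI chat completions schema for non-streaming requests.
    return result["choices"][0]["message"]["content"]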
For details about request parameters, see Multimodal Request Parameters.
Multimodal Request Parameters
Table 2 Script parameters

| Parameter | Mandatory | Type | Description |
|---|---|---|---|
| container_model_path | Yes | str | Model weight file path. |
| image_path | Yes | str | Path of the image passed to the model for inference. |
| docker_ip | Yes | str | IP address of the host where the multimodal OpenAI service is started. |
| served_port | Yes | str | Port number for starting the multimodal OpenAI service. |
| text | Yes | str | Prompt passed to the model for inference. |
Table 3 Request parameters

| Parameter | Mandatory | Default Value | Type | Description |
|---|---|---|---|---|
| model | Yes | None | Str | Mandatory for an inference request when the service is started through the OpenAI service API. The value must be the same as the ${container_model_path} value passed to --model when the inference service was started. |
| messages | Yes | - | Dict | Input question and image of the request. role: the message sender, which can only be user here. content: the message content, whose type is a list. For a single-image, single-turn dialog, content must contain two elements: the first element has its type field set to text and carries the input question string in the text field; the second element has its type field set to image_url and carries the Base64 encoding of the input image in the image_url field. |
| max_tokens | No | 16 | Int | Maximum number of tokens to be generated for each output sequence. |
| top_k | No | -1 | Int | Determines how many of the highest-ranking tokens are considered. The value -1 indicates that all tokens are considered. Decreasing the value can reduce the sampling time. |
| top_p | No | 1.0 | Float | A floating-point number that controls the cumulative probability of the first several tokens to be considered. The value must be in the range (0, 1]. The value 1 indicates that all tokens are considered. |
| temperature | No | 1.0 | Float | A floating-point number that controls the randomness of sampling. Lower values make the model more deterministic, while higher values make it more random. The value 0 indicates greedy sampling. |
| stream | No | False | Bool | Controls whether to enable streaming inference. The default value is False, indicating that streaming inference is disabled. |
| ignore_eos | No | False | Bool | Indicates whether to ignore EOS and continue to generate tokens. |
| repetition_penalty | No | 1.0 | Float | Reduces the likelihood of generating repetitive text. |
| stop_token_ids | No | None | List of Int | List of token IDs that stop generation. This parameter must be passed for InternVL2 and MiniCPM. For details, see stop_token_ids in the offline inference script examples/offline_inference_vision_language.py. |