Calling a Model Service in ModelArts Studio (MaaS)

Model services deployed in ModelArts Studio can be called in other service environments. This section describes how to call your deployed model from My Services. You can also call built-in commercial services or endpoints.

Operation Scenarios

When developing AI applications, developers need to deploy trained models into real-world services. This often involves manually setting up environments, handling dependencies, and writing deployment scripts. This approach takes significant time, risks errors, and leads to challenges like complicated setups, hard migrations, costly maintenance, and awkward updates.

MaaS offers a one-stop solution with unified APIs for system integration, along with built-in monitoring and logging features for efficient O&M.

Billing

Model inference services process input by converting it into identifiable tokens. When you use a built-in MaaS service, you are charged based on the number of tokens. For details, see Billing.

Prerequisites

Using a built-in service: You have subscribed to a commercial service (the payment status is Subscribed) in the Built-in Services tab on the Real-Time Inference page. For details, see ModelArts Studio (MaaS) Real-Time Inference Services.
Using a custom service: In the My Services tab of the Real-Time Inference page, there is a model service in the Running, Updating, or Upgrading state. For details, see Deploying a Model Service in ModelArts Studio (MaaS).
Using an endpoint: You have created an endpoint. For details, see Creating an Endpoint on ModelArts Studio (MaaS).

Step 1: Obtaining an API Key

When calling a model service deployed in MaaS, you need to provide an API key for API authentication. You can create up to 30 keys. Each key is displayed only once, right after creation, so store it securely. A lost key cannot be retrieved; create a new API key instead. For more information, see Managing API Keys in ModelArts Studio (MaaS).

Log in to the ModelArts Studio (MaaS) console and select the target region on the top navigation bar.
In the navigation pane, choose API Keys.

On the API Keys page, click Create API Key, enter the tag and description, and click OK.

The tag and description cannot be modified after the key is created.

**Table 1** Parameters
Parameter	Description
Tag	Tag of the API key. The tag must be unique. The tag can contain 1 to 100 characters. Only letters, digits, underscores (_), and hyphens (-) are allowed.
Description	Description of the custom API key. The value can contain 1 to 100 characters.

In the Your Key dialog box, copy the key and store it securely.
After the key is saved, click Close.
After you click Close, the key cannot be viewed again.
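
Because the key is shown only once, one common pattern is to keep it out of source code and read it from the environment at run time. The following is a minimal sketch under that assumption. The variable name MAAS_API_KEY and the helpers load_api_key and auth_headers are hypothetical names chosen for illustration; they are not part of MaaS.

```python
import os


def load_api_key(env=None, var_name="MAAS_API_KEY"):
    """Return the API key stored in an environment variable.

    MAAS_API_KEY is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set {var_name} to the API key you copied at creation time.")
    return key


def auth_headers(api_key):
    """Build the headers for a JSON request authenticated with a Bearer API key."""
    return {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
```

Calling auth_headers(load_api_key()) would then yield the headers for a request without the key ever appearing in the script.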

Step 2: Calling a Model Service for Prediction

In the left navigation pane of the ModelArts Studio (MaaS) console, choose Real-Time Inference.
On the Real-Time Inference page, click the My Services tab. Choose More > View Call Description in the Operation column of the target service.
In the Disable Content Moderation dialog box, select whether to enable content moderation (enabled by default).
- While content moderation can block non-compliant content before it is processed during real-time inference, it can also slow down API performance significantly.
- If content moderation is disabled, the inputs and outputs during real-time inference will not be checked, which may cause violation risks.
  To disable content moderation, switch off the button. In the displayed dialog box, select I have read and agree to the statement and click OK.

On the displayed page, select an API type, copy the example call, modify the API information and API key, and use the information to call the model service API in the service environment.

The following shows sample code for the REST API and the OpenAI SDK.

Sample REST API code:

Example call using Python

import requests
import json

if __name__ == '__main__':
    url = "https://example.com/v1/infers/937cabe5-d673-47f1-9e7c-2b4de06*****/v1/chat/completions"
    api_key = "<your_apiKey>"  # Replace <your_apiKey> with the obtained API key.

    # Send a request.
    headers = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {api_key}'
    }
    data = {
        "model": "******",  # Model name for calling
        "max_tokens": 1024,  # Maximum number of output tokens.
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "hello"}
        ],
        # Controls whether to enable streaming inference. The default value is False, indicating that streaming inference is disabled.
        "stream": False,
        # Controls whether to show the number of tokens used during streaming output. This parameter is valid only when stream is set to True.
        # "stream_options": {"include_usage": True},
        # A floating-point number that controls the sampling randomness. Smaller values make the model more deterministic, while larger values make it more creative. The value 0 indicates greedy sampling. The default value is 0.6.
        "temperature": 0.6
    }
    # verify=False skips TLS certificate verification.
    response = requests.post(url, headers=headers, data=json.dumps(data), verify=False)
    # Print result.
    print(response.status_code)
    print(response.text)

Example call using cURL

curl -X POST "https://example.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{ 
    "model": "DeepSeek-R1",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello"}
    ],
    "stream": true,
    "stream_options": { "include_usage": true },
    "temperature": 0.6
  }'

Example call using the OpenAI SDK

# Install the environment.
pip install --upgrade "openai>=1.0"

# Example call using the OpenAI SDK
from openai import OpenAI

if __name__ == '__main__':
    base_url = "https://example.com/v1/infers/937cabe5-d673-47f1-9e7c-2b4de06******/v1"
    api_key = "<your_apiKey>"  # Replace <your_apiKey> with the obtained API key.

    client = OpenAI(api_key=api_key, base_url=base_url)

    response = client.chat.completions.create(
        model="******",  # Model name for calling
        messages=[
            {"role": "system", "content": "You are a helpful assistant"},
            {"role": "user", "content": "Hello"},
        ],
        max_tokens=1024,
        temperature=0.6,
        stream=False
    )
    # Print result.
    print(response.choices[0].message.content)

The model service API is the same as that of vLLM. Table 2 only covers key parameters. For more information, see the vLLM official website. When enabling stream output for a model using the 909 image, you need to add the stream_options parameter with the value {"include_usage": true} to print the number of tokens used.

**Table 2** Request parameters
Parameter	Mandatory	Default Value	Type	Description
url	Yes	None	Str	API URL. Assume that the URL is https://example.com/v1/infers/937cabe5-d673-47f1-9e7c-2b4de06*/{endpoint}. The {endpoint} only supports the following APIs. For details, see API Calling. /v1/chat/completions /v1/models
model	Yes	None	Str	Model name for calling. To obtain the model name, go to the Real-Time Inference page of ModelArts Studio, and choose More > Call in the Operation column; from there, you can see the model name.
messages	Yes	N/A	Array	Input question of the request.
messages.role	Yes	None	Str	Different roles correspond to different message types. system: developer-entered instructions like response formats and roles for the model to follow. user: user-entered messages including prompts and context information. assistant: responses generated by the model. tool: information returned by the tool when the model calls it.
messages.content	Yes	None	Str	When role is set to system, this parameter indicates the AI model's personality. {"role": "system","content": "You are a helpful AI assistant."} When role is set to user, this parameter indicates the question asked by the user. {"role": "user","content": "Which number is larger, 9.11 or 9.8?"} When role is set to assistant, this parameter indicates the content output by the AI model. {"role": "assistant","content": "9.8 is larger than 9.11."} When role is set to tool, this parameter indicates the responses returned by the tool when the model calls it. {"role": "tool", "content": "The weather in Shanghai is sunny today. The temperature is 10°C."}
stream_options	No	None	Object	Controls whether to show the number of tokens used during streaming output. This parameter is valid only when stream is set to True. You need to set stream_options to {"include_usage": true} to print the number of tokens used.
max_tokens	No	16	Int	Maximum number of tokens that can be generated for the current task, including tokens generated by the model and reasoning tokens for deep thinking.
top_k	No	-1	Int	The candidate set size determines the sampling range during generation. For example, setting it to 50 means only the top 50 scoring tokens are sampled at each step. A larger size increases randomness; a smaller one makes the output more predictable.
top_p	No	1.0	Float	Nucleus sampling. It keeps only the words with combined probabilities above the threshold p and removes the rest. These selected words are then normalized and sampled again. Lower settings reduce word options, making outputs focused and cautious. Higher settings expand word choices, creating varied and creative outputs. Adjust either temperature or top_p separately for best results, not both at once. Value range: 0 to 1. The value 1 indicates that all tokens are considered.
temperature	No	0.6	Float	Model sampling temperature. The higher the value, the more random the model output; the lower the value, the more deterministic the output. Adjust either temperature or top_p separately for best results, not both at once. Recommended value of temperature: 0.6 for DeepSeek-R1, DeepSeek-V3, and Qwen3 series, and 0.2 for Qwen2.5-VL series.
stop	No	None	None/Str/List	A list of strings used to stop generation. The output does not contain the stop strings. For example, if the value is set to ["You", "Good"], text generation will stop once either You or Good is reached.
stream	No	False	Bool	Controls whether to enable streaming inference. The default value is False, indicating that streaming inference is disabled.
n	No	1	Int	Number of responses generated for each input message. If beam_search is not used, the recommended value range of n is 1 ≤ n ≤ 10. If n is greater than 1, ensure that greedy_sample is not used for sampling, that is, top_k is greater than 1 and temperature is greater than 0. If beam_search is used, the recommended value range of n is 1 < n ≤ 10. If n is 1, the inference request will fail. NOTE: For optimal performance, keep n at 10 or below. Large values of n can significantly slow down processing. Insufficient GPU memory may cause inference requests to fail.
use_beam_search	No	False	Bool	Controls whether to use beam_search to replace sampling. When this parameter is used, the following parameters must be configured as required: n: > 1 top_p: 1.0 top_k: -1 temperature: 0.0
presence_penalty	No	0.0	Float	Applies rewards or penalties based on the presence of new words in the generated text. The value range is [-2.0,2.0].
frequency_penalty	No	0.0	Float	Applies rewards or penalties based on the frequency of each word in the generated text. The value range is [-2.0,2.0].
length_penalty	No	1.0	Float	Imposes a larger penalty on longer sequences in a beam search process. When this parameter is used, the following parameters must be configured as required: top_k: -1 use_beam_search: true best_of: > 1
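
Several parameters in Table 2 constrain one another, so it can help to check a request body locally before sending it. The following is a minimal sketch that encodes only the ranges and beam search requirements stated in the table. The function name check_request_params is hypothetical, and the service itself remains the authority on what it accepts.

```python
def check_request_params(params):
    """Return a list of problems found in a request body (empty if none).

    Only constraints stated in Table 2 are checked.
    """
    problems = []
    # Defaults are the ones listed in Table 2.
    n = params.get("n", 1)
    top_p = params.get("top_p", 1.0)
    top_k = params.get("top_k", -1)
    temperature = params.get("temperature", 0.6)

    if not 0 <= top_p <= 1:
        problems.append("top_p must be between 0 and 1")
    for name in ("presence_penalty", "frequency_penalty"):
        if not -2.0 <= params.get(name, 0.0) <= 2.0:
            problems.append(f"{name} must be in [-2.0, 2.0]")
    if "stream_options" in params and not params.get("stream", False):
        problems.append("stream_options is valid only when stream is true")

    # Beam search replaces sampling and requires fixed sampling settings.
    if params.get("use_beam_search", False):
        if n <= 1:
            problems.append("beam search requires n > 1")
        if top_p != 1.0:
            problems.append("beam search requires top_p 1.0")
        if top_k != -1:
            problems.append("beam search requires top_k -1")
        if temperature != 0.0:
            problems.append("beam search requires temperature 0.0")
    return problems


print(check_request_params({"use_beam_search": True, "n": 4}))
```

In this sketch, a request that enables use_beam_search but omits temperature is flagged, because the default temperature of 0.6 does not meet the beam search requirement.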

The following is a sample response returned for calls made with the requests package, the OpenAI SDK, or cURL.

{
    "id": "cmpl-29f7a172056541449eb1f9d31c*****",
    "object": "chat.completion",
    "created": 17231*****,
    "model": "******",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Hello. I'm glad to help. Is there anything I can help with?"
            },
            "logprobs": null,
            "finish_reason": "stop",
            "stop_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 20,
        "total_tokens": 38,
        "completion_tokens": 18
    }
}

The following sample shows how to read the response of a chain-of-thought model:

messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
response = client.chat.completions.create(model=model, messages=messages)
reasoning_content = response.choices[0].message.reasoning_content
content = response.choices[0].message.content
print("reasoning_content:", reasoning_content)
print("content:", content)

**Table 3** Response parameters
Parameter	Type	Description
id	Str	Request ID.
object	Str	Request task.
created	Int	Timestamp when the request is created.
model	Str	Model to call.
choices	Array	Model-generated content.
usage	Object	Request input length, output length, and total length. prompt_tokens: number of input tokens. completion_tokens: number of output tokens. total_tokens: total number of tokens. Total number of tokens = Number of input tokens + Number of output tokens
reasoning_content	Str	Model's thought process when it supports a chain of thought. For models that support a chain of thought, when streaming output is enabled, the reasoning appears in the reasoning_content field first, followed by the answer in the content field.
content	Str	Response of the model.
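
To illustrate how the fields in Table 3 fit together, the following minimal sketch parses a trimmed copy of the sample response with the standard json module and checks that total_tokens equals prompt_tokens plus completion_tokens. The function name summarize is hypothetical, and masked fields from the sample (such as id and created) are omitted.

```python
import json

# Trimmed copy of the sample response; masked fields are omitted.
SAMPLE = """
{
  "object": "chat.completion",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Hello. I'm glad to help."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 20, "total_tokens": 38, "completion_tokens": 18}
}
"""


def summarize(response_text):
    """Extract the answer, optional reasoning, and token usage from a response."""
    body = json.loads(response_text)
    choice = body["choices"][0]
    message = choice["message"]
    usage = body["usage"]
    return {
        "content": message["content"],
        # Present only for models that support a chain of thought.
        "reasoning_content": message.get("reasoning_content"),
        "finish_reason": choice.get("finish_reason"),
        "usage_consistent": usage["total_tokens"]
        == usage["prompt_tokens"] + usage["completion_tokens"],
    }


result = summarize(SAMPLE)
print(result["content"], result["usage_consistent"])
```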

If the call fails, adjust the script or runtime environment based on the error code.

**Table 4** Common error codes
Error Code	Content	Description
400	Bad Request	The request contains syntax errors.
403	Forbidden	The server refused the request.
404	Not Found	The server cannot find the requested resource.
500	Internal Server Error	Internal service error.

Content Moderation Description

Streaming Request

When content moderation is triggered, you receive error code 403. Identify the specific fault using error code ModelArts.81011. Returned information:
```
{
    "error_code": "ModelArts.81011",
    "error_msg": "May contain sensitive information, please try again."
}
```
Figure 1 Error message example
If content moderation is not triggered, the call returns status code 200 (for example, when the API is called from Postman).
Figure 2 Example successful response

If the output contains sensitive information, the following data will be appended to the output stream:

data: {"id":"chatcmpl-*********************","object":"chat.completion","created":1678067605,"model":"******","choices":[{"delta":{"content":"This is the start of the streaming response."},"index":0}]}
data: {"id":"chatcmpl-*********************","object":"chat.completion","created":1678067605,"model":"******","choices":[{"delta":{"content":"Continue outputting the result."},"index":0}]}
data: {"id":"chatcmpl-*********************","object":"chat.completion","created":1678067605,"model":"******","choices":[{"finish_reason":"content_filter","index":0}]}
data: [DONE]

After content moderation is triggered, the finish_reason is content_filter; the normal streaming stop is "finish_reason":"stop".
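
A client can tell the two cases apart by reading the data: lines until [DONE] and keeping the last finish_reason. The following is a minimal sketch that assumes the line format shown above. The function name read_stream is hypothetical, and the sample lines are shortened to the fields the function uses.

```python
import json


def read_stream(lines):
    """Collect delta text from 'data:' lines and report the finish reason."""
    parts = []
    finish_reason = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # Skip blank or non-data lines.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if delta.get("content"):
                parts.append(delta["content"])
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason


stream = [
    'data: {"choices":[{"delta":{"content":"This is the start. "},"index":0}]}',
    'data: {"choices":[{"delta":{"content":"More text."},"index":0}]}',
    'data: {"choices":[{"finish_reason":"content_filter","index":0}]}',
    "data: [DONE]",
]
text, reason = read_stream(stream)
if reason == "content_filter":
    print("Output was cut off by content moderation:", text)
```

With the requests package, the decoded lines of a response opened with stream=True (for example, from response.iter_lines(decode_unicode=True)) could be passed to the same function.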

Non-streaming Request
- When content moderation is triggered, you receive error code 403. Identify the specific fault using error code ModelArts.81011.
  Returned information:
```
{
    "error_code": "ModelArts.81011",
    "error_msg": "May contain sensitive information, please try again."
}
```
- If content moderation is not triggered, the following information is returned.
  Figure 3 Example successful response
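
Because a 403 status can have other causes, a caller may want to confirm the moderation error code before reacting to it. The following is a minimal sketch based on the returned information shown above; the function name is_moderation_block is hypothetical.

```python
import json

MODERATION_ERROR_CODE = "ModelArts.81011"


def is_moderation_block(status_code, body_text):
    """Return True when a 403 response carries the content moderation error code."""
    if status_code != 403:
        return False
    try:
        body = json.loads(body_text)
    except ValueError:
        return False  # The body is not JSON, so it cannot carry the error code.
    return isinstance(body, dict) and body.get("error_code") == MODERATION_ERROR_CODE


sample_body = (
    '{"error_code": "ModelArts.81011", '
    '"error_msg": "May contain sensitive information, please try again."}'
)
print(is_moderation_block(403, sample_body))
```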

API Calling

Assume that the API URL is https://example.com/v1/infers/937cabe5-d673-47f1-9e7c-2b4de06*****/{endpoint}. The {endpoint} parameter only supports the following APIs:

/v1/chat/completions
/v1/models

Notes:

/v1/models does not need a request body for GET requests. However, /v1/chat/completions requires the POST method and a JSON request body.
The common request header is Authorization: Bearer YOUR_API_KEY. For POST requests, Content-Type: application/json is also required.

**Table 5** APIs
Type/API	/v1/models	/v1/chat/completions
Request method	GET	POST
Usage	Obtains the list of supported models.	Chats with users.
Request body	No request body is needed. Just add the authentication information to the request header.	model: identifier of the model used. messages: an array of messages. Each message must contain role (for example, user or assistant) and content. Optional parameters like temperature and max_tokens control the diversity and length of the output.
Example request	GET https://example.com/v1/infers/937cabe5-d673-47f1-9e7c-2b4de06*****/v1/models HTTP/1.1 Authorization: Bearer YOUR_API_KEY	POST https://example.com/v1/infers/937cabe5-d673-47f1-9e7c-2b4de06***/v1/chat/completions HTTP/1.1 Content-Type: application/json Authorization: Bearer YOUR_API_KEY { "model": "****", "messages": [ {"role": "user", "content": "Hello, how are you?"} ], "temperature": 0.7 }
Response example	{ "data": [ { "id": "****", "description": "Next-generation foundation model" }, { "id": "****", "description": "Cost-effective alternative solution" } ] }	{ "id": "******", "object": "chat.completion", "choices": [ { "index": 0, "message": {"role": "assistant", "content": "I'm doing well, thank you! How can I help you today?"} } ], "usage": { "prompt_tokens": 15, "completion_tokens": 25, "total_tokens": 40 } }
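
The differences between the two APIs in Table 5 can be sketched with the standard library alone. The following example builds, but does not send, one request of each kind. The helper build_request and the SERVICE_ID path segment are hypothetical placeholders; replace the URL, API key, and model name with the values shown in the call description of your service.

```python
import json
import urllib.request


def build_request(base_url, endpoint, api_key, payload=None):
    """Build (but do not send) a request for /v1/models or /v1/chat/completions."""
    headers = {"Authorization": f"Bearer {api_key}"}
    data = None
    if payload is not None:
        # POST requests also need a JSON body and the Content-Type header.
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode("utf-8")
    method = "GET" if payload is None else "POST"
    return urllib.request.Request(
        base_url.rstrip("/") + endpoint, data=data, headers=headers, method=method
    )


BASE_URL = "https://example.com/v1/infers/SERVICE_ID"  # Placeholder URL.
list_models = build_request(BASE_URL, "/v1/models", "YOUR_API_KEY")
chat = build_request(
    BASE_URL,
    "/v1/chat/completions",
    "YOUR_API_KEY",
    {"model": "MODEL_NAME", "messages": [{"role": "user", "content": "Hello"}]},
)
print(list_models.get_method(), list_models.full_url)
print(chat.get_method(), chat.full_url)
```

Passing either object to urllib.request.urlopen would send the request; that step is left out here because it needs a real service URL and API key.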