Updated on 2025-08-04 GMT+08:00

Calling a Model Service in ModelArts Studio (MaaS)

Model services deployed in ModelArts Studio can be called in other service environments. This section describes how to call your deployed model from My Services. You can also call built-in services (free, commercial, or custom access point).

Description

When developing AI applications, developers need to deploy trained models into real-world services. This often involves manually setting up environments, handling dependencies, and writing deployment scripts. This approach takes significant time, risks errors, and leads to challenges like complicated setups, hard migrations, costly maintenance, and awkward updates.

MaaS offers a one-stop solution with unified APIs for system integration, along with built-in monitoring and logging features for efficient O&M.

Billing

Model inference services process input by converting it into tokens. When you use a built-in MaaS service, you are billed based on the number of tokens. For details, see Billing.

Prerequisites

In the My Services tab of the Real-Time Inference page, there is a model service in the Running, Updating, or Upgrading state. For details, see Deploying a Model Service in ModelArts Studio (MaaS).

Step 1: Obtaining an API Key

When calling a model service deployed in MaaS, you need to enter an API key for API authentication. You can create up to 30 keys. Each key is displayed only once after creation, so keep it secure. A lost key cannot be retrieved; create a new API key instead. For more information, see Managing API Keys in ModelArts Studio (MaaS).

  1. Log in to the ModelArts Studio console and select the target region on the top navigation bar.
  2. In the navigation pane, choose API Keys.
  3. On the API Keys page, click Create API Key, enter the tag and description, and click OK.

    The tag and description cannot be modified after the key is created.

    Table 1 Parameters

    | Parameter | Description |
    | --- | --- |
    | Tag | Tag of the API key. The tag must be unique. The tag can contain 1 to 100 characters. Only letters, digits, underscores (_), and hyphens (-) are allowed. |
    | Description | Description of the custom API key. The value can contain 1 to 100 characters. |

  4. In the Your Key dialog box, copy the key and store it securely.
  5. After the key is saved, click Close.

    After you click Close, the key cannot be viewed again.
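
The tag and description rules in Table 1, and the advice to store the key securely, can be sketched in Python as follows. This is a minimal illustration rather than part of the MaaS API: the function names and the MAAS_API_KEY environment variable name are assumptions, and "letters" is interpreted here as ASCII letters.

```python
import os
import re

# Table 1: a tag has 1 to 100 characters; only letters, digits,
# underscores, and hyphens are allowed (ASCII letters assumed here).
TAG_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,100}")


def is_valid_tag(tag: str) -> bool:
    """Check a tag against the length and character rules in Table 1."""
    return TAG_PATTERN.fullmatch(tag) is not None


def is_valid_description(description: str) -> bool:
    """Check that a description contains 1 to 100 characters."""
    return 1 <= len(description) <= 100


def load_api_key(env_var: str = "MAAS_API_KEY") -> str:
    """Read the API key from an environment variable (hypothetical name)
    so that the key does not have to be hard-coded in scripts."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key
```

Because the key is shown only once, reading it from the environment or from a secrets manager keeps it out of source files that might be shared.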

Step 2: Calling a Model Service for Prediction

  1. Log in to the ModelArts Studio console and choose Real-Time Inference in the navigation pane on the left.
  2. On the Real-Time Inference page, click the My Services tab. Choose More > View Call Description in the Operation column of the target service.
  3. On the displayed page, select an API type, copy the example call, modify the API information and API key, and use the information to call the model service API in the service environment.

    The following shows sample code for REST APIs and OpenAI SDK.

    • Use the Python requests package.
      import requests
      import json
      
      if __name__ == '__main__':
          url = "https://example.com/v1/infers/937cabe5-d673-47f1-9e7c-2b4de06*****/v1/chat/completions"
          api_key = "<your_apiKey>"  # Replace <your_apiKey> with the obtained API key.
      
          # Send a request.
          headers = {
              'Content-Type': 'application/json',
              'Authorization': f'Bearer {api_key}'
          }
          data = {
              "model": "******",  # Model name for calling
              "max_tokens": 1024,  # Maximum number of output tokens.
              "messages": [
                  {"role": "system", "content": "You are a helpful assistant."},
                  {"role": "user", "content": "hello"}
              ],
              # Controls whether to enable streaming inference. The default value is False, indicating that streaming inference is disabled.
              "stream": False,
              # Controls whether to show the number of tokens used during streaming output. This parameter is valid only when stream is set to True.
              # "stream_options": {"include_usage": True},
              # A floating-point number that controls the sampling randomness. Smaller values make the model more deterministic, while larger values make it more creative. The value 0 indicates greedy sampling. The default value is 0.6.
              "temperature": 0.6
          }
          # verify=False skips TLS certificate verification; use it only in test environments.
          response = requests.post(url, headers=headers, data=json.dumps(data), verify=False)
          # Print the result.
          print(response.status_code)
          print(response.text)
    • Use the OpenAI SDK.
      from openai import OpenAI
      
      if __name__ == '__main__':
          base_url = "https://example.com/v1/infers/937cabe5-d673-47f1-9e7c-2b4de06******/v1"
          api_key = "<your_apiKey>"  # Replace <your_apiKey> with the obtained API key.

          client = OpenAI(api_key=api_key, base_url=base_url)

          response = client.chat.completions.create(
              model="******",  # Model name for calling.
              messages=[
                  {"role": "system", "content": "You are a helpful assistant"},
                  {"role": "user", "content": "Hello"},
              ],
              max_tokens=1024,
              temperature=0.6,
              stream=False
          )
          # Print the result.
          print(response.choices[0].message.content)

    The model service API is the same as that of vLLM. Table 2 only covers key parameters. For more information, see the vLLM official website. When enabling stream output for a model using the Ascend Cloud 909 image, you need to add the stream_options parameter with the value {"include_usage": true} to print the number of tokens used.

    Table 2 Request parameters

    | Parameter | Mandatory | Default Value | Type | Description |
    | --- | --- | --- | --- | --- |
    | url | Yes | None | Str | API URL. Assume that the URL is https://example.com/v1/infers/937cabe5-d673-47f1-9e7c-2b4de06*****/{endpoint}. {endpoint} supports only the following APIs: /v1/chat/completions and /v1/models. For details, see API Calling. |
    | model | Yes | None | Str | Model name for calling. To obtain the model name, go to the Real-Time Inference page of ModelArts Studio, and choose More > Call in the Operation column; from there, you can see the model name. |
    | messages | Yes | N/A | Array | Input question of the request. |
    | stream_options | No | None | Object | Controls whether to show the number of tokens used during streaming output. This parameter is valid only when stream is set to True. You need to set stream_options to {"include_usage": true} to print the number of tokens used. |
    | max_tokens | No | 16 | Int | Maximum number of tokens to be generated for each output sequence. |
    | top_k | No | -1 | Int | Determines how many of the highest ranking tokens are considered. The value -1 indicates that all tokens are considered. Decreasing the value can reduce the sampling time. |
    | top_p | No | 1.0 | Float | A floating-point number that controls the cumulative probability of the first several tokens to be considered. Value range: 0 to 1. The value 1 indicates that all tokens are considered. |
    | temperature | No | 0.6 | Float | A floating-point number that controls the sampling randomness. Smaller values make the model more deterministic, while larger values make it more creative. The value 0 indicates greedy sampling. |
    | stop | No | None | None/Str/List | A list of strings used to stop generation. The output does not contain the stop strings. For example, if the value is set to ["You", "Good"], text generation will stop once either You or Good is reached. |
    | stream | No | False | Bool | Controls whether to enable streaming inference. The default value is False, indicating that streaming inference is disabled. |
    | n | No | 1 | Int | Number of results returned for each request. If beam_search is not used, the recommended value range of n is 1 ≤ n ≤ 10. If n is greater than 1, ensure that greedy_sample is not used for sampling, that is, top_k is greater than 1 and temperature is greater than 0. If beam_search is used, the recommended value range of n is 1 < n ≤ 10. If n is 1, the inference request will fail. NOTE: For optimal performance, keep n at 10 or below. Large values of n can significantly slow down processing. Inadequate video RAM may cause inference requests to fail. |
    | use_beam_search | No | False | Bool | Controls whether to use beam_search to replace sampling. When this parameter is used, the following parameters must be configured as required: n: > 1; top_p: 1.0; top_k: -1; temperature: 0.0. |
    | presence_penalty | No | 0.0 | Float | Applies rewards or penalties based on the presence of new words in the generated text. The value range is [-2.0, 2.0]. |
    | frequency_penalty | No | 0.0 | Float | Applies rewards or penalties based on the frequency of each word in the generated text. The value range is [-2.0, 2.0]. |
    | length_penalty | No | 1.0 | Float | Imposes a larger penalty on longer sequences in a beam search process. When this parameter is used, the following parameters must be configured as required: top_k: -1; use_beam_search: true; best_of: > 1. |
    | ignore_eos | No | False | Bool | Indicates whether to ignore EOS and continue to generate tokens. |
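
Several parameters in Table 2 constrain one another, so a local check before sending a request can catch mistakes early. The following is a minimal sketch of such a check. The function name check_sampling_params is hypothetical, the service does not provide it, and treating top_k = -1 (all tokens considered) as non-greedy is an interpretation of the table rather than a documented rule.

```python
from typing import Any, Dict, List


def check_sampling_params(params: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a request body, based on Table 2."""
    problems: List[str] = []
    # Defaults follow the Default Value column of Table 2.
    n = params.get("n", 1)
    top_k = params.get("top_k", -1)
    top_p = params.get("top_p", 1.0)
    temperature = params.get("temperature", 0.6)
    beam = params.get("use_beam_search", False)

    if not 0 <= top_p <= 1:
        problems.append("top_p must be between 0 and 1")
    for name in ("presence_penalty", "frequency_penalty"):
        if not -2.0 <= params.get(name, 0.0) <= 2.0:
            problems.append(f"{name} must be in [-2.0, 2.0]")

    if n < 1:
        problems.append("n must be at least 1")
    if n > 10:
        problems.append("n above 10 is not recommended")

    if beam:
        # Settings required when beam search replaces sampling.
        if n <= 1:
            problems.append("use_beam_search requires n > 1")
        if top_p != 1.0:
            problems.append("use_beam_search requires top_p = 1.0")
        if top_k != -1:
            problems.append("use_beam_search requires top_k = -1")
        if temperature != 0.0:
            problems.append("use_beam_search requires temperature = 0.0")
    else:
        # Assumption: sampling is greedy when temperature is 0 or top_k is 1.
        greedy = temperature == 0 or top_k == 1
        if n > 1 and greedy:
            problems.append("n > 1 requires non-greedy sampling")
    return problems
```

An empty list means that none of the checked rules is violated; it does not guarantee that the service accepts the request.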

    • The following is a sample response returned when the service is called using the requests package or the OpenAI SDK.
      {
          "id": "cmpl-29f7a172056541449eb1f9d31c*****",
          "object": "chat.completion",
          "created": 17231*****,
          "model": "******",
          "choices": [
              {
                  "index": 0,
                  "message": {
                      "role": "assistant",
                      "content": "Hello. I'm glad to help. Is there anything I can help with?"
                  },
                  "logprobs": null,
                  "finish_reason": "stop",
                  "stop_reason": null
              }
          ],
          "usage": {
              "prompt_tokens": 20,
              "total_tokens": 38,
              "completion_tokens": 18
          }
      }
    • The following sample code reads the response of a chain-of-thought model:
      messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]
      response = client.chat.completions.create(model=model, messages=messages)
      reasoning_content = response.choices[0].message.reasoning_content
      content = response.choices[0].message.content
      print("reasoning_content:", reasoning_content)
      print("content:", content)
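
When the service is called with the requests package, the response arrives as JSON text rather than as an SDK object. The following sketch shows one way to pull the answer, the optional reasoning, and the token usage out of a parsed response. The helper name extract_answer is hypothetical, and the sample dictionary is a shortened version of the sample response above with the masked fields left out.

```python
from typing import Any, Dict


def extract_answer(response: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the first choice's text, optional reasoning, and usage."""
    message = response["choices"][0]["message"]
    return {
        "content": message.get("content"),
        # Present only for models that support a chain of thought.
        "reasoning_content": message.get("reasoning_content"),
        "usage": response.get("usage", {}),
    }


# Shortened sample response; masked fields such as id and model are omitted.
sample = {
    "object": "chat.completion",
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Hello. I'm glad to help."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 20, "total_tokens": 38, "completion_tokens": 18},
}

answer = extract_answer(sample)
usage = answer["usage"]
print(answer["content"])
# The total equals the input tokens plus the output tokens.
print(usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"])
```

With the requests package, the dictionary passed to the helper would typically come from response.json().
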

    Table 3 Response parameters

    | Parameter | Type | Description |
    | --- | --- | --- |
    | id | Str | Request ID |
    | object | Str | Request task |
    | created | Int | Timestamp when the request is created |
    | model | Str | Model to call |
    | choices | Array | Model-generated content |
    | usage | Object | Request input length, output length, and total length. prompt_tokens: number of input tokens. completion_tokens: number of output tokens. total_tokens: total number of tokens. Total number of tokens = Number of input tokens + Number of output tokens |
    | reasoning_content | Str | Model's thought process when it supports a chain of thought. For models that support a chain of thought, when streaming output is enabled, the reasoning appears in the reasoning_content field first, followed by the answer in the content field. |
    | content | Str | Response of the model. |
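
With streaming output enabled, the reasoning and the answer arrive in pieces, so a client has to stitch them together. The following is a minimal sketch of that accumulation. It assumes the stream uses the OpenAI-compatible server-sent events format, that is, lines of the form "data: {json}" that end with "data: [DONE]", with increments carried in choices[].delta. This page does not spell out the wire format, and collect_stream is a hypothetical helper name.

```python
import json
from typing import Any, Dict, Iterable, Union


def collect_stream(lines: Iterable[Union[str, bytes]]) -> Dict[str, Any]:
    """Accumulate reasoning text, answer text, and usage from streamed lines."""
    result: Dict[str, Any] = {"reasoning_content": "", "content": "", "usage": None}
    for line in lines:
        if isinstance(line, bytes):
            line = line.decode("utf-8")
        line = line.strip()
        if not line.startswith("data:"):
            continue  # Skip blank separator lines.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        # Usage is reported only when stream_options is {"include_usage": true}.
        if chunk.get("usage"):
            result["usage"] = chunk["usage"]
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            result["reasoning_content"] += delta.get("reasoning_content") or ""
            result["content"] += delta.get("content") or ""
    return result
```

With the requests package, the lines could come from response.iter_lines() on a request sent with stream=True in both the request body and the requests.post call.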

    If the call fails, adjust the script or runtime environment based on the error code.

    Table 4 Common error codes

    | Error Code | Content | Description |
    | --- | --- | --- |
    | 400 | Bad Request | The request contains syntax errors. |
    | 403 | Forbidden | The server refused the request. |
    | 404 | Not Found | The server cannot find the requested resource. |
    | 500 | Internal Server Error | Internal service error. |
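
A calling script can translate the status codes in Table 4 into readable messages. The following is a minimal sketch of that idea. The names ERROR_HINTS and describe_status are hypothetical, and the "check ..." suggestions are general troubleshooting guesses rather than documented remedies.

```python
# Status codes and meanings from Table 4, each with a hedged suggestion.
ERROR_HINTS = {
    400: "Bad Request: the request contains syntax errors. Check the JSON body.",
    403: "Forbidden: the server refused the request. Check the API key.",
    404: "Not Found: the requested resource was not found. Check the URL.",
    500: "Internal Server Error: internal service error. Try again later.",
}


def describe_status(status_code: int) -> str:
    """Map an HTTP status code to a short, readable message."""
    if 200 <= status_code < 300:
        return "OK"
    return ERROR_HINTS.get(status_code, f"Unexpected status code {status_code}")
```

For example, describe_status(response.status_code) could be logged next to response.text when a call made with the requests package fails.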

API Calling

Assume that the API URL is https://example.com/v1/infers/937cabe5-d673-47f1-9e7c-2b4de06*****/{endpoint}. The {endpoint} parameter only supports the following APIs:

  • /v1/chat/completions
  • /v1/models

Notes:

  • /v1/models does not need a request body for GET requests. However, /v1/chat/completions requires the POST method and a JSON request body.
  • The common request header is Authorization: Bearer YOUR_API_KEY. For POST requests, Content-Type: application/json is also required.

Table 5 APIs

| Type/API | /v1/models | /v1/chat/completions |
| --- | --- | --- |
| Request method | GET | POST |
| Usage | Obtains the list of supported models. | Chats with users. |
| Request body | No request body is needed. Just add the authentication information to the request header. | model: identifier of the model used. messages: an array of messages. Each message must contain role (for example, user or assistant) and content. Optional parameters like temperature and max_tokens control the diversity and length of the output. |

Example request for /v1/models:

GET https://example.com/v1/infers/937cabe5-d673-47f1-9e7c-2b4de06*****/v1/models HTTP/1.1
Authorization: Bearer YOUR_API_KEY

Example request for /v1/chat/completions:

POST https://example.com/v1/infers/937cabe5-d673-47f1-9e7c-2b4de06*****/v1/chat/completions HTTP/1.1
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY

{
  "model": "******",
  "messages": [
    {"role": "user", "content": "Hello, how are you?"}
  ],
  "temperature": 0.7
}

Response example for /v1/models:

{
  "data": [
    {
      "id": "******",
      "description": "Next-generation foundation model"
    },
    {
      "id": "******",
      "description": "Cost-effective alternative solution"
    }
  ]
}

Response example for /v1/chat/completions:

{
  "id": "******",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "I'm doing well, thank you! How can I help you today?"}
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 25,
    "total_tokens": 40
  }
}
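
The method and header rules for the two endpoints can be captured in one small helper. The following sketch uses only the Python standard library and builds the request without sending it. The function name build_request and the SERVICE_ID placeholder in the usage note are illustrative assumptions.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_request(
    base_url: str,
    endpoint: str,
    api_key: str,
    body: Optional[Dict[str, Any]] = None,
) -> urllib.request.Request:
    """Build a GET request (no body) or a POST request (JSON body)."""
    # Every request carries the API key as a bearer token.
    headers = {"Authorization": f"Bearer {api_key}"}
    data = None
    method = "GET"
    if body is not None:
        # POST requests also need a JSON content type.
        headers["Content-Type"] = "application/json"
        data = json.dumps(body).encode("utf-8")
        method = "POST"
    url = base_url.rstrip("/") + "/" + endpoint.lstrip("/")
    return urllib.request.Request(url, data=data, headers=headers, method=method)
```

For example, build_request("https://example.com/v1/infers/SERVICE_ID", "/v1/models", api_key) yields a GET request that can be sent with urllib.request.urlopen. Passing a body that contains model and messages yields the POST form for /v1/chat/completions.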

FAQs

How Long Does It Take for an API Key to Become Valid After It Is Created in MaaS?

A MaaS API key becomes valid a few minutes after creation.
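
Because a new key needs a few minutes to take effect, a script that runs right after key creation may want to wait and retry. The following is a minimal sketch of such a retry loop. It assumes that a key that is not yet valid is rejected with status code 401 or 403, which this page does not confirm. The function name call_with_retry, the number of attempts, and the delay are illustrative choices.

```python
import time
from typing import Callable, Tuple


def call_with_retry(
    send: Callable[[], int],
    retry_statuses: Tuple[int, ...] = (401, 403),  # Assumed rejection codes.
    attempts: int = 5,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a status outside retry_statuses
    or the attempts run out, and return the last status code."""
    status = send()
    for _ in range(attempts - 1):
        if status not in retry_statuses:
            break
        sleep(delay_seconds)
        status = send()
    return status
```

Here send is any function that performs the call and returns the HTTP status code, for example a small wrapper around requests.post.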