Updated on 2025-09-29 GMT+08:00

Viewing the Call Data and Monitoring Metrics of Real-Time Inference on ModelArts Studio (MaaS)

MaaS provides call statistics. You can view call data and monitoring metrics for your services and built-in commercial services over a selected time range, including total calls, failed calls, total tokens, input tokens, output tokens, and average response latency. Data trends are shown hourly, helping you track usage and performance changes, evaluate models, locate and fix issues, and improve performance.

Operation Scenarios

  • Resource consumption monitoring: Track the usage of tokens for model services to avoid overuse.
  • Cost analysis: Optimize call strategies to reduce costs based on the distribution of input and output tokens.
  • Performance metrics: View various model performance metrics to perform performance optimization.
  • Service optimization: Analyze the relationship between call frequency and token consumption to adjust service configurations or scaling plans.
  • Troubleshooting: Quickly identify issues such as sudden increases in calls, abnormal consumption, and call failures during specific time periods.

Constraints

  • Statistical scope:
    • Only the call data of built-in commercial services and your custom services is collected.
    • Only data from API calls is collected.
  • Data update delay: Call data updates may take 1 to 2 hours, so they will not show the most recent call activity right away.
  • Time range:
    • You can choose predefined time ranges like today, yesterday, or the last 3, 7, or 14 days.
    • You can also choose a custom period of up to 30 days.
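The pairing between a selected time range and the available statistics precision (described in Table 1 below) can be sketched as a small helper. This is an illustrative sketch only; the function name is not part of any MaaS API:

```python
def supported_precisions(range_days: int) -> list[str]:
    """Return the time precisions available for a statistics time range.

    Mirrors the documented rules: <= 1 day -> minute/hour/day,
    2-7 days -> hour/day, 8-30 days -> day only.
    """
    if not 1 <= range_days <= 30:
        raise ValueError("custom time ranges are limited to 30 days")
    if range_days <= 1:
        return ["minute", "hour", "day"]
    if range_days <= 7:
        return ["hour", "day"]
    return ["day"]
```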

Billing

  • There is no charge for call statistics.
  • When MaaS calls models, related resources may be billed. For details, see Model Inference Billing Items.

Prerequisites

Either of the following conditions must be met:

Viewing Service Call Monitoring Data

On the Calls page, you can view details about the data generated by API calls of all services or a single service.

  1. Log in to the ModelArts Studio (MaaS) console and select the target region on the top navigation bar.
  2. In the navigation pane on the left, choose Management and Statistics > Calls.
  3. In the Real-Time Inference tab, set the time range, service type, calling method, and IP address as required.
    Table 1 Call statistics filtering parameters

    • Time Range: Statistics can be collected for today, yesterday, the last 3 days, the last 7 days, the last 14 days, or a custom time range. The available time precision depends on the selected range:
      • ≤ 1 day: per minute, per hour, or per day.
      • 2–7 days: per hour or per day.
      • 8–30 days: per day only.
    • Service Type:
      • My Services: Model services deployed on the My Services page. For more information, see Deploying a Model Service in ModelArts Studio (MaaS).
      • Built-in Services > Commercial Services: Commercial services subscribed to in the Built-in Services > Commercial Services tab. For more information, see Subscribing to a Built-in Commercial Service in ModelArts Studio (MaaS).
    • Calling Method: By default, statistics cover calls authenticated with any API key of a model service on MaaS. You can filter by specific API keys as needed. For more information, see Calling a Model Service in ModelArts Studio (MaaS) and Managing API Keys in ModelArts Studio (MaaS).
    • IP Address: The client source IP address (public IP) of the calls, derived from the http_x_forwarded_for field in the APIG logs. If this field contains multiple values, the system uses the first one; if the value is -, it is displayed as an empty string. All is selected by default; you can also select specific IP addresses as needed.
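The IP Address derivation described in Table 1 can be sketched as follows; the function name is illustrative and not part of MaaS:

```python
def source_ip(http_x_forwarded_for: str) -> str:
    """Derive the displayed client source IP from APIG's
    http_x_forwarded_for log field: take the first value when several
    are present, and map "-" to an empty string."""
    value = http_x_forwarded_for.split(",")[0].strip()
    return "" if value == "-" else value
```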

  4. In the Real-Time Inference tab, view the total number of calls, total number of failed calls, and total number of tokens used.

    By default, monitoring metrics are accurate to three decimal places.

    Table 2 Service parameters

    • Total Calls: The total number of service calls.
    • Total Failed Calls: The total number of failed calls, that is, the sum of 4xx and 5xx errors.
    • Total Tokens Used: The total number of tokens used by service calls.
    • Input Tokens: The number of input tokens used by service calls.
    • Output Tokens: The number of output tokens used by service calls.

  5. In the Services area of the Real-Time Inference tab, view the number of calls, number of failed calls, and call failure rate of a single service.

    The service list only displays subscribed built-in commercial services and your custom services.

    Table 3 Service list parameters

    • Service Name/Version: The name or version of the called service. Only commercial services support service versions; you can click to view the statistics of each service version.
    • Calls: The number of service calls.
    • Failed Calls: The number of failed service calls.
    • Call Failure Rate (%): The percentage of failed calls.
    • Total Tokens: The total number of tokens used by service calls.
    • Input Tokens: The total number of input tokens.
    • Output Tokens: The total number of output tokens.
    • Average Response Latency (ms): The average response time of successful requests per unit of time.
    • Time to First Token (ms): The time from receiving a request to generating the first output token.
    • Incremental Token Latency (ms): The interval between generating successive output tokens.
    • Average Generation Time (s): The average time taken to generate each image or video.

    If a metric is displayed as -, the metric is not involved in the service. In the Monitoring tab of the Service Call Details page, only service-related metrics are displayed.
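The per-service figures in Table 3 are simply related; for example, the call failure rate follows from the call and failed-call counts. A sketch with illustrative names, assuming the three-decimal display precision mentioned above:

```python
def call_failure_rate(total_calls: int, failed_calls: int) -> float:
    """Failure rate (%) = failed (4xx + 5xx) calls / total calls * 100,
    rounded to three decimal places as displayed on the Calls page."""
    if total_calls == 0:
        return 0.0
    return round(failed_calls / total_calls * 100, 3)
```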

  6. In the Services area of the Real-Time Inference tab, click View Monitoring Details on the right of the target service. On the Service Call Details page, view the call information in the Monitoring or Failed Calls tab.

    In the upper part of the page, you can click the service name to switch services or select a service version as required (only commercial services support versioning). The service switcher lists only subscribed built-in commercial services and your custom services.

    • The Monitoring tab displays the number of calls, call failure rate, input token size, output token size, and average response latency of the service.
      Table 4 Monitoring parameters

      Filter criteria:
      • Time Range: The default value is the time range selected in the Real-Time Inference tab. You can change it as required.
      • Time Precision: Depends on the selected time range:
        • ≤ 1 day: per minute, per hour, or per day.
        • 2–7 days: per hour or per day.
        • 8–30 days: per day only.
      • Calling Method: The default value is the calling method selected in the Real-Time Inference tab. You can change it as required.
      • IP Address: The default value is the IP address selected in the Real-Time Inference tab. You can change it as required.

      Monitoring metrics:
      • Calls (Times): The number of service calls, successes, and failures.
      • Tokens (K Tokens): The number of total, input, and output tokens used.
      • Time to First Token: The time from receiving a request to generating the first output token. Statistics are collected only for streaming responses; some model versions do not report this metric. Upgrade to the latest version to view it (see Upgrading a Model Service in ModelArts Studio (MaaS)).
        • AVG: the average first token latency.
        • MAX: the maximum first token latency.
        • P50/P80/P90/P99: 50%/80%/90%/99% of first token latencies are below this value.
      • Input Tokens (K Tokens): The input sequence length.
        • AVG: the average input length.
        • MAX: the maximum input length.
        • P50/P80/P90/P99: 50%/80%/90%/99% of input lengths are below this value.
      • RPM (Times/Minute): The number of requests processed per minute.
      • Call Failure Rate (%): The percentage of failed calls.
      • Error Occurrences: The number of times each error code occurs.
      • Average Response Latency (ms): The average response time of successful requests per unit of time.
      • Incremental Token Latency (ms): The interval between generating successive output tokens. Statistics are collected only for streaming responses; some model versions do not report this metric. Upgrade to the latest version to view it (see Upgrading a Model Service in ModelArts Studio (MaaS)).
        • AVG: the average incremental token latency.
        • MAX: the maximum incremental token latency.
        • P50/P80/P90/P99: 50%/80%/90%/99% of incremental token latencies are below this value.
      • Output Tokens (K Tokens): The output sequence length.
        • AVG: the average output length.
        • MAX: the maximum output length.
        • P50/P80/P90/P99: 50%/80%/90%/99% of output lengths are below this value.
      • TPM (K Tokens/Minute): The number of tokens processed per minute, including both input and output.
      • Average Generation Time (s): The average time taken to generate each image or video.

    • In the Failed Calls tab, you can view the information about failed calls, such as the error code, number of occurrences, and error information, to locate and rectify the fault.
      Table 5 Parameters for failed call details

      Filter criteria:
      • Time Range: The default value is the time range selected in the Real-Time Inference tab. You can change it as required.
      • Calling Method: The default value is the calling method selected in the Real-Time Inference tab. You can change it as required.
      • IP Address: The default value is the IP address selected in the Real-Time Inference tab. You can change it as required.

      Error information:
      • Error Code: The error code, either 4xx or 5xx. Click the expand icon before 4xx or 5xx to view the error code details, number of occurrences, percentage, and error information.
      • Occurrences: The number of 4xx and 5xx errors.
      • Percentage (%): The percentage of failed calls attributable to this error code.
      • Error Message: Description of the 4xx or 5xx error.
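The AVG/MAX/P50–P99 statistics shown for latency and token-length metrics can be reproduced from raw samples. The sketch below uses a simple nearest-rank percentile, which may differ slightly from the interpolation MaaS actually applies; all names are illustrative:

```python
import math

def metric_summary(samples: list[float]) -> dict[str, float]:
    """Compute AVG, MAX, and nearest-rank percentiles for a metric.

    Pxx is the value below which roughly xx% of the samples fall.
    """
    ordered = sorted(samples)

    def pct(p: float) -> float:
        # Nearest-rank method: the ceil(p/100 * n)-th smallest sample.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "AVG": sum(ordered) / len(ordered),
        "MAX": ordered[-1],
        **{f"P{p}": pct(p) for p in (50, 80, 90, 99)},
    }
```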

Exporting Service Call Monitoring Data

You can export monitoring data from the Service Call Details page, either for all monitoring metrics or only for selected ones.

  1. Choose Management and Statistics > Calls. In the Services area of the Real-Time Inference tab, click View Monitoring Details on the right of the target service.
  2. On the Service Call Details page, select the time range, service type, calling method, and IP address as required.

    For details about the parameters, see Table 4.

  3. In the upper right corner of the page, click Export.
  4. In the displayed dialog box, select monitoring metrics as required (all metrics are selected by default) and click OK.

    The exported file is in XLSX format. Each sheet corresponds to a line chart of a monitoring metric. The line chart consists of a time column and a metric column.

FAQs

  1. Why can't I find the consumed tokens after calling the model?

    Call data updates may take 1 to 2 hours, so consumed tokens and other statistics do not reflect the most recent calls immediately. Check again later.

  2. What is the logic for counting input and output tokens?
    • Input tokens: The total number of tokens after tokenization of the text in your request.
    • Output tokens: The total number of tokens in the model's response, including the end-of-sequence token.
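The counting logic above can be illustrated with a toy whitespace tokenizer. Real counts depend on each model's own (usually subword) tokenizer, so this sketch only shows the bookkeeping, in particular that output tokens include the end-of-sequence token:

```python
EOS = "<eos>"  # illustrative end-of-sequence marker, not a real MaaS token

def toy_tokenize(text: str) -> list[str]:
    # Stand-in for a real model tokenizer, which is model-specific
    # and usually subword-based rather than whitespace-based.
    return text.split()

def count_call_tokens(request_text: str, response_text: str) -> tuple[int, int]:
    """Input tokens: tokens in the request text after tokenization.
    Output tokens: tokens in the response plus the end-of-sequence token."""
    input_tokens = len(toy_tokenize(request_text))
    output_tokens = len(toy_tokenize(response_text) + [EOS])
    return input_tokens, output_tokens
```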