Updated on 2025-09-29 GMT+08:00

Viewing the Call Data and Monitoring Metrics of Real-Time Inference on ModelArts Studio (MaaS)

MaaS provides call statistics. You can view call data and monitoring metrics for your services and built-in commercial services over a selected time range, including total calls, failed calls, total tokens, input tokens, output tokens, and average response latency. Data trends are shown hourly, helping you track usage and performance changes, evaluate models, locate and fix issues, and improve performance.

Operation Scenarios

  • Resource consumption monitoring: Track the usage of tokens for model services to avoid overuse.
  • Cost analysis: Optimize call strategies to reduce costs based on the distribution of input and output tokens.
  • Performance metrics: View various model performance metrics to perform performance optimization.
  • Service optimization: Analyze the relationship between call frequency and token consumption to adjust service configurations or scaling plans.
  • Troubleshooting: Quickly identify issues such as sudden increases in calls, abnormal consumption, and call failures during specific time periods.

Constraints

  • Statistical scope:
    • Only the call data of built-in commercial services and your custom services is collected.
    • Only data from API calls is collected.
  • Data update delay: Call data updates may take 1 to 2 hours, so they will not show the most recent call activity right away.
  • Time range:
    • You can choose predefined time ranges like today, yesterday, or the last 3, 7, or 14 days.
    • You can also choose a custom period of up to 30 days.
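The pairing between a selected time range and the available statistics precision (described in Table 1 below) can be sketched as a small helper. This is an illustrative sketch only; the function name is not part of any MaaS API:

```python
def supported_precisions(range_days: int) -> list[str]:
    """Return the time precisions available for a statistics time range.

    Mirrors the documented rules: <= 1 day -> minute/hour/day,
    2-7 days -> hour/day, 8-30 days -> day only.
    """
    if not 1 <= range_days <= 30:
        raise ValueError("custom time ranges are limited to 30 days")
    if range_days <= 1:
        return ["minute", "hour", "day"]
    if range_days <= 7:
        return ["hour", "day"]
    return ["day"]
```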

Billing

  • There is no charge for call statistics.
  • When MaaS calls models, related resources may be billed. For details, see Model Inference Billing Items.

Prerequisites

Either of the following conditions must be met:

Viewing Service Call Monitoring Data

On the Calls page, you can view details about the data generated by API calls of all services or a single service.

  1. Log in to the ModelArts Studio (MaaS) console and select the target region on the top navigation bar.
  2. In the navigation pane on the left, choose Management and Statistics > Calls.
  3. In the Real-Time Inference tab, set the time range, service type, calling method, and IP address as required.
    Table 1 Call statistics filtering parameters

    • Time Range: Statistics can be collected for today, yesterday, the last 3 days, the last 7 days, the last 14 days, or a custom time range. The available time precision depends on the selected range:
      • ≤ 1 day: per minute, per hour, or per day.
      • 2–7 days: per hour or per day.
      • 8–30 days: per day only.
    • Service Type:
      • My Services: Model services deployed on the My Services page. For more information, see Deploying a Model Service in ModelArts Studio (MaaS).
      • Built-in Services > Commercial Services: Commercial services subscribed to in the Built-in Services > Commercial Services tab. For more information, see Subscribing to a Built-in Commercial Service in ModelArts Studio (MaaS).
    • Calling Method: By default, statistics cover calls authenticated with any API key of a model service on MaaS. You can filter by specific API keys as needed. For more information, see Calling a Model Service in ModelArts Studio (MaaS) and Managing API Keys in ModelArts Studio (MaaS).
    • IP Address: The client source IP address (public IP) of the calls, derived from the http_x_forwarded_for field in the APIG logs. If this field contains multiple values, the system uses the first one; if the value is -, it is displayed as an empty string. All is selected by default; you can also select specific IP addresses as needed.
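The IP Address derivation described in Table 1 can be sketched as follows; the function name is illustrative and not part of MaaS:

```python
def source_ip(http_x_forwarded_for: str) -> str:
    """Derive the displayed client source IP from APIG's
    http_x_forwarded_for log field: take the first value when several
    are present, and map "-" to an empty string."""
    value = http_x_forwarded_for.split(",")[0].strip()
    return "" if value == "-" else value
```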

  4. In the Real-Time Inference tab, view the total number of calls, total number of failed calls, and total number of tokens used.

    By default, monitoring metrics are accurate to three decimal places.

    Table 2 Service parameters

    • Total Calls: The total number of service calls.
    • Total Failed Calls: The total number of failed calls, that is, the sum of 4xx and 5xx errors.
    • Total Tokens Used: The total number of tokens used by service calls.
    • Input Tokens: The number of input tokens used by service calls.
    • Output Tokens: The number of output tokens used by service calls.

  5. In the Services area of the Real-Time Inference tab, view the number of calls, number of failed calls, and call failure rate of a single service.

    The service list only displays subscribed built-in commercial services and your custom services.

    Table 3 Service list parameters

    • Service Name/Version: The name or version of the called service. Only commercial services support service versions; you can click to view the statistics of each service version.
    • Calls: The number of service calls.
    • Failed Calls: The number of failed service calls.
    • Call Failure Rate (%): The percentage of failed calls.
    • Total Tokens: The total number of tokens used by service calls.
    • Input Tokens: The total number of input tokens.
    • Output Tokens: The total number of output tokens.
    • Average Response Latency (ms): The average response time of successful requests per unit of time.
    • Time to First Token (ms): The time from receiving a request to generating the first output token.
    • Incremental Token Latency (ms): The interval between generating successive output tokens.
    • Average Generation Time (s): The average time taken to generate each image or video.

    If a metric is displayed as -, the metric is not involved in the service. In the Monitoring tab of the Service Call Details page, only service-related metrics are displayed.
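The per-service figures in Table 3 are simply related; for example, the call failure rate follows from the call and failed-call counts. A sketch with illustrative names, assuming the three-decimal display precision mentioned above:

```python
def call_failure_rate(total_calls: int, failed_calls: int) -> float:
    """Failure rate (%) = failed (4xx + 5xx) calls / total calls * 100,
    rounded to three decimal places as displayed on the Calls page."""
    if total_calls == 0:
        return 0.0
    return round(failed_calls / total_calls * 100, 3)
```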

  6. In the Services area of the Real-Time Inference tab, click View Monitoring Details on the right of the target service. On the Service Call Details page, view the call information in the Monitoring or Failed Calls tab.

    In the upper part of the page, you can click the service name to switch services or select a service version as required (only commercial services support versioning). The service switcher lists only subscribed built-in commercial services and your custom services.

    • The Monitoring tab displays the number of calls, call failure rate, input token size, output token size, and average response latency of the service.
      Table 4 Monitoring parameters

      Filter criteria:
      • Time Range: The default value is the time range selected in the Real-Time Inference tab. You can change it as required.
      • Time Precision: Depends on the selected time range:
        • ≤ 1 day: per minute, per hour, or per day.
        • 2–7 days: per hour or per day.
        • 8–30 days: per day only.
      • Calling Method: The default value is the calling method selected in the Real-Time Inference tab. You can change it as required.
      • IP Address: The default value is the IP address selected in the Real-Time Inference tab. You can change it as required.

      Monitoring metrics:
      • Calls (Times): The number of service calls, successes, and failures.
      • Tokens (K Tokens): The number of total, input, and output tokens used.
      • Time to First Token: The time from receiving a request to generating the first output token. Statistics are collected only for streaming responses; some model versions do not report this metric. Upgrade to the latest version to view it (see Upgrading a Model Service in ModelArts Studio (MaaS)).
        • AVG: the average first token latency.
        • MAX: the maximum first token latency.
        • P50/P80/P90/P99: 50%/80%/90%/99% of first token latencies are below this value.
      • Input Tokens (K Tokens): The input sequence length.
        • AVG: the average input length.
        • MAX: the maximum input length.
        • P50/P80/P90/P99: 50%/80%/90%/99% of input lengths are below this value.
      • RPM (Times/Minute): The number of requests processed per minute.
      • Call Failure Rate (%): The percentage of failed calls.
      • Error Occurrences: The number of times each error code occurs.
      • Average Response Latency (ms): The average response time of successful requests per unit of time.
      • Incremental Token Latency (ms): The interval between generating successive output tokens. Statistics are collected only for streaming responses; some model versions do not report this metric. Upgrade to the latest version to view it (see Upgrading a Model Service in ModelArts Studio (MaaS)).
        • AVG: the average incremental token latency.
        • MAX: the maximum incremental token latency.
        • P50/P80/P90/P99: 50%/80%/90%/99% of incremental token latencies are below this value.
      • Output Tokens (K Tokens): The output sequence length.
        • AVG: the average output length.
        • MAX: the maximum output length.
        • P50/P80/P90/P99: 50%/80%/90%/99% of output lengths are below this value.
      • TPM (K Tokens/Minute): The number of tokens processed per minute, including both input and output.
      • Average Generation Time (s): The average time taken to generate each image or video.

    • In the Failed Calls tab, you can view the information about failed calls, such as the error code, number of occurrences, and error information, to locate and rectify the fault.
      Table 5 Parameters for failed call details

      Filter criteria:
      • Time Range: The default value is the time range selected in the Real-Time Inference tab. You can change it as required.
      • Calling Method: The default value is the calling method selected in the Real-Time Inference tab. You can change it as required.
      • IP Address: The default value is the IP address selected in the Real-Time Inference tab. You can change it as required.

      Error information:
      • Error Code: The error code, either 4xx or 5xx. Click the expand icon before 4xx or 5xx to view the error code details, number of occurrences, percentage, and error information.
      • Occurrences: The number of 4xx and 5xx errors.
      • Percentage (%): The percentage of failed calls attributable to this error code.
      • Error Message: Description of the 4xx or 5xx error.
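The AVG/MAX/P50–P99 statistics shown for latency and token-length metrics can be reproduced from raw samples. The sketch below uses a simple nearest-rank percentile, which may differ slightly from the interpolation MaaS actually applies; all names are illustrative:

```python
import math

def metric_summary(samples: list[float]) -> dict[str, float]:
    """Compute AVG, MAX, and nearest-rank percentiles for a metric.

    Pxx is the value below which roughly xx% of the samples fall.
    """
    ordered = sorted(samples)

    def pct(p: float) -> float:
        # Nearest-rank method: the ceil(p/100 * n)-th smallest sample.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "AVG": sum(ordered) / len(ordered),
        "MAX": ordered[-1],
        **{f"P{p}": pct(p) for p in (50, 80, 90, 99)},
    }
```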

Exporting Service Call Monitoring Data

You can export monitoring data from the Service Call Details page, either for all monitoring metrics or only for selected ones.

  1. Choose Management and Statistics > Calls. In the Services area of the Real-Time Inference tab, click View Monitoring Details on the right of the target service.
  2. On the Service Call Details page, select the time range, service type, calling method, and IP address as required.

    For details about the parameters, see Table 4.

  3. In the upper right corner of the page, click Export.
  4. In the displayed dialog box, select monitoring metrics as required (all metrics are selected by default) and click OK.

    The exported file is in XLSX format. Each sheet corresponds to a line chart of a monitoring metric. The line chart consists of a time column and a metric column.

FAQs

  1. Why can't I find the consumed tokens after calling the model?

    Call data updates may take 1 to 2 hours, so consumed tokens and other statistics do not reflect the most recent calls immediately. Check again later.

  2. What is the logic for counting input and output tokens?
    • Input tokens: The total number of tokens after tokenization of the text in your request.
    • Output tokens: The total number of tokens in the model's response, including the end-of-sequence token.
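The counting logic above can be illustrated with a toy whitespace tokenizer. Real counts depend on each model's own (usually subword) tokenizer, so this sketch only shows the bookkeeping, in particular that output tokens include the end-of-sequence token:

```python
EOS = "<eos>"  # illustrative end-of-sequence marker, not a real MaaS token

def toy_tokenize(text: str) -> list[str]:
    # Stand-in for a real model tokenizer, which is model-specific
    # and usually subword-based rather than whitespace-based.
    return text.split()

def count_call_tokens(request_text: str, response_text: str) -> tuple[int, int]:
    """Input tokens: tokens in the request text after tokenization.
    Output tokens: tokens in the response plus the end-of-sequence token."""
    input_tokens = len(toy_tokenize(request_text))
    output_tokens = len(toy_tokenize(response_text) + [EOS])
    return input_tokens, output_tokens
```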