Live Comparison

Overview

A critical challenge in the industrialization of AI is selecting the most suitable model from a vast array of foundation models and fine-tuned versions. The live comparison feature provides an intuitive evaluation platform, allowing you to perform side-by-side benchmarking of different models using identical inputs.

Core functions and value:

Foundation model selection: In the early stages of a project, compare the performance of different models like DeepSeek, Qwen3, and GLM to find the best fit for your business needs.
Fine-tuning validation: Compare a base foundation model with its fine-tuned version side by side to check whether domain-specific knowledge has been successfully injected and whether any catastrophic forgetting has occurred.
Parameter strategy optimization (A/B testing): Compare output variations of the same model using different hyperparameters (such as Temperature or Top_P) to determine the optimal inference configuration.

Prerequisites

An inference service has been deployed. For details, see Deploying a Model as a Real-Time Service.

Constraints

Console restrictions: This feature is only available in the new console.
Model type restrictions: Currently, only text generation models in the LLM domain are supported for comparison. Models in other domains are not supported.
Quantity limits: To ensure optimal frontend rendering performance and facilitate manual side-by-side evaluation, a maximum of three models can be compared simultaneously in a single task.
Timeout limits: During a real-time comparison task, if a model fails to complete its response within 5 minutes due to reasoning latency or performance issues, the corresponding window will trigger a timeout alert and terminate the generation.
Chat history limits: The system automatically saves the 100 most recent chat history records for each IAM user within the same workspace. When a new record would push the count above 100, the oldest record is deleted immediately. In addition, records older than 7 days are deleted automatically each day at 00:00.
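The two chat history rules can be pictured with a small sketch. This is only an illustration of the stated limits, not the service's implementation: the function name `prune_history` and the `(created_at, title)` record shape are hypothetical, and the sketch applies both rules in one pass, whereas the console applies the count limit immediately and the age limit once a day.

```python
from datetime import datetime, timedelta

MAX_RECORDS = 100             # per IAM user per workspace (from the constraint above)
MAX_AGE = timedelta(days=7)   # records older than this are removed


def prune_history(records, now):
    """Return the records that survive both limits, newest first.

    `records` is a list of (created_at, title) tuples; this shape is an
    assumption made for the example.
    """
    # Age rule: drop anything older than 7 days.
    fresh = [r for r in records if now - r[0] <= MAX_AGE]
    # Count rule: keep only the 100 most recent of what remains.
    fresh.sort(key=lambda r: r[0], reverse=True)
    return fresh[:MAX_RECORDS]
```

For example, with one record per hour over the last 300 hours, only the 100 newest records remain after pruning.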

Procedure

Log in to the ModelArts console. In the navigation pane on the left, choose Model Evaluation > Live Comparison.
There are multiple entry points for live comparison. In addition to the left navigation pane, you can also use the following methods:
- In the navigation pane on the left, choose Model Inference > Real-Time Inference. Click Live Comparison in the Operation column on the right.
- In the navigation pane on the left, choose Model Inference > Real-Time Inference. Click the target service name to go to the service details page. Click Live Comparison in the upper right corner.
In the upper right corner of the Live Comparison page, click Service Comparison. In the Live Comparison | Select Service dialog box, select one to three services as required and click OK.
Configure service parameters as required.
Click the icon next to the service name to set parameters such as system persona, temperature, and top_p. Temperature and top_p control how random and diverse the output is. When the goal is to compare models rather than parameter settings, keep the parameter configurations consistent across all selected services so that the evaluation is fair (controlled variables).

For details about service parameters and typical scenarios, see Service Parameter Configuration.

Click a preset question in the middle of the page or enter a question in the text box. Click the send icon or press Enter to send the question. Press Shift+Enter to start a new line.

The system will send this question to all selected models.
Both single-turn Q&A and multi-turn dialogues in the current context are supported.

If multiple models are selected for comparison, the system displays the generation results of each model in a side-by-side view. This allows you to evaluate and compare the models' performance regarding logical consistency, formatting accuracy, and semantic precision.

The total time and the reasoning time are displayed below each model's answer. For details, see Metrics.
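The behavior described in this procedure (one question sent to every selected service, with a per-task time limit) can be sketched roughly as follows. This is a minimal illustration, not the console's implementation or a ModelArts API: `compare` and the service callables are hypothetical stand-ins for whatever client you use to call each deployed service. The limits of three services and 5 minutes come from Constraints.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

MAX_SERVICES = 3        # quantity limit stated in Constraints
TIMEOUT_SECONDS = 300   # 5-minute limit stated in Constraints


def compare(prompt, services, timeout=TIMEOUT_SECONDS):
    """Send one prompt to every service and collect the answers.

    `services` maps a display name to a callable(prompt) -> str.
    The callables are hypothetical clients supplied by the caller.
    """
    if not 1 <= len(services) <= MAX_SERVICES:
        raise ValueError("select one to three services")

    pool = ThreadPoolExecutor(max_workers=len(services))
    futures = {name: pool.submit(call, prompt) for name, call in services.items()}

    # One shared deadline, so every service gets the same time budget.
    deadline = time.monotonic() + timeout
    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = future.result(timeout=remaining)
        except FutureTimeout:
            # This sketch only stops waiting; it cannot cancel the remote call.
            results[name] = None
    pool.shutdown(wait=False)
    return results
```

A `None` entry in the result marks a service that did not answer in time, which corresponds to the timeout alert shown in that service's window.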

You can perform the following operations on the Live Comparison page:

**Table 1** Operations
| Operation | Description |
| --- | --- |
| Switch services | Click the service name and select the target service in the Switch services pop-up window. |
| Delete comparison services | Click the icon to the right of the service name to remove the service from the comparison. |
| Stop generating | While the model is responding, click Stop response in the input box to interrupt the response. |
| Regenerate response | Click the regenerate icon below the model's response to regenerate it. |
| Copy response | Click the copy icon below the model's response to copy it. |
| Provide feedback | Click the feedback icon below the model's response to provide feedback on the output. |
| Start new chat | Click New Chat in the upper-right corner to clear the current conversation. You can then click Service Comparison in the upper-right corner to select services again and start a new chat. |
| Clear chat | Click Clear Chat in the upper-right corner to clear the context of the current conversation. Subsequent inputs will not be affected by previous turns. |
| View history | View your past conversations in the left pane. You can click any history record to resume the conversation and continue asking questions. By default, your first question is used as the title of that history record. |
| Edit history titles | In the left pane, click the icon to the right of a conversation title and choose Edit Title. In the Edit Title dialog box, modify the title as needed and click OK. (Figure 1 Editing a title) |
| Delete history | In the left pane, click the icon to the right of a conversation title and choose Delete. In the Delete Chat dialog box, click OK to delete all records of that conversation. |

Service Parameter Configuration

When calling LLMs, you may find that the generated responses deviate significantly from your expectations. You can refine the model's output by adjusting decoding parameters to control its randomness and creativity. In essence, these parameters determine whether the model responds like a rigorous scientist or creates like a romantic poet.

**Table 2** Service parameters
| Parameter | Description | Example | Recommended Tuning Order |
| --- | --- | --- | --- |
| System persona | System persona of a custom model. Enter up to 1,000 characters. | You are a system AI assistant. | - |
| Temperature | Controls the randomness and creativity of the model's output. A higher temperature produces more unpredictable, more creative results; a lower temperature produces more predictable, more conservative results. Low (0.1): The model is extremely conservative and almost always chooses the highest-probability tokens. Ideal for scenarios requiring factual accuracy. High (0.9): The model becomes more "expressive" and selects lower-probability tokens more often. Ideal for creative tasks, though it may lead to hallucinations. | Prompt: Write a sentence using the word "sky." Temperature = 0.1 (conservative). Result: The sky is blue with a few white clouds. Characteristics: accurate, straightforward, and highly reproducible. Temperature = 0.9 (creative). Result: The sky resembled a jar of overturned blueberry jam, with stars floating within it. Characteristics: vivid, varied, and inconsistent across runs. | Primary adjustment |
| Top_P | Controls the diversity of the model's output. Rather than keeping a fixed number of candidates, Top_P uses a cumulative probability threshold: the model ranks tokens by probability and keeps the smallest set of top-ranked tokens whose cumulative probability reaches P (for example, 0.9). A larger value admits rarer vocabulary and produces more diverse text. | Top_P = 0.1: Only the most stable, top-tier tokens are considered. Top_P = 0.9: Long-tail vocabulary enters the candidate pool, resulting in more diverse wording. Note: Top_P is dynamic. If the model is confident about the next token, the pool is small; if it is uncertain, the pool expands. This generally adapts better than Top_K. | Secondary fine-tuning (used with Temperature) |
| Top_K | Caps the candidate pool at a fixed number (K) of top-ranked tokens. A smaller K produces smoother, more logically consistent sentences that may be dull or repetitive. A larger K produces richer, more creative sentences but increases the chance of implausible words (hallucinations). | Top_K = 1: Greedy decoding. The model always picks the top candidate (similar in effect to an extremely low temperature). Top_K = 50: Typically used to prevent the model from generating low-probability gibberish or incoherent characters. | Supplementary parameter (usually kept at default or a high value) |
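To make the interaction of the three decoding parameters concrete, the following sketch applies temperature scaling, then a Top_K cap, then a Top_P cutoff to a toy next-token distribution. It is an illustration under stated assumptions, not the inference engine's code: the function names and the example logits are made up, and real engines may apply the filters in a different order or with different defaults.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, Top_K, and Top_P filtering."""
    if temperature <= 0:
        # Treat a zero temperature as greedy decoding (assumption of this sketch).
        best = max(logits, key=logits.get)
        return {best: 1.0}

    # Temperature: divide logits before softmax; low values sharpen, high values flatten.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)

    # Top_K: keep a fixed number of top-ranked tokens (0 disables the cap).
    if top_k > 0:
        ranked = ranked[:top_k]
    total = sum(w for _, w in ranked)
    ranked = [(tok, w / total) for tok, w in ranked]

    # Top_P: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def sample_token(logits, rng=random, **params):
    """Draw one token from the filtered distribution."""
    dist = filter_distribution(logits, **params)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


# Toy logits for the word after "The sky is"; the values are invented.
logits = {"blue": 2.0, "clear": 1.0, "vast": 0.5, "jam": -1.0}
print(filter_distribution(logits, temperature=0.1, top_p=0.1))  # only "blue" survives
print(filter_distribution(logits, temperature=1.0, top_p=0.9))  # "jam" is cut off
```

With these toy values, a low temperature combined with a small Top_P leaves a single candidate, while the default temperature with Top_P = 0.9 keeps three candidates and discards only the long-tail token.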

The following table describes the parameter configurations for typical scenarios.

**Table 3** Parameter configurations for typical scenarios
| Application Scenario | Recommended Configuration | Desired Effect | Typical Use Case |
| --- | --- | --- | --- |
| Code generation; mathematical problem solving | Temp: 0.0 - 0.2; Top_P: 0.1 | Highly precise: Minimizes randomness to ensure logical correctness and strict grammatical adherence. | AI-assisted coding, SQL generation, and logical reasoning |
| Knowledge Q&A; customer service | Temp: 0.3 - 0.5; Top_P: 0.7 | Stable and natural: Ensures factual accuracy while maintaining a more human-like tone than a standard bot. | Intelligent customer service and RAG-based document QA |
| Copywriting; chit-chat | Temp: 0.7 - 0.9; Top_P: 0.9 | Rich and diverse: Utilizes a broad vocabulary and varied sentence structures to maximize creativity. | Marketing copy, creative writing/story extension, and role-playing |
| Brainstorming | Temp: 1.0+; Top_P: 0.95 | Unconstrained: Breaks away from conventional logic to find unexpected inspiration (requires manual filtering). | Creative ideation and naming |

Metrics

In addition to comparing the responses subjectively, you can use the technical metrics shown below each model's answer for quantitative evaluation.

Figure 2 Technical metrics

**Table 4** Model metrics
| Type | Name | Description |
| --- | --- | --- |
| Performance | Total Time | Total time required to complete the entire response. A shorter duration indicates higher inference performance. |
| Performance | Reasoning Time | Time the model spends on reasoning (thinking). |
| Performance | TTFT | Time to first token (TTFT) is the time from when a user sends a question to when the first token of the AI's answer shows up on the screen. A lower TTFT indicates faster initial responsiveness. |
| Performance | TPOT | Time per output token (TPOT) is the average time needed to create each token after the first one appears. A lower TPOT indicates faster and smoother text generation. |
| Consumption | Token Consumption | Displays the number of input tokens and output tokens for the session. This is used to estimate API call costs and resource consumption. |
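The latency metrics above follow directly from token arrival timestamps. The sketch below shows one way to derive them, assuming you record the time the question was sent and the time each output token arrived. The function name and input shape are hypothetical, and the console may measure these values differently (for example, at render time rather than arrival time).

```python
def latency_metrics(sent_at, token_times):
    """Compute total time, TTFT, and TPOT in seconds.

    sent_at:     timestamp at which the question was sent
    token_times: arrival timestamps of the output tokens, in ascending order
    """
    if not token_times:
        raise ValueError("no tokens received")

    ttft = token_times[0] - sent_at      # time to first token
    total = token_times[-1] - sent_at    # time to complete the whole response
    later_tokens = len(token_times) - 1  # tokens generated after the first one
    # TPOT averages the generation time over the tokens after the first.
    tpot = (total - ttft) / later_tokens if later_tokens else 0.0
    return {"total_time": total, "ttft": ttft, "tpot": tpot}


# Question sent at t = 0 s; five tokens arrive between 0.5 s and 0.9 s.
print(latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8, 0.9]))
```

In this example the first token arrives after 0.5 s and the remaining four tokens take 0.4 s in total, so TPOT is about 0.1 s per token.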
