Live Comparison
Overview
A critical challenge in the industrialization of AI is selecting the most suitable model from a vast array of foundation models and fine-tuned versions. The live comparison feature provides an intuitive evaluation platform, allowing you to perform side-by-side benchmarking of different models using identical inputs.
Core functions and value:
- Foundation model selection: In the early stages of a project, compare the performance of different models like DeepSeek, Qwen3, and GLM to find the best fit for your business needs.
- Fine-tuning validation: By performing synchronous comparisons between a native foundation model and its fine-tuned version, visually verify whether domain-specific knowledge has been successfully injected or if any catastrophic forgetting has occurred.
- Parameter strategy optimization (A/B testing): Compare output variations of the same model using different hyperparameters (such as Temperature or Top_P) to determine the optimal inference configuration.
Prerequisites
An inference service has been deployed. For details, see Deploying a Model as a Real-Time Service.
Constraints
- Model type restrictions: Currently, only text generation models in the LLM domain are supported for comparison. Models in other domains are not supported.
- Quantity limits: To ensure optimal frontend rendering performance and facilitate manual side-by-side evaluation, a maximum of three models can be compared simultaneously in a single task.
- Timeout limits: During a real-time comparison task, if a model fails to complete its response within 5 minutes due to reasoning latency or performance issues, the corresponding window will trigger a timeout alert and terminate the generation.
- Chat history limits: The system saves the 100 most recent chat history records for each IAM user in a workspace. When a new record would exceed this limit, the oldest record is deleted immediately. In addition, records older than 7 days are deleted automatically every day at 00:00.
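The retention rules above can be pictured with a short Python sketch. It is illustrative only and does not represent how the service is implemented: the function names and the in-memory list are hypothetical, and it assumes the 7-day rule removes every record older than 7 days when the daily job runs.

```python
from datetime import datetime, timedelta

MAX_RECORDS = 100            # cap per IAM user per workspace (from the constraint above)
MAX_AGE = timedelta(days=7)  # age limit applied by the daily cleanup (assumed reading)


def add_record(history, record):
    """Append a (created_at, title) record and evict the oldest beyond the cap.

    `history` is a list ordered from oldest to newest (hypothetical structure).
    """
    history.append(record)
    del history[:-MAX_RECORDS]  # keep only the newest MAX_RECORDS entries
    return history


def purge_expired(history, now):
    """Return the records that are not older than MAX_AGE at time `now`."""
    return [rec for rec in history if now - rec[0] <= MAX_AGE]
```

In this sketch, `add_record` models the immediate eviction at 100 records, and `purge_expired` models what a cleanup job scheduled at 00:00 might do.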
Procedure
- Log in to the ModelArts console. In the navigation pane on the left, choose Model Evaluation > Live Comparison.
There are multiple entry points for live comparison. In addition to the left navigation pane, you can use either of the following methods:
  - In the navigation pane on the left, choose Model Inference > Real-Time Inference. Click Live Comparison in the Operation column on the right.
  - In the navigation pane on the left, choose Model Inference > Real-Time Inference. Click the target service name to go to the service details page. Click Live Comparison in the upper right corner.
- In the upper right corner of the Live Comparison page, click Service Comparison. In the Live Comparison | Select Service dialog box, select one to three services as required and click OK.
- Configure service parameters as required.
Click the icon next to the service name to set parameters such as the system persona, temperature, and top_p. These parameters control how random and diverse the output is. To ensure a fair evaluation (controlled variables), keep the parameter configurations consistent across all selected services. For details about service parameters and typical scenarios, see Service Parameter Configuration.
- Click a preset question in the middle of the page or enter a question in the text box. Click the send icon or press Enter to send the question. Press Shift+Enter to start a new line.
  - The system sends the question to all selected models.
  - Both single-turn Q&A and multi-turn dialogues in the current context are supported.
If multiple models are selected for comparison, the system displays the generation results of each model in a side-by-side view. This allows you to evaluate and compare the models' performance regarding logical consistency, formatting accuracy, and semantic precision.
The total time and the reasoning time are displayed below each model's answer. For details, see Metrics.
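Conceptually, a comparison sends one prompt to every selected service and waits for each answer, up to the timeout described in Constraints. The following Python sketch shows that fan-out pattern under stated assumptions: `compare_services`, `TIMED_OUT`, and the callables passed in are hypothetical stand-ins, not part of any ModelArts API, and real requests would go through the deployed inference services.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

MAX_SERVICES = 3          # at most three models per comparison task
TIMEOUT_S = 300           # 5-minute limit per model window
TIMED_OUT = "[timed out]"  # hypothetical marker for a window that hit the limit


def compare_services(services, prompt, timeout_s=TIMEOUT_S):
    """Send `prompt` to every service in parallel and collect the answers.

    `services` maps a display name to a callable taking the prompt and
    returning the answer text (a stand-in for a real inference call).
    """
    if not 1 <= len(services) <= MAX_SERVICES:
        raise ValueError(f"select 1 to {MAX_SERVICES} services")
    pool = ThreadPoolExecutor(max_workers=len(services))
    futures = {name: pool.submit(fn, prompt) for name, fn in services.items()}
    deadline = time.monotonic() + timeout_s  # all requests start together
    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = future.result(timeout=remaining)
        except FutureTimeout:
            results[name] = TIMED_OUT
    pool.shutdown(wait=False)  # do not block on calls that are still running
    return results
```

Because every request receives the identical prompt, any difference between the answers comes from the models or their parameter settings.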
You can perform the following operations on the Live Comparison page:
Table 1 Operations

| Operation | Description |
|---|---|
| Switch services | Click the service name and select the target service in the Switch services pop-up window. |
| Delete comparison services | Click the icon to the right of the service name to remove the service from the comparison. |
| Stop generating | While the model is responding, click Stop response in the input box to interrupt the response. |
| Regenerate response | Click the icon below the model's response to regenerate it. |
| Copy response | Click the icon below the model's response to copy it. |
| Provide feedback | Click the icon below the model's response to provide feedback on the output. |
| Start new chat | Click New Chat in the upper-right corner to clear the current conversation. You can then click Service Comparison in the upper-right corner to select services again and start a new chat. |
| Clear chat | Click Clear Chat in the upper-right corner to clear the context of the current conversation. Subsequent inputs will not be affected by previous turns. |
| View history | View your past conversations in the left pane. You can click any history record to resume the conversation and continue asking questions. By default, your first question is used as the title of that history record. |
| Edit history titles | In the left pane, click the icon to the right of a conversation title and choose Edit Title. In the Edit Title dialog box, modify the title as needed and click OK. |
| Delete history | In the left pane, click the icon to the right of a conversation title and choose Delete. In the Delete Chat dialog box, click OK to delete all records of that conversation. |

Figure 1 Editing a title
Service Parameter Configuration
When calling LLMs, you may find that the generated responses deviate significantly from your expectations. You can refine the model's output by adjusting decoding parameters to control its randomness and creativity. In essence, these parameters determine whether the model responds like a rigorous scientist or creates like a romantic poet.
| Parameter | Description | Example | Recommended Tuning Order |
|---|---|---|---|
| System persona | Custom system persona for the model. Enter up to 1,000 characters. | You are a system AI assistant. | - |
| Temperature | Controls the randomness and creativity of the model's output. A higher temperature produces more unpredictable, more creative results. A lower temperature produces more predictable, more conservative results. | Prompt: Write a sentence using the word "sky." | Primary adjustment |
| Top_P | Controls the diversity of the model's output by using a cumulative probability threshold rather than a fixed count. The model ranks tokens by probability and keeps only the top-ranked tokens whose cumulative probability reaches the P value (for example, 0.9). Higher values allow a richer, though potentially rarer, vocabulary. | - | Secondary fine-tuning (used with Temperature) |
| Top_K | Caps the candidate pool at a fixed number (K) of top-ranked tokens. A smaller K produces smoother, more logically consistent sentences that may be dull or repetitive. A larger K keeps more candidates and produces richer, more creative sentences, but also increases the chance of implausible words (hallucinations). | - | Supplementary parameter (usually kept at default or a high value) |
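To make the interaction of the three decoding parameters concrete, here is a minimal, dependency-free Python sketch of one common way to apply them to a toy token distribution. It is not the implementation used by any particular inference engine: the function name `filter_distribution` and the example logits are hypothetical, the order (temperature, then Top_K, then Top_P) is one of several used in practice, and a temperature of 0 is treated as greedy selection by assumption.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} for the candidates left after filtering.

    logits: dict mapping token -> raw score (toy values, not real model output).
    top_k=0 disables the Top_K cap; top_p=1.0 disables the Top_P cutoff.
    """
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumption: temperature 0 means always pick the highest-scoring token.
        return {max(logits, key=logits.get): 1.0}

    # 1. Temperature: divide the logits, then apply a numerically stable softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)

    # 2. Top_K: keep at most K highest-probability tokens.
    if top_k > 0:
        ranked = ranked[:top_k]

    # 3. Top_P: keep the smallest prefix whose cumulative probability reaches P.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving candidates sum to 1 before sampling.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}
```

With toy logits such as `{"blue": 2.0, "clear": 1.0, "vast": 0.0, "purple": -1.0}`, lowering the temperature concentrates almost all probability on the top token, while lowering Top_P or Top_K shrinks the set of tokens that can be sampled at all.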
The following table describes the parameter configurations for typical scenarios.
| Application Scenario | Recommended Configuration | Desired Effect | Typical Use Case |
|---|---|---|---|
| Code generation; mathematical problem solving | Temperature: 0.0-0.2; Top_P: 0.1 | Highly precise: eliminates randomness to ensure logical correctness and strict adherence to syntax. | AI-assisted coding, SQL generation, and logical reasoning |
| Knowledge Q&A; customer service | Temperature: 0.3-0.5; Top_P: 0.7 | Stable and natural: ensures factual accuracy while maintaining a more human-like tone than a standard bot. | Intelligent customer service and RAG-based document QA |
| Copywriting; chit-chat | Temperature: 0.7-0.9; Top_P: 0.9 | Rich and diverse: uses a broad vocabulary and varied sentence structures to maximize creativity. | Marketing copy, creative writing/story extension, and role-playing |
| Brainstorming | Temperature: 1.0+; Top_P: 0.95 | Unconstrained: breaks away from conventional logic to find unexpected inspiration (requires manual filtering). | Creative ideation and naming |
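If you script your own experiments, the recommendations in the table can be kept as a small lookup. The sketch below is only one way to do that: the scenario keys and the `starting_config` helper are hypothetical, and picking the midpoint of each temperature range as a starting value is an assumption, not a documented rule.

```python
# Recommended ranges copied from the table above; keys are hypothetical labels.
PRESETS = {
    "code_math": {"temperature": (0.0, 0.2), "top_p": 0.1},
    "qa_support": {"temperature": (0.3, 0.5), "top_p": 0.7},
    "copywriting_chat": {"temperature": (0.7, 0.9), "top_p": 0.9},
    "brainstorming": {"temperature": (1.0, None), "top_p": 0.95},  # "1.0+"
}


def starting_config(scenario):
    """Return a starting temperature/top_p pair for a scenario.

    Assumption: start at the midpoint of the recommended temperature range,
    or at the lower bound when the range is open-ended.
    """
    low, high = PRESETS[scenario]["temperature"]
    temperature = low if high is None else round((low + high) / 2, 2)
    return {"temperature": temperature, "top_p": PRESETS[scenario]["top_p"]}
```

From such a starting point, adjust Temperature first and Top_P second, in line with the tuning order recommended earlier.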
Metrics
In addition to comparing the generated text subjectively, you can use the technical metrics displayed below each model's answer for quantitative evaluation.
| Type | Name | Description |
|---|---|---|
| Performance | Total Time | Total time required to complete the entire response. A shorter duration indicates higher inference performance. |
| Performance | Reasoning Time | Time spent on thinking. |
| Performance | TTFT | Time to first token (TTFT) is the time from when a user sends a question to when the first token of the AI's answer shows up on the screen. A lower TTFT indicates faster initial responsiveness. |
| Performance | TPOT | Time per output token (TPOT) is the average time needed to create each token after the first one appears. A lower TPOT indicates faster and smoother text generation. |
| Consumption | Token Consumption | Displays the number of input tokens and output tokens for the session. This is used to estimate API call costs and resource consumption. |
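The performance metrics follow directly from timestamps. The Python sketch below shows how they could be derived from the time a question is sent and the arrival time of each output token. The function `latency_metrics` and its inputs are hypothetical, and it measures token arrival rather than on-screen rendering, so values shown in the console may differ slightly.

```python
def latency_metrics(sent_at, token_times):
    """Compute Total Time, TTFT, and TPOT in seconds.

    sent_at: time the question was sent.
    token_times: arrival time of each output token, in ascending order.
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    ttft = token_times[0] - sent_at          # time to first token
    total_time = token_times[-1] - sent_at   # time to complete the response
    later_tokens = len(token_times) - 1      # tokens after the first one
    # TPOT averages the generation time over the tokens after the first.
    tpot = (token_times[-1] - token_times[0]) / later_tokens if later_tokens else 0.0
    return {"total_time": total_time, "ttft": ttft, "tpot": tpot}
```

For example, a question sent at t = 10.0 s whose five tokens arrive at 10.5, 10.75, 11.0, 11.25, and 11.5 s has a TTFT of 0.5 s, a total time of 1.5 s, and a TPOT of 0.25 s.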