Model Evaluation Features

Description

Model evaluation is the process of testing and measuring the performance of a foundation model in real-world scenarios. It shows how well the model actually performs on the tasks it is intended for.

A model with excellent performance must possess strong generalization capabilities. This means that the model should perform well not only on the provided data (training data) but also on unseen data. Model evaluation is how you verify that this is the case.

In the ModelArts model development process, model evaluation assesses new models after training. Only evaluated models can be deployed and used. This is a crucial step in the model development workflow.

Why Model Evaluation Matters

Model evaluation helps you identify the strengths and weaknesses of a model, ensuring its effectiveness in real-world applications and its ability to handle specific tasks while meeting relevant requirements.

When collecting evaluation datasets, ensure that the data is sampled independently and at random so that it is a representative sample of real-world data. This helps avoid biasing the evaluation result, so the result more accurately reflects the performance of the model in different scenarios. By evaluating a model on such a dataset, developers can understand the model's strengths and weaknesses and decide where to optimize.
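As an illustration of independent, random collection, the following minimal sketch draws a uniform random evaluation sample from records that were not used for training. The record layout (`id`, `text`) and the function name `sample_eval_set` are assumptions made for this example and are not part of ModelArts.

```python
import random


def sample_eval_set(records, train_ids, sample_size, seed=0):
    """Draw a uniform random evaluation sample that excludes training records."""
    # Keep only records the model has not seen during training.
    candidates = [r for r in records if r["id"] not in train_ids]
    if sample_size > len(candidates):
        raise ValueError("not enough unseen records for the requested sample size")
    # A seeded generator keeps the sample reproducible across runs.
    rng = random.Random(seed)
    # random.sample draws without replacement, so no record appears twice.
    return rng.sample(candidates, sample_size)


# Hypothetical data: 100 records, the first 80 of which were used for training.
records = [{"id": i, "text": f"example {i}"} for i in range(100)]
train_ids = set(range(80))
eval_set = sample_eval_set(records, train_ids, sample_size=10)
```

Sampling without replacement from unseen records only keeps the evaluation set disjoint from the training data, which is what allows it to measure generalization.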

Core value of model evaluation for developers:

Verify training effectiveness: Measure the degree of capability improvement following fine-tuning or incremental pre-training.
Identify optimization paths: Pinpoint model weaknesses in specific tasks to guide subsequent iterations.
Support deployment decisions: Use quantitative metrics to determine if a model meets production standards.
Compare model selection: Evaluate and select the most suitable model version for specific business scenarios from multiple candidates.
Ensure regulatory compliance: Provide quantitative evidence of model capabilities to support auditing and compliance requirements.
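The deployment-decision and model-selection points above can be sketched as a simple threshold check followed by a ranking. The metric names, threshold values, and candidate names below are hypothetical and only illustrate the idea; real production standards depend on the business scenario.

```python
def meets_thresholds(metrics, thresholds):
    """Return True if every required metric reaches its minimum value."""
    # A metric that is missing from the results counts as failing.
    return all(
        metrics.get(name, 0.0) >= minimum for name, minimum in thresholds.items()
    )


def select_candidate(candidates, thresholds, rank_by):
    """Pick the eligible candidate with the highest value of `rank_by`."""
    eligible = {
        name: metrics
        for name, metrics in candidates.items()
        if meets_thresholds(metrics, thresholds)
    }
    if not eligible:
        return None  # No candidate meets the production standard.
    return max(eligible, key=lambda name: eligible[name][rank_by])


# Hypothetical evaluation results for three model versions.
candidates = {
    "base": {"accuracy": 0.71, "f1": 0.68},
    "finetuned-v1": {"accuracy": 0.83, "f1": 0.80},
    "finetuned-v2": {"accuracy": 0.86, "f1": 0.74},
}
thresholds = {"accuracy": 0.80, "f1": 0.75}
best = select_candidate(candidates, thresholds, rank_by="f1")
```

In this example only `finetuned-v1` clears both thresholds, so it is selected even though `finetuned-v2` has the higher accuracy.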

Model Evaluation Scenarios

Model evaluation primarily assesses a model's knowledge retention and text comprehension capabilities. These capabilities can be classified into general capabilities and industry-specific capabilities. The following sections describe the application scenarios of general capability evaluation and industry capability evaluation.

General Capability Evaluation

General capabilities: Evaluated mainly with tasks that use general-domain datasets, such as text classification, logical reasoning, sentiment analysis, and question-answering (QA) systems.

Typical scenarios

Text classification accuracy evaluation
Logical reasoning capability assessment
Sentiment analysis accuracy evaluation
Reading comprehension and QA system evaluation
Text summarization quality assessment
Machine translation fluency evaluation

Recommended dataset sources: ModelArts provides management features for open-source evaluation sets, enabling you to easily leverage these datasets for more precise and efficient LLM evaluations.

Industry-specific Capability Evaluation

Industry capabilities: Evaluated mainly with tasks that use domain-specific datasets, such as financial entity recognition, financial text classification, and debt collection intent recognition.

Typical scenarios

Finance: Entity recognition, contract clause classification, and risk control intent recognition
Healthcare: Medical Q&A, medical record summarization, and drug information extraction

Recommended dataset sources: Create custom evaluation sets. To evaluate a model's domain-specific knowledge, build evaluation sets from datasets drawn from the same domain as the target task, covering tasks such as entity recognition, text classification, or content generation. Use precision, recall, and F-score as the primary evaluation metrics.
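For reference, precision, recall, and the F1 variant of the F-score can be computed for an entity recognition task as in the sketch below. It assumes that each entity is represented as a `(text, label)` pair and that only exact matches count as correct. The function name and the sample entities are illustrative and do not describe how ModelArts computes these metrics internally.

```python
def precision_recall_f1(predicted, reference):
    """Compute precision, recall, and F1 over sets of (text, label) entities."""
    predicted, reference = set(predicted), set(reference)
    # True positives: entities that appear in both the prediction and the reference.
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one financial document.
reference = {("Acme Bank", "ORG"), ("2024-01-15", "DATE"), ("5,000 USD", "MONEY")}
predicted = {
    ("Acme Bank", "ORG"),
    ("2024-01-15", "DATE"),
    ("John", "PERSON"),    # Spurious entity: lowers precision.
    ("5,000", "MONEY"),    # Wrong span: counts as a miss and a false positive.
}
precision, recall, f1 = precision_recall_f1(predicted, reference)
```

Here two of the four predicted entities are correct (precision 0.5) and two of the three reference entities are found (recall 2/3), which gives an F1 of 4/7.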

Model Evaluation Types

ModelArts supports two types of model evaluation: automated evaluation and human evaluation.

Automated Evaluation

Automated evaluation: Two types are supported, rule-based and LLM-based.

Rule-based: Automatically evaluates model-generated responses based on similarity or accuracy. You can use professional datasets pre-configured in evaluation templates or upload custom datasets.

Applicability: Closed-ended tasks with clear standard answers, such as classification, entity recognition, and multiple-choice QA.

Operation: The system automatically compares model outputs with reference answers in the dataset, calculating scores based on similarity algorithms or accuracy rules.
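The comparison step can be pictured with the following sketch, which scores outputs against reference answers using exact-match accuracy and a character-level similarity ratio. The source does not specify which similarity algorithm or accuracy rule the system uses, so `difflib.SequenceMatcher` from the Python standard library is a stand-in, and the normalization rule is an assumption.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_accuracy(outputs, references):
    """Fraction of outputs that equal their reference after normalization."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not outputs:
        return 0.0
    hits = sum(normalize(o) == normalize(r) for o, r in zip(outputs, references))
    return hits / len(outputs)


def mean_similarity(outputs, references):
    """Average SequenceMatcher ratio (0.0 to 1.0) between outputs and references."""
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")
    if not outputs:
        return 0.0
    ratios = [
        SequenceMatcher(None, normalize(o), normalize(r)).ratio()
        for o, r in zip(outputs, references)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical sentiment-classification outputs and reference labels.
outputs = ["Positive", "negative ", "neutral"]
references = ["positive", "negative", "negative"]
accuracy = exact_match_accuracy(outputs, references)
similarity = mean_similarity(outputs, references)
```

Exact match suits closed-ended tasks such as classification, where an answer is either right or wrong. A similarity ratio is more forgiving and fits short free-text answers that may differ slightly from the reference.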

LLM-based: Uses an LLM to automatically score the outputs of the model under test. This is suitable for open-ended or complex QA scenarios and includes scoring mode and comparison mode.

Applicability: Open-ended tasks without a single correct answer, such as creative writing, open-ended QA, and dialogue generation.

Sub-modes: See Table 1.

Table 1 LLM-based evaluation sub-modes
Sub-mode	Description	Typical Use Case
Scoring mode	Uses a judge LLM to provide multi-dimensional scores for a model's output.	Assessing the generation quality of a single model
Comparison mode	Uses a judge LLM to compare outputs from two models simultaneously and determine which is superior.	Model A/B testing and selection
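The two sub-modes in Table 1 can be sketched as follows. The judge is modeled as any callable that takes a prompt string and returns a reply string. This interface, the prompt wording, the 1 to 5 scale, and `stub_judge` are all assumptions for illustration and do not reflect the actual ModelArts judge configuration.

```python
def score_output(judge, question, answer, dimensions):
    """Scoring mode: ask the judge for a 1-5 score on each dimension."""
    scores = {}
    for dimension in dimensions:
        prompt = (
            f"Rate the answer for {dimension} on a scale of 1 to 5. "
            "Reply with a single digit.\n"
            f"Question: {question}\nAnswer: {answer}"
        )
        reply = judge(prompt).strip()
        # Replies that are not a valid score are recorded as None.
        scores[dimension] = int(reply) if reply in {"1", "2", "3", "4", "5"} else None
    return scores


def compare_outputs(judge, question, answer_a, answer_b):
    """Comparison mode: ask the judge which of two answers is better."""
    prompt = (
        "Which answer is better? Reply with A or B only.\n"
        f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
    )
    reply = judge(prompt).strip().upper()
    return reply if reply in {"A", "B"} else None


def stub_judge(prompt):
    """Placeholder that stands in for a real judge LLM call."""
    return "4" if prompt.startswith("Rate") else "A"


scores = score_output(stub_judge, "What is a bond?", "A debt security.", ["accuracy", "fluency"])
winner = compare_outputs(stub_judge, "What is a bond?", "A debt security.", "A type of stock.")
```

Validating the judge's reply before using it matters in practice, because a judge LLM may return text outside the requested format.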

Human Evaluation

Human evaluation: Evaluates model-generated responses using manually created datasets and specific evaluation criteria. During the process, human evaluators score the responses based on predefined metrics. Once completed, an evaluation report is generated based on these scores.

Applicability: Scenarios that require subjective human judgment on dimensions such as style, tone, professionalism, and safety, which are difficult to measure with automated rules.

Operation: Evaluate and score each data entry on the human evaluation page. Once all data has been reviewed, click the submit button to submit the results.
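As a rough picture of how per-entry human scores could be turned into a report, the sketch below checks that every data entry has been reviewed and then averages the scores for each metric. The rating layout (`entry_id`, `metric`, `score`) and the function name `build_report` are hypothetical. The actual report generated by ModelArts may contain different fields.

```python
from collections import defaultdict
from statistics import mean


def build_report(ratings, expected_entry_ids):
    """Average human scores per metric once every entry has been reviewed."""
    reviewed = {rating["entry_id"] for rating in ratings}
    missing = sorted(set(expected_entry_ids) - reviewed)
    if missing:
        # Mirrors the rule that results are submitted only after all data is reviewed.
        raise ValueError(f"entries not yet reviewed: {missing}")
    scores_by_metric = defaultdict(list)
    for rating in ratings:
        scores_by_metric[rating["metric"]].append(rating["score"])
    return {metric: mean(scores) for metric, scores in scores_by_metric.items()}


# Hypothetical ratings for two data entries on two predefined metrics.
ratings = [
    {"entry_id": 1, "metric": "tone", "score": 4},
    {"entry_id": 1, "metric": "safety", "score": 5},
    {"entry_id": 2, "metric": "tone", "score": 2},
    {"entry_id": 2, "metric": "safety", "score": 5},
]
report = build_report(ratings, expected_entry_ids=[1, 2])
```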
