Model Evaluation Features
Description
Model evaluation is the process of testing and measuring how a foundation model performs in real-world scenarios. It is the basis for understanding what the model can and cannot do.
A high-performing model must generalize well: it should perform well not only on the training data but also on unseen data. Model evaluation is how you verify this.
In the ModelArts model development process, model evaluation assesses new models after training. Only evaluated models can be deployed and used. This is a crucial step in the model development workflow.
Why Model Evaluation Matters
Model evaluation helps you identify the strengths and weaknesses of a model, ensuring its effectiveness in real-world applications and its ability to handle specific tasks while meeting relevant requirements.
When collecting evaluation datasets, make sure the samples are independent and randomly drawn so that they are representative of real-world data. This avoids biasing the evaluation results, so that they more accurately reflect the model's performance across scenarios. By evaluating a model on such a dataset, developers can understand its strengths and weaknesses and identify directions for optimization.
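One way to draw an independent, random evaluation sample is sketched below. This is a minimal illustration and not a ModelArts API: the function name `sample_evaluation_set` and its parameters are hypothetical, and it assumes the candidate records fit in memory as a list.

```python
import random


def sample_evaluation_set(records, eval_size, seed=0):
    """Draw a uniform random sample without replacement.

    Returns (eval_set, remainder). The two lists never share a record
    position, so the evaluation set stays separate from the remaining data.
    """
    if not 0 <= eval_size <= len(records):
        raise ValueError("eval_size must be between 0 and len(records)")
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    indices = list(range(len(records)))
    rng.shuffle(indices)
    eval_idx = set(indices[:eval_size])
    eval_set = [records[i] for i in sorted(eval_idx)]
    remainder = [r for i, r in enumerate(records) if i not in eval_idx]
    return eval_set, remainder
```

Fixing the seed keeps the evaluation set stable across runs, which makes scores from different model versions comparable.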
Core value of model evaluation for developers:
- Verify training effectiveness: Measure the degree of capability improvement following fine-tuning or incremental pre-training.
- Identify optimization paths: Pinpoint model weaknesses in specific tasks to guide subsequent iterations.
- Support deployment decisions: Use quantitative metrics to determine if a model meets production standards.
- Compare candidate models: Evaluate multiple candidates and select the most suitable model version for a specific business scenario.
- Ensure regulatory compliance: Provide quantitative evidence of model capabilities to support auditing and compliance requirements.
Model Evaluation Scenarios
Model evaluation primarily assesses a model's knowledge retention and text comprehension capabilities. These capabilities can be classified into general capabilities and industry-specific capabilities. The following sections describe the application scenarios of general capability evaluation and industry capability evaluation.
General Capability Evaluation
General capabilities: Evaluation tasks that use general-domain datasets, such as text classification, logical reasoning, sentiment analysis, and question-answering (QA) systems.
Typical scenarios
- Text classification accuracy evaluation
- Logical reasoning capability assessment
- Sentiment analysis accuracy evaluation
- Reading comprehension and QA system evaluation
- Text summarization quality assessment
- Machine translation fluency evaluation
Recommended dataset sources: ModelArts provides management features for open-source evaluation sets, so you can use these datasets directly in your LLM evaluations.
Industry-specific Capability Evaluation
Industry capabilities: Evaluation tasks that use domain-specific datasets, such as financial entity recognition, financial text classification, and debt collection intent recognition.
Typical scenarios
- Finance: Entity recognition, contract clause classification, and risk control intent recognition
- Healthcare: Medical Q&A, medical record summarization, and drug information extraction
Recommended dataset sources: Create custom evaluation sets. To evaluate a model's domain-specific knowledge, build evaluation sets from homologous datasets (data drawn from the same source or domain as the target task) for tasks such as entity recognition, text classification, or content generation. Use precision, recall, and F-score as the primary evaluation metrics.
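The metrics named above can be computed as in the following sketch for an entity recognition task. It assumes predictions and references are given as collections of hashable items such as `(text, label)` pairs. The function name is hypothetical, and the formulas are the standard definitions, not necessarily the exact computation ModelArts performs.

```python
def precision_recall_fscore(predicted, reference, beta=1.0):
    """Standard precision, recall, and F-beta over two sets of items.

    beta=1.0 gives the usual F1 score (harmonic mean of precision and recall).
    """
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return precision, recall, 0.0
    fscore = (1 + beta ** 2) * precision * recall / denominator
    return precision, recall, fscore


# Illustrative data only: entities extracted from one financial sentence.
predicted = {("Acme Bank", "ORG"), ("2024-01-01", "DATE"), ("John", "PER")}
reference = {("Acme Bank", "ORG"), ("2024-01-01", "DATE"),
             ("USD 5,000", "MONEY"), ("Jane", "PER")}
precision, recall, f1 = precision_recall_fscore(predicted, reference)
```

Here two of the three predicted entities are correct (precision 2/3) and two of the four reference entities are found (recall 1/2), which gives an F1 of 4/7.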
Model Evaluation Types
ModelArts supports both automated and human model evaluation.
Automated Evaluation
Automated evaluation supports two types: rule-based and LLM-based.
Rule-based: Automatically evaluates model-generated responses based on similarity or accuracy. You can use professional datasets pre-configured in evaluation templates or upload custom datasets.
Applicability: Closed-ended tasks with clear standard answers, such as classification, entity recognition, and multiple-choice QA.
Operation: The system automatically compares model outputs with reference answers in the dataset, calculating scores based on similarity algorithms or accuracy rules.
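The comparison step can be pictured with the sketch below. The source does not specify which similarity algorithms or accuracy rules ModelArts applies, so this example substitutes two common stand-ins: exact match after normalization for accuracy, and `difflib.SequenceMatcher` from the Python standard library for similarity. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def _check_lengths(outputs, references):
    if len(outputs) != len(references):
        raise ValueError("outputs and references must have the same length")


def exact_match_accuracy(outputs, references):
    """Fraction of outputs equal to their reference answer after normalization."""
    _check_lengths(outputs, references)
    if not outputs:
        return 0.0
    hits = sum(normalize(o) == normalize(r) for o, r in zip(outputs, references))
    return hits / len(outputs)


def mean_similarity(outputs, references):
    """Average character-level similarity ratio in the range [0, 1]."""
    _check_lengths(outputs, references)
    if not outputs:
        return 0.0
    ratios = [
        SequenceMatcher(None, normalize(o), normalize(r)).ratio()
        for o, r in zip(outputs, references)
    ]
    return sum(ratios) / len(ratios)
```

Exact match suits classification and multiple-choice answers, while a similarity ratio tolerates small wording differences in short free-text answers.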
LLM-based: Uses an LLM to automatically score the outputs of the model under test. This is suitable for open-ended or complex QA scenarios and includes scoring mode and comparison mode.
Applicability: Open-ended tasks without a single correct answer, such as creative writing, open-ended QA, and dialogue generation.
Sub-modes: See Table 1.
| Sub-mode | Description | Typical Use Case |
|---|---|---|
| Scoring mode | Uses a judge LLM to provide multi-dimensional scores for a model's output. | Assessing the generation quality of a single model |
| Comparison mode | Uses a judge LLM to compare outputs from two models simultaneously and determine which is superior. | Model A/B testing and selection |
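The two sub-modes can be sketched as follows. This is an illustration under stated assumptions, not the ModelArts implementation: the judge is passed in as a plain callable, and the `toy_judge_*` functions are hypothetical stand-ins that use simple heuristics where a real setup would call a judge LLM.

```python
from collections import Counter
from statistics import mean


def run_scoring_mode(samples, judge_score):
    """samples: (prompt, response) pairs.

    judge_score returns {dimension: score} for one response.
    Returns the mean score per dimension across all samples.
    """
    per_dimension = {}
    for prompt, response in samples:
        for dimension, score in judge_score(prompt, response).items():
            per_dimension.setdefault(dimension, []).append(score)
    return {dim: mean(scores) for dim, scores in per_dimension.items()}


def run_comparison_mode(samples, judge_compare):
    """samples: (prompt, response_a, response_b) triples.

    judge_compare returns "A", "B", or "tie". Returns the verdict counts.
    """
    tally = Counter({"A": 0, "B": 0, "tie": 0})
    for prompt, response_a, response_b in samples:
        verdict = judge_compare(prompt, response_a, response_b)
        if verdict not in tally:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        tally[verdict] += 1
    return dict(tally)


def toy_judge_score(prompt, response):
    # Stand-in for a judge LLM: scores come from length heuristics only.
    return {
        "completeness": min(5, len(response.split())),
        "conciseness": 5 if len(response) < 80 else 2,
    }


def toy_judge_compare(prompt, response_a, response_b):
    # Stand-in for a judge LLM: prefers the longer response.
    if len(response_a) == len(response_b):
        return "tie"
    return "A" if len(response_a) > len(response_b) else "B"
```

Passing the judge as a callable keeps the aggregation logic independent of whichever judge model is used.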
Human Evaluation
Human evaluation: Evaluates model-generated responses using manually created datasets and specific evaluation criteria. During the process, human evaluators score the responses based on predefined metrics. Once completed, an evaluation report is generated based on these scores.
Applicability: Scenarios that require subjective human judgment on dimensions such as style, tone, professionalism, and safety, which are difficult to measure with automated rules.
Operation: Evaluate and score each data entry on the human evaluation page. Once all data has been reviewed, click the submit button to submit the results.
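How per-entry human scores might roll up into a report is sketched below. The source does not describe the report format, so averaging per metric is an assumption, and `build_report` is a hypothetical helper. The check for unscored entries mirrors the requirement that all data be reviewed before submission.

```python
from statistics import mean


def build_report(scores, metrics):
    """scores: one dict per data entry, mapping metric name to a score.

    A missing metric or a None value means the entry is not fully reviewed.
    Returns the mean score per metric.
    """
    if not scores:
        raise ValueError("no scored entries")
    pending = [
        index for index, entry in enumerate(scores)
        if any(entry.get(metric) is None for metric in metrics)
    ]
    if pending:
        raise ValueError(f"entries not fully scored: {pending}")
    return {metric: mean(entry[metric] for entry in scores) for metric in metrics}
```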