Viewing the DeepSeek Model Evaluation Report
After an evaluation job is created, you can view its evaluation report. The procedure is as follows:
- Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 1 My Spaces
- In the navigation pane, choose Model Development > Model Evaluation > Job Management.
- Click Assessment report in the Operation column. On the displayed page, you can view the basic information and overview of the evaluation job.
For details about each evaluation metric, see DeepSeek Model Evaluation Metrics.
- Export the evaluation report.
- On the Evaluation Report > Evaluation Details page, click Export, select the report to be exported, and click OK.
- Click Export Record on the right to view the export task ID. Click Download in the Operation column to download the evaluation report to your local PC.
DeepSeek Model Evaluation Metrics
DeepSeek models support both automatic evaluation and manual evaluation. For details about the metrics, see Table 1, Table 2, and Table 3.
Table 1 Automatic evaluation metrics (evaluation template not used)

| Evaluation Metric (Automatic Evaluation - Evaluation Template Not Used) | Description |
| --- | --- |
| F1_SCORE | Harmonic mean of precision and recall. A larger value indicates better model performance. |
| BLEU-1 | Matching degree between the sentence generated by the model and the actual sentence at the single-word level. A larger value indicates better model performance. |
| BLEU-2 | Matching degree between the sentence generated by the model and the actual sentence at the phrase (two-word) level. A larger value indicates better model performance. |
| BLEU-4 | Weighted average accuracy of the model-generated sentences compared with the actual sentences. A larger value indicates better model performance. |
| ROUGE-1 | Similarity between the sentence generated by the model and the actual sentence at the single-word level. A larger value indicates better model performance. |
| ROUGE-2 | Similarity between the sentence generated by the model and the actual sentence at the two-word level. A larger value indicates better model performance. |
| ROUGE-L | Similarity between the sentence generated by the model and the actual sentence at the longest common subsequence (LCS) level. A larger value indicates better model performance. |
| PRECISION | Accuracy of Q&A matching, that is, the accuracy of the sentences generated by the model compared with the actual sentences. A larger value indicates better model performance. |
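For reference, the sketch below shows how token-level F1 and a BLEU-style n-gram precision are commonly computed. It is a minimal illustration of the underlying formulas (whitespace tokenization, BLEU brevity penalty omitted), not the platform's implementation.

```python
from collections import Counter

def f1_score(reference: str, candidate: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over word overlap."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    overlap = sum((Counter(ref_tokens) & Counter(cand_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def ngram_precision(reference: str, candidate: str, n: int = 1) -> float:
    """Modified n-gram precision, the core of BLEU-n (brevity penalty omitted)."""
    ref = reference.split()
    cand = candidate.split()
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    if not cand_ngrams:
        return 0.0
    matched = sum((ref_ngrams & cand_ngrams).values())
    return matched / sum(cand_ngrams.values())

print(f1_score("the cat sat on the mat", "the cat is on the mat"))              # ≈ 0.833
print(ngram_precision("the cat sat on the mat", "the cat is on the mat", n=2))  # 0.6
```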
Table 2 Automatic evaluation metrics (evaluation template used)

| Evaluation Metric (Automatic Evaluation - Evaluation Template Used) | Description |
| --- | --- |
| Score | The score of each dataset is the model's pass rate on that dataset. If a capability item contains multiple datasets, the weighted average pass rate is calculated based on the data volume of each dataset. |
| Comprehensive capability | Weighted average of the pass rates of all datasets. |
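The weighting works as in the sketch below, which aggregates hypothetical per-dataset results (the `pass_rate` and `data_volume` field names are assumptions for illustration, not the platform's data format):

```python
def weighted_pass_rate(datasets):
    """Average pass rate weighted by each dataset's data volume."""
    total_volume = sum(d["data_volume"] for d in datasets)
    return sum(d["pass_rate"] * d["data_volume"] for d in datasets) / total_volume

# Hypothetical capability item containing two datasets.
datasets = [
    {"pass_rate": 0.90, "data_volume": 200},
    {"pass_rate": 0.60, "data_volume": 100},
]
print(weighted_pass_rate(datasets))  # ≈ 0.8
```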
Table 3 Manual evaluation metrics

| Evaluation Metric (Manual Evaluation) | Description |
| --- | --- |
| Accuracy | The answer generated by the model is correct and contains no factual errors. |
| average | Average score of the model-generated sentences compared with the actual sentences, calculated based on the evaluation metric. |
| goodcase | Proportion of test cases whose generated sentence scores 5 when compared with the actual sentence based on the evaluation metric. |
| badcase | Proportion of test cases whose generated sentence scores less than 1 when compared with the actual sentence based on the evaluation metric. |
| Custom metric | Custom metrics, such as usability, logic, and security. |
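The average, goodcase, and badcase values can be thought of as the following aggregation over per-case scores. This is a minimal sketch assuming a 0-5 scoring scale, not the platform's implementation.

```python
def summarize_manual_scores(scores):
    """Illustrative aggregation of per-case manual scores (assumed 0-5 scale)."""
    n = len(scores)
    return {
        "average":  sum(scores) / n,                          # mean score across cases
        "goodcase": sum(1 for s in scores if s == 5) / n,     # share of cases scored exactly 5
        "badcase":  sum(1 for s in scores if s < 1) / n,      # share of cases scored below 1
    }

print(summarize_manual_scores([5, 4, 0, 5, 3]))
# {'average': 3.4, 'goodcase': 0.4, 'badcase': 0.2}
```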