
Viewing the DeepSeek Model Evaluation Report

After an evaluation job is created, you can view its evaluation report. The procedure is as follows:

  1. Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
    Figure 1 My Spaces
  2. In the navigation pane, choose Model Development > Model Evaluation > Job Management.
  3. Click Assessment report in the Operation column. On the displayed page, you can view the basic information and overview of the evaluation job.

    For details about each evaluation metric, see DeepSeek Model Evaluation Metrics.

  4. Export the evaluation report.
    1. On the Evaluation Report > Evaluation Details page, click Export, select the report to be exported, and click OK.
    2. Click Export Record on the right to view the export task ID. In the Operation column, click Download to save the evaluation report to your local PC.

DeepSeek Model Evaluation Metrics

The DeepSeek model supports both automatic and manual evaluation. For details about the metrics, see Table 1, Table 2, and Table 3.

Table 1 Automatic evaluation metrics of the DeepSeek model (not using preset evaluation datasets)

  F1_SCORE: Harmonic mean of precision and recall. A larger value indicates better model performance.
  BLEU-1: Degree to which the model-generated sentence matches the reference sentence at the single-word level. A larger value indicates better model performance.
  BLEU-2: Degree to which the model-generated sentence matches the reference sentence at the phrase (two-word) level. A larger value indicates better model performance.
  BLEU-4: Weighted average of the n-gram matching precision (up to 4-grams) between the model-generated sentence and the reference sentence. A larger value indicates better model performance.
  ROUGE-1: Similarity between the model-generated sentence and the reference sentence at the single-word level. A larger value indicates better model performance.
  ROUGE-2: Similarity between the model-generated sentence and the reference sentence at the two-word level. A larger value indicates better model performance.
  ROUGE-L: Similarity between the model-generated sentence and the reference sentence based on the longest common subsequence (LCS). A larger value indicates better model performance.
  PRECISION: Accuracy of Q&A matching, that is, how accurately the model-generated sentences reproduce the reference sentences. A larger value indicates better model performance.
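The text-overlap metrics above are standard NLP measures. The following Python sketch gives simplified, illustrative implementations of a few of them; it is not the platform's actual code, and details such as tokenization, smoothing, and the BLEU brevity penalty are omitted. The function names and sample sentences are made up for this example.

from collections import Counter

def tokens(text):
    """Very simple whitespace tokenizer (real tokenization may differ)."""
    return text.lower().split()

def precision_recall_f1(candidate, reference):
    """Token-level PRECISION, recall, and F1_SCORE between two sentences."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # tokens present in both sentences
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision, the building block of BLEU-n."""
    def ngrams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(tokens(candidate)), ngrams(tokens(reference))
    return sum((cand & ref).values()) / max(sum(cand.values()), 1)

def rouge_l_recall(candidate, reference):
    """ROUGE-L style recall based on the longest common subsequence (LCS)."""
    c, r = tokens(candidate), tokens(reference)
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]  # LCS length table
    for i, ct in enumerate(c, 1):
        for j, rt in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ct == rt else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / max(len(r), 1)

generated = "the cat sat on the mat"          # hypothetical model output
reference = "the cat is sitting on the mat"   # hypothetical reference answer
p, _, f1 = precision_recall_f1(generated, reference)
print(f"PRECISION={p:.2f}  F1_SCORE={f1:.2f}")
print(f"BLEU-1={ngram_precision(generated, reference, 1):.2f}")
print(f"BLEU-2={ngram_precision(generated, reference, 2):.2f}")
print(f"ROUGE-L={rouge_l_recall(generated, reference):.2f}")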

Table 2 Automatic evaluation metrics of the DeepSeek model (using preset evaluation datasets)

  Score: Pass rate of the model on each dataset. If a capability item contains multiple datasets, the weighted average pass rate is calculated based on the data volume of each dataset.
  Comprehensive capability: Weighted average of the pass rates across all datasets.
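As a small worked example of the weighting described above, the following sketch combines per-dataset pass rates into a comprehensive capability score. The dataset names, sample counts, and pass rates are hypothetical values chosen for illustration, not platform output.

# Hypothetical per-dataset results; "samples" is the data volume used as the weight.
datasets = [
    {"name": "dataset_a", "samples": 400, "pass_rate": 0.85},
    {"name": "dataset_b", "samples": 100, "pass_rate": 0.65},
]
total_samples = sum(d["samples"] for d in datasets)
comprehensive = sum(d["pass_rate"] * d["samples"] for d in datasets) / total_samples
print(f"Comprehensive capability: {comprehensive:.2%}")  # 81.00%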

Table 3 Manual evaluation metrics of the DeepSeek model

  Accuracy: The answer generated by the model is correct and contains no factual errors.
  average: Average score obtained by comparing the model-generated sentences with the reference sentences based on the evaluation metric.
  goodcase: Proportion of test cases that score 5 when the model-generated sentence is compared with the reference sentence based on the evaluation metric.
  badcase: Proportion of test cases that score less than 1 when the model-generated sentence is compared with the reference sentence based on the evaluation metric.
  Custom metric: User-defined metrics, such as usability, logic, and security.
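To illustrate how the average, goodcase, and badcase statistics above relate to the underlying manual scores, the sketch below aggregates a hypothetical list of per-test-case scores. A 0-5 scoring scale and the sample values are assumptions for illustration only.

# Hypothetical manual scores per test case, assuming a 0-5 scale.
scores = [5, 4, 5, 0.5, 3, 5, 2]

average = sum(scores) / len(scores)                         # mean score across test cases
goodcase = sum(1 for s in scores if s == 5) / len(scores)   # proportion scoring 5
badcase = sum(1 for s in scores if s < 1) / len(scores)     # proportion scoring below 1
print(f"average={average:.2f}  goodcase={goodcase:.1%}  badcase={badcase:.1%}")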