
Viewing the DeepSeek Model Evaluation Report

After an evaluation job is created, you can view its evaluation report. The procedure is as follows:

  1. Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
    Figure 1 My Spaces
  2. In the navigation pane, choose Model Development > Model Evaluation > Job Management.
  3. Click Assessment report in the Operation column. On the displayed page, you can view the basic information and overview of the evaluation job.

    For details about each evaluation metric, see DeepSeek Model Evaluation Metrics.

  4. Export the evaluation report.
    1. On the Evaluation Report > Evaluation Details page, click Export, select the report to be exported, and click OK.
    2. Click Export Record on the right to view the export task ID. In the Operation column, click Download to save the evaluation report to your local PC.

DeepSeek Model Evaluation Metrics

The DeepSeek model supports both automatic and manual evaluation. For details about the metrics, see Table 1, Table 2, and Table 3.

Table 1 Automatic evaluation metrics of the DeepSeek model (not using preset evaluation datasets)

  F1_SCORE: Harmonic mean of precision and recall. A larger value indicates better model performance.
  BLEU-1: Degree to which the model-generated sentence matches the reference sentence at the single-word level. A larger value indicates better model performance.
  BLEU-2: Degree to which the model-generated sentence matches the reference sentence at the phrase (two-word) level. A larger value indicates better model performance.
  BLEU-4: Weighted average of the n-gram matching precision (up to 4-grams) between the model-generated sentence and the reference sentence. A larger value indicates better model performance.
  ROUGE-1: Similarity between the model-generated sentence and the reference sentence at the single-word level. A larger value indicates better model performance.
  ROUGE-2: Similarity between the model-generated sentence and the reference sentence at the two-word level. A larger value indicates better model performance.
  ROUGE-L: Similarity between the model-generated sentence and the reference sentence based on the longest common subsequence (LCS). A larger value indicates better model performance.
  PRECISION: Accuracy of Q&A matching, that is, how accurately the model-generated sentences reproduce the reference sentences. A larger value indicates better model performance.
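The text-overlap metrics above are standard NLP measures. The following Python sketch gives simplified, illustrative implementations of a few of them; it is not the platform's actual code, and details such as tokenization, smoothing, and the BLEU brevity penalty are omitted. The function names and sample sentences are made up for this example.

from collections import Counter

def tokens(text):
    """Very simple whitespace tokenizer (real tokenization may differ)."""
    return text.lower().split()

def precision_recall_f1(candidate, reference):
    """Token-level PRECISION, recall, and F1_SCORE between two sentences."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # tokens present in both sentences
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def ngram_precision(candidate, reference, n):
    """Modified n-gram precision, the building block of BLEU-n."""
    def ngrams(toks):
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    cand, ref = ngrams(tokens(candidate)), ngrams(tokens(reference))
    return sum((cand & ref).values()) / max(sum(cand.values()), 1)

def rouge_l_recall(candidate, reference):
    """ROUGE-L style recall based on the longest common subsequence (LCS)."""
    c, r = tokens(candidate), tokens(reference)
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]  # LCS length table
    for i, ct in enumerate(c, 1):
        for j, rt in enumerate(r, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ct == rt else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / max(len(r), 1)

generated = "the cat sat on the mat"          # hypothetical model output
reference = "the cat is sitting on the mat"   # hypothetical reference answer
p, _, f1 = precision_recall_f1(generated, reference)
print(f"PRECISION={p:.2f}  F1_SCORE={f1:.2f}")
print(f"BLEU-1={ngram_precision(generated, reference, 1):.2f}")
print(f"BLEU-2={ngram_precision(generated, reference, 2):.2f}")
print(f"ROUGE-L={rouge_l_recall(generated, reference):.2f}")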

Table 2 Automatic evaluation metrics of the DeepSeek model (using preset evaluation datasets)

  Score: Pass rate of the model on each dataset. If a capability item contains multiple datasets, the weighted average pass rate is calculated based on the data volume of each dataset.
  Comprehensive capability: Weighted average of the pass rates across all datasets.
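As a small worked example of the weighting described above, the following sketch combines per-dataset pass rates into a comprehensive capability score. The dataset names, sample counts, and pass rates are hypothetical values chosen for illustration, not platform output.

# Hypothetical per-dataset results; "samples" is the data volume used as the weight.
datasets = [
    {"name": "dataset_a", "samples": 400, "pass_rate": 0.85},
    {"name": "dataset_b", "samples": 100, "pass_rate": 0.65},
]
total_samples = sum(d["samples"] for d in datasets)
comprehensive = sum(d["pass_rate"] * d["samples"] for d in datasets) / total_samples
print(f"Comprehensive capability: {comprehensive:.2%}")  # 81.00%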

Table 3 Manual evaluation metrics of the DeepSeek model

  Accuracy: The answer generated by the model is correct and contains no factual errors.
  average: Average score obtained by comparing the model-generated sentences with the reference sentences based on the evaluation metric.
  goodcase: Proportion of test cases that score 5 when the model-generated sentence is compared with the reference sentence based on the evaluation metric.
  badcase: Proportion of test cases that score less than 1 when the model-generated sentence is compared with the reference sentence based on the evaluation metric.
  Custom metric: User-defined metrics, such as usability, logic, and security.
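To illustrate how the average, goodcase, and badcase statistics above relate to the underlying manual scores, the sketch below aggregates a hypothetical list of per-test-case scores. A 0-5 scoring scale and the sample values are assumptions for illustration only.

# Hypothetical manual scores per test case, assuming a 0-5 scale.
scores = [5, 4, 5, 0.5, 3, 5, 2]

average = sum(scores) / len(scores)                         # mean score across test cases
goodcase = sum(1 for s in scores if s == 5) / len(scores)   # proportion scoring 5
badcase = sum(1 for s in scores if s < 1) / len(scores)     # proportion scoring below 1
print(f"average={average:.2f}  goodcase={goodcase:.1%}  badcase={badcase:.1%}")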