Updated on 2025-07-28 GMT+08:00

Creating a DeepSeek Model Evaluation Job

Before creating a DeepSeek model evaluation job, ensure that the operations in Creating a DeepSeek Model Evaluation Dataset have been completed.

Pre-trained NLP models cannot be evaluated.

Creating a Rule-based Automatic Evaluation Task for a DeepSeek Model

To create a rule-based automatic evaluation task for a DeepSeek model, perform the following steps:

  1. Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
    Figure 1 My Spaces
  2. In the navigation pane, choose Evaluation Center > Evaluation task. Click Create automatic task in the upper right corner.
  3. On the Create automatic Evaluation task page, set the parameters by referring to Table 1.
    Table 1 Parameters for a rule-based automatic evaluation task of a DeepSeek model

    Service Selection

    • Model Type: Select Large Language Models.
    • Service Source: Two options are available: Deploying services and External services. A maximum of 10 models can be evaluated at a time.
      • Deploying services: Select a model deployed on ModelArts Studio for evaluation.
      • External services: Access an external model through its API for evaluation. If you select External services, enter the API name, API address, request body, and response body of the external model.
        • The request body can be in OpenAI, TGI, or custom format. The OpenAI format is the request format defined and standardized by OpenAI; the TGI format is the request format used by Hugging Face Text Generation Inference (TGI). (A minimal request example is sketched after these steps.)
        • The response body must be specified using JsonPath syntax, which extracts the required data from the JSON response returned by the API.

    Evaluation Configurations

    • Evaluation Rules: Select Rule-based. Scoring is performed automatically by comparing the model's predictions with the labeled data using similarity or accuracy metrics. This mode is suitable for standard multiple-choice questions and simple Q&A scenarios.
    • Evaluation Dataset:
      • Preset evaluation dataset: Use a preset professional dataset for evaluation.
      • Single review set: Specify the evaluation metrics (F1 score, accuracy, BLEU, and ROUGE) and upload your own evaluation dataset. If you select Single review set, you must upload the dataset used for the evaluation. (A sketch of how such metrics are typically computed follows these steps.)
    • Storage Location of Evaluation Results: Path for storing the model evaluation result.

    Basic Information

    • Task Name: Enter the evaluation job name.
    • Description: Enter the evaluation job description.

  4. After setting the parameters, click Create Now. The Evaluation Task > Automatic evaluation page is displayed.
  5. When the status is Completed, you can click Evaluation Report in the Operation column to view the model evaluation result, including the detailed score and evaluation details.
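
For reference, the following is a minimal sketch of what an OpenAI-format request body and a JsonPath response expression can look like when you register an external service. The endpoint URL, API key, and model name are placeholders, and the jsonpath-ng library is only one way to test a JsonPath expression locally; it is not part of the platform.

```python
# Minimal sketch: an OpenAI-format request body and a JsonPath expression for the
# response. The URL, API key, and model name below are placeholders, not real values.
import requests
from jsonpath_ng import parse  # pip install requests jsonpath-ng

API_URL = "https://example.com/v1/chat/completions"  # hypothetical external service
API_KEY = "<your-api-key>"

# OpenAI-format request body: a chat-completion request with a messages list.
request_body = {
    "model": "deepseek-v3",  # placeholder model name
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.2,
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=request_body,
    timeout=60,
)
response_json = response.json()

# JsonPath expression of the kind entered in the response body field: it extracts
# the generated text from an OpenAI-format response.
matches = parse("$.choices[0].message.content").find(response_json)
print(matches[0].value if matches else "No match for the JsonPath expression")
```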

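The rule-based metrics mentioned above (F1 score, accuracy, BLEU, and ROUGE) all reduce to comparing each model prediction against the labeled answer. The snippet below is an illustrative sketch of exact-match accuracy and a token-level F1 score only; it is not the platform's implementation, and production BLEU or ROUGE scoring would typically rely on a dedicated library.

```python
# Illustrative sketch of rule-based scoring: exact-match accuracy and token-level F1
# between model predictions and labeled answers. Not the platform's implementation.
from collections import Counter

def accuracy(predictions, references):
    """Fraction of predictions that exactly match the labeled answer."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

def token_f1(prediction, reference):
    """Token-overlap F1 between one prediction and one labeled answer."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

predictions = ["Paris", "The answer is 42"]
references = ["Paris", "42"]
print("accuracy:", accuracy(predictions, references))
print("mean F1 :", sum(token_f1(p, r) for p, r in zip(predictions, references)) / len(references))
```
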
Creating an LLM-based Automatic Evaluation Task for a DeepSeek Model

To create an LLM-based automatic evaluation task for a DeepSeek model, perform the following steps:

  1. Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
    Figure 2 My Spaces
  2. In the navigation pane, choose Evaluation Center > Evaluation task. Click Create automatic task in the upper right corner.
  3. On the Create automatic Evaluation task page, set parameters by referring to Table 2.
    Table 2 Parameters for an LLM-based automatic evaluation task of a DeepSeek model

    Service Selection

    • Model Type: Select Large Language Models.
    • Service Source: Two options are available: Deploying services and External services. A maximum of 10 models can be evaluated at a time.
      • Deploying services: Select a model deployed on ModelArts Studio for evaluation.
      • External services: Access an external model through its API for evaluation. If you select External services, enter the API name, API address, request body, and response body of the external model.
        • The request body can be in OpenAI, TGI, or custom format. The OpenAI format is the request format defined and standardized by OpenAI; the TGI format is the request format used by Hugging Face Text Generation Inference (TGI).
        • The response body must be specified using JsonPath syntax, which extracts the required data from the JSON response returned by the API.

    Evaluation Configurations

    • Evaluation Rules: Select Based on large models. A more capable referee model automatically scores the outputs of the evaluated models. This approach is suitable for open-ended or complex Q&A scenarios.
    • Select Mode:
      • Grading mode: The referee model automatically scores each model's inference result based on the configured scoring criteria.
      • Comparison mode: The referee model compares the performance of two models on each question; the result is win, lose, or tie. In comparison mode, two services must be selected as the service source, and the first service is used as the benchmark model by default.
    • Scoring Prompt Template:
      • In grading mode, the default template is score_prompt. The prompt contains the standard reply (reference answer) for the current scenario and is sent to the referee model in the scoring phase. (A minimal sketch of this scoring flow follows these steps.)
      • In comparison mode, the default template is arena_prompt. The prompt contains the standard reply for the current scenario and is used by the referee model to compare the two services.
      You can adjust the evaluation dimensions, scoring metrics, and scoring steps in the Variables area on the right.
    • Evaluation Dataset: Select the dataset to be evaluated. In the multi-turn Q&A scenario, only automatic evaluation based on large models is supported; in that case, select a multi-turn Q&A evaluation dataset.
    • Storage Location of Evaluation Results: Path for storing the model evaluation result.

    Referee Configuration

    • Referee Model: Select a deployed service or an external service to act as the referee model.
    • Scoring Rules: Scoring rules can be customized. The referee model scores or compares the evaluated models' results based on the configured rules.

    Basic Information

    • Task Name: Enter the evaluation job name.
    • Description: Enter the evaluation job description.

  4. After setting the parameters, click Create Now. The Evaluation Task > Automatic evaluation page is displayed.
  5. When the status is Completed, you can click Evaluation Report in the Operation column to view the model evaluation result, including the detailed score and evaluation details.
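
To make grading mode more concrete, the sketch below shows the general LLM-as-judge flow: the question, the standard reply, and the evaluated model's answer are filled into a scoring prompt, the prompt is sent to the referee model, and a numeric score is parsed from its reply. The endpoint, model name, and prompt wording are placeholders; the platform's built-in score_prompt template may differ.

```python
# Illustrative LLM-as-judge flow for grading mode. The endpoint, model name, and
# prompt wording are placeholders; the platform's built-in score_prompt may differ.
import re
import requests

REFEREE_URL = "https://example.com/v1/chat/completions"  # hypothetical referee service
REFEREE_KEY = "<referee-api-key>"

SCORE_PROMPT = (
    "You are a strict grader. Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {answer}\n"
    "Rate the model answer from 1 to 5 for correctness and completeness. "
    "Reply with the score only."
)

def judge(question: str, reference: str, answer: str) -> int:
    """Ask the referee model to score one answer and parse the numeric score."""
    body = {
        "model": "deepseek-r1",  # placeholder referee model name
        "messages": [{"role": "user", "content": SCORE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        "temperature": 0.0,
    }
    resp = requests.post(
        REFEREE_URL,
        headers={"Authorization": f"Bearer {REFEREE_KEY}"},
        json=body,
        timeout=60,
    )
    reply = resp.json()["choices"][0]["message"]["content"]
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 0

print(judge("What is 2 + 2?", "4", "The answer is 4."))
```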

Creating a Manual Evaluation Job for a DeepSeek Model

To create a manual evaluation job for a DeepSeek model, perform the following steps:

  1. Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
    Figure 3 My Spaces
  2. In the navigation pane, choose Evaluation Center > Evaluation task. Click Create manual task in the upper right corner.
  3. On the Create manual Evaluation task page, set parameters by referring to Table 3.
    Table 3 Parameters of creating a manual evaluation job for a DeepSeek model

    Service Selection

    • Model Type: Select Large Language Models.
    • Service Source: Two options are available: Deploying services and External services. A maximum of 10 models can be evaluated at a time.
      • Deploying services: Select a model deployed on ModelArts Studio for evaluation.
      • External services: Access an external model through its API for evaluation. If you select External services, enter the API name, API address, request body, and response body of the external model.
        • The request body can be in OpenAI, TGI, or custom format. The OpenAI format is the request format defined and standardized by OpenAI; the TGI format is the request format used by Hugging Face Text Generation Inference (TGI).
        • The response body must be specified using JsonPath syntax, which extracts the required data from the JSON response returned by the API.

    Evaluation Configurations

    • Evaluation Indicators: Customize the evaluation metrics and define the evaluation criteria for each metric.
    • Evaluation Dataset: Select the evaluation dataset to be used.
    • Storage Location of Evaluation Results: Path for storing the model evaluation result.

    Basic Information

    • Task Name: Enter the evaluation job name.
    • Description: Enter the evaluation job description.

  4. After setting the parameters, click Create Now. The Evaluation Task > Manual evaluation page is displayed.
  5. When the status is To be evaluated, you can click Online Evaluation in the Operation column to go to the evaluation page.
  6. Score the model outputs in the evaluation area as prompted. After all data is evaluated, click Submit.
  7. On the Manual evaluation tab page, check that the status of the evaluation job is Completed. Click Assessment report in the Operation column to view the model evaluation result.