
Creating a DeepSeek Model Evaluation Dataset

A high-performing model must generalize well: it should perform well not only on the data it was trained on but also on unseen data. Model evaluation is essential for verifying this. When collecting evaluation datasets, ensure that the samples are independent and randomly selected so that they represent real-world data. This avoids biasing the evaluation results and gives a more accurate picture of the model's performance across scenarios. By evaluating a model on such a dataset, developers can understand its strengths and weaknesses and identify directions for optimization.

The DeepSeek model supports both manual and automatic evaluation modes.

  • Manual evaluation: The answers generated by the model are evaluated against a manually created evaluation dataset and evaluation metrics. During evaluation, you manually score the model's answers based on the created evaluation items. After the evaluation is complete, an evaluation report is generated from the scores.
  • Automatic evaluation: Two types of evaluation rules are supported: rule-based and LLM-based.
    • Rule-based: The answers generated by the model are automatically scored against rules (similarity or accuracy); a conceptual sketch is provided after Table 1. You can use the professional datasets preset in the evaluation template or customize your own evaluation datasets. For details about the supported preset datasets, see Table 1.
    • LLM-based: An LLM is used to automatically score the answers generated by the evaluated model, which is suitable for open-ended or complex Q&A scenarios. Both grading mode and comparison mode are supported.
      Table 1 Preset evaluation sets

      | Type | Input Dataset | Description |
      | --- | --- | --- |
      | General knowledge and skills | Common sense | Evaluates mastery of basic knowledge and information about daily life, including history, geography, and culture. |
      | General knowledge and skills | Mathematical capability | Evaluates the model's ability to solve mathematical problems, including arithmetic operations, algebraic equation solving, and geometric shape analysis. |
      | General knowledge and skills | Logical reasoning | Evaluates the model's ability to perform sound inference and analysis based on known information, involving logical thinking processes such as deduction and induction. |
      | General knowledge and skills | Chinese capability | Evaluates the model's advanced knowledge and inference capabilities in the Chinese language and cultural context. |
      | Domain knowledge | Finance | Evaluates the model's capabilities in the financial field, including quick understanding and interpretation of complex financial concepts, risk prediction and data analysis, investment suggestions, and financial decision-making support. |
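
The exact similarity and accuracy rules applied by the evaluation template are not detailed here. The following is a minimal sketch of what rule-based scoring might look like conceptually, assuming each record carries hypothetical "question", "reference", and "model_answer" fields; accuracy is treated as exact match and similarity as a simple string ratio. It illustrates the idea only and is not the platform's implementation.

```python
# Minimal, illustrative sketch of rule-based automatic evaluation.
# Assumptions (not the platform's actual API): each record has "question",
# "reference", and "model_answer" fields; "accuracy" means exact match and
# "similarity" is a rough string-similarity ratio.
from difflib import SequenceMatcher

def accuracy_score(reference: str, answer: str) -> float:
    """1.0 if the answer matches the reference exactly (ignoring case and whitespace)."""
    return float(reference.strip().lower() == answer.strip().lower())

def similarity_score(reference: str, answer: str) -> float:
    """Rough string similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, reference.strip(), answer.strip()).ratio()

records = [
    {"question": "What is 2 + 2?", "reference": "4", "model_answer": "4"},
    {"question": "Capital of France?", "reference": "Paris", "model_answer": "The capital is Paris."},
]

for rec in records:
    acc = accuracy_score(rec["reference"], rec["model_answer"])
    sim = similarity_score(rec["reference"], rec["model_answer"])
    print(f'{rec["question"]!r}: accuracy={acc:.1f}, similarity={sim:.2f}')
```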

Importing or Uploading Evaluation Datasets from a Local PC, OBS Files, or Data Engineering Datasets

  1. Log in to ModelArts Studio and access the required workspace.
  2. In the navigation pane, choose Evaluation Center > Evaluation Set Management. Click Create evaluation set in the upper right corner.
  3. On the Create evaluation set page, select the required file content, file format, and import source, and select a path to upload the data file.
    Figure 1 Importing an evaluation set
    Figure 2 Importing an evaluation set

    | Model Type | Evaluation Dataset Format | Size |
    | --- | --- | --- |
    | DeepSeek model | json/csv/xlsx | A file uploaded from a local PC can be at most 10 MB; a file uploaded from OBS can be at most 100 MB. |
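
Before uploading, it can help to verify the file against the format and size limits in the table above. The snippet below is an illustrative local check only; the function name and constants are assumptions, not part of ModelArts Studio.

```python
# Quick pre-upload check based on the limits in the table above:
# local uploads up to 10 MB, OBS uploads up to 100 MB, format json/csv/xlsx.
# Illustrative helper only, not a ModelArts Studio API.
import os

ALLOWED_EXTENSIONS = {".json", ".csv", ".xlsx"}
MAX_BYTES = {"local": 10 * 1024 * 1024, "obs": 100 * 1024 * 1024}

def check_evaluation_file(path: str, source: str = "local") -> None:
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Unsupported format {ext}; expected one of {sorted(ALLOWED_EXTENSIONS)}")
    size = os.path.getsize(path)
    if size > MAX_BYTES[source]:
        raise ValueError(f"File is {size} bytes, exceeding the {source} upload limit of {MAX_BYTES[source]} bytes")
    print(f"{path}: OK ({size} bytes, {ext}, source={source})")

# Example usage (assumes the file exists locally):
# check_evaluation_file("my_eval_set.csv", source="local")
```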

Creating a Dataset Using Data Engineering

The procedure for creating an evaluation dataset is the same as that for creating a training dataset. This section outlines only the main steps. For details, see Using Data Engineering to Build a DeepSeek Model Dataset.

  1. Log in to ModelArts Studio and access the required workspace.
  2. In the navigation pane, choose Data Engineering > Data Acquisition. On the Import Task page, click Create Import Job in the upper right corner.
  3. On the Create Import Job page, select the required file content, file format, and import source, and click Select Path to upload the data file.
    Table 2 lists the formats supported for the DeepSeek model evaluation dataset. A sketch of a single-round QA JSONL file is provided after this procedure.
    Table 2 Evaluation dataset format

    | Model Type | Evaluation Dataset Format |
    | --- | --- |
    | DeepSeek model | Text - Single Round QA - jsonl |

  4. After the data file is uploaded, enter the dataset name and description, and click Create Now.
  5. In the navigation pane, choose Data Engineering > Data Publishing > Publish Task. On the displayed page, click Create Data Publish Task in the upper right corner.
  6. On the Create Data Publish Task page, select a dataset modality and a dataset file.
  7. Click Next, select a format, enter the name, select the dataset visibility, and click OK.

    To evaluate a Pangu model, you need to publish the dataset in Pangu format.
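
The Text - Single Round QA - jsonl format stores one QA pair as a JSON object per line. The field names required by Data Engineering are not reproduced here; the sketch below assumes hypothetical "question" and "answer" keys, so adjust the keys to match the format described in Using Data Engineering to Build a DeepSeek Model Dataset.

```python
# Minimal sketch: write a single-round QA evaluation set as JSONL (one JSON
# object per line). The "question"/"answer" field names are illustrative
# assumptions; use the field names required by the Data Engineering format.
import json

qa_pairs = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is 12 * 12?", "answer": "144"},
]

with open("deepseek_eval_set.jsonl", "w", encoding="utf-8") as f:
    for pair in qa_pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```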

Setting Prompts

Set a prompt template for the evaluation set.

  1. Binding prompts during evaluation set import

    Import an NLP model evaluation set in Excel, CSV, or JSONL format, and fill in the prompt template and prompt name columns in the uploaded file. A sketch of such a file is shown at the end of this section.

  2. Setting a prompt template for an existing evaluation set
    Click Evaluation Set Management. On the Custom Evaluation Dataset page, select the evaluation set and click its name. The evaluation set details page is displayed.
    Figure 3 Custom evaluation set
    In the Configure prompt words area, select a category name and then click Configure prompt words.
    Figure 4 Basic information
    Figure 5 Configuring prompts
    Figure 6 Configured
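
For binding prompts during import, the evaluation file carries prompt information alongside each QA pair. The snippet below is a minimal sketch of such a CSV file; the column names ("question", "answer", "prompt_name", "prompt_template") are illustrative assumptions, so use the column headers required by the evaluation set import template.

```python
# Minimal sketch: build a CSV evaluation set that carries prompt information
# alongside each QA pair. Column names are illustrative assumptions; replace
# them with the headers required by the evaluation set import template.
import csv

rows = [
    {
        "question": "Summarize the following financial report in one sentence.",
        "answer": "Revenue grew 12% year over year, driven by cloud services.",
        "prompt_name": "finance-summary",
        "prompt_template": "You are a financial analyst. {question}",
    },
]

with open("eval_set_with_prompts.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```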