Creating a DeepSeek Model Evaluation Job
Before creating a DeepSeek model evaluation job, ensure that the operations in Creating a DeepSeek Model Evaluation Dataset have been completed.
Pre-trained NLP models cannot be evaluated.
Creating a Rule-based Automatic Evaluation Task for a DeepSeek Model
To create a rule-based automatic evaluation task for a DeepSeek model, perform the following steps:
- Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 1 My Spaces
- In the navigation pane, choose Evaluation Center > Evaluation task. Click Create automatic task in the upper right corner.
- On the Create automatic Evaluation task page, set the parameters by referring to Table 1.
Table 1 Parameters for a rule-based automatic evaluation task of a DeepSeek model

Service Selection
- Model Type: Select Large Language Models.
- Service Source: Two options are available: Deploying services and External services. A maximum of 10 models can be evaluated at a time.
  - Deploying services: Select a model deployed on ModelArts Studio for evaluation.
  - External services: Access external models through APIs for evaluation. If you select External services, enter the API name, API address, request body, and response body of the external model (see the request body sketch after this procedure).
    - The request body can be in OpenAI, TGI, or custom format. The OpenAI format is the large model request format standardized by OpenAI; the TGI format is the large model request format introduced by the Hugging Face team.
    - The response body must be specified using JsonPath syntax, which extracts the required data from the JSON fields of the response.

Evaluation Configurations
- Evaluation Rules: Select Rule-based. Scoring is performed automatically based on rules, that is, on similarity or accuracy, by comparing the model's prediction with the labeled data. This mode is suitable for standard multiple-choice questions or simple Q&A scenarios (see the metric sketch after this procedure).
- Evaluation Dataset:
  - Preset evaluation dataset: Use a preset professional dataset for evaluation.
  - Single review set: Specify the evaluation metrics (F1 score, accuracy, BLEU, and ROUGE) and upload the dataset to be used for evaluation.
- Storage Location of Evaluation Results: Path for storing the model evaluation results.

Basic Information
- Task Name: Enter the evaluation job name.
- Description: Enter the evaluation job description.
- After setting the parameters, click Create Now. The Evaluation Task > Automatic evaluation page is displayed.
- When the status is Completed, you can click Evaluation Report in the Operation column to view the model evaluation result, including the detailed score and evaluation details.
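If you use External services, the request body and response body fields may be unfamiliar. The following is a minimal sketch of what they might look like for an external model that exposes an OpenAI-compatible chat-completions API; the API name, API address, model identifier, and field values are illustrative assumptions, not values defined by ModelArts Studio.

```python
# A minimal sketch of an External services configuration, assuming the external model
# exposes an OpenAI-compatible chat-completions API. All names and values below are
# illustrative assumptions, not defaults defined by ModelArts Studio.
import json

API_NAME = "deepseek-external"                            # hypothetical API name
API_ADDRESS = "https://example.com/v1/chat/completions"   # hypothetical API address

# Request body in OpenAI format (the "openai" option). For the TGI format you would
# instead send something like {"inputs": "...", "parameters": {...}}.
request_body = {
    "model": "deepseek-chat",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.1,
    "max_tokens": 1024,
}

# Response body: a JsonPath expression pointing at the generated text inside the
# JSON returned by an OpenAI-compatible service.
response_jsonpath = "$.choices[0].message.content"

# Local sanity check of what that JsonPath selects from a typical response.
sample_response = {
    "choices": [{"message": {"role": "assistant", "content": "Paris."}}]
}
print(json.dumps(request_body, indent=2))
print(response_jsonpath, "->", sample_response["choices"][0]["message"]["content"])
```

The JsonPath `$.choices[0].message.content` matches the nesting shown in `sample_response`; adjust the path to wherever your service places the generated text.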
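For rule-based evaluation, each model prediction is compared with the labeled answer using metrics such as accuracy and F1 score. The sketch below illustrates, under simple whitespace-tokenization assumptions, how such a comparison can be computed; it is an illustration of the idea, not the platform's scoring implementation.

```python
# A minimal sketch of rule-based scoring: compare predictions with labeled answers
# using exact-match accuracy and token-level F1. Illustration only, not the scoring
# code used by ModelArts Studio.
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized prediction equals the labeled answer, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and the labeled answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

samples = [  # (model prediction, labeled answer)
    ("B", "B"),
    ("The answer is Paris", "Paris"),
]
accuracy = sum(exact_match(p, r) for p, r in samples) / len(samples)
f1 = sum(token_f1(p, r) for p, r in samples) / len(samples)
print(f"accuracy={accuracy:.2f}, F1={f1:.2f}")
```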
Creating an LLM-based Automatic Evaluation Task for a DeepSeek Model
To create an LLM-based automatic evaluation task for a DeepSeek model, perform the following steps:
- Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 2 My Spaces
- In the navigation pane, choose Evaluation Center > Evaluation task. Click Create automatic task in the upper right corner.
- On the Create automatic Evaluation task page, set parameters by referring to Table 2.
Table 2 Parameters for an LLM-based automatic evaluation task of a DeepSeek model

Service Selection
- Model Type: Select Large Language Models.
- Service Source: Two options are available: Deploying services and External services. A maximum of 10 models can be evaluated at a time.
  - Deploying services: Select a model deployed on ModelArts Studio for evaluation.
  - External services: Access external models through APIs for evaluation. If you select External services, enter the API name, API address, request body, and response body of the external model.
    - The request body can be in OpenAI, TGI, or custom format. The OpenAI format is the large model request format standardized by OpenAI; the TGI format is the large model request format introduced by the Hugging Face team.
    - The response body must be specified using JsonPath syntax, which extracts the required data from the JSON fields of the response.

Evaluation Configurations
- Evaluation Rules: Based on large models: a more capable model (the referee model) automatically scores the results generated by the evaluated models. This approach is suitable for open-ended or complex Q&A scenarios.
- Select Mode:
  - Grading mode: The referee model automatically scores the model inference results based on the configured scoring criteria (see the judge-scoring sketch after this procedure).
  - Comparison mode: The referee model compares the performance of two models on each question; the result can be win, lose, or tie. In comparison mode, two services must be selected as the service source, and the first service is used as the benchmark model by default.
- Scoring Prompt Template:
  - In grading mode, the default template is score_prompt. The prompt contains the standard answer for the current scenario and is sent to the referee model for scoring during the scoring phase.
  - In comparison mode, the default template is arena_prompt. The prompt contains the standard answer for the current scenario and is used by the referee model to compare the strengths and weaknesses of the two services.
  - You can modify the evaluation dimensions, scoring metrics, and scoring steps in the Variables area on the right.
- Evaluation Dataset: Select the dataset to be evaluated. In the NLP multi-turn Q&A scenario, only LLM-based automatic evaluation is supported; select a multi-turn Q&A evaluation dataset.
- Storage Location of Evaluation Results: Path for storing the model evaluation results.

Referee Configuration
- Referee Model: Select a deployed service or an external service.
- Scoring Rules: Scoring rules can be customized. The referee model scores or compares model results based on the configured rules.

Basic Information
- Task Name: Enter the evaluation job name.
- Description: Enter the evaluation job description.
- After setting the parameters, click Create Now. The Evaluation Task > Automatic evaluation page is displayed.
- When the status is Completed, you can click Evaluation Report in the Operation column to view the model evaluation result, including the detailed score and evaluation details.
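To make the grading mode more concrete, the sketch below shows one way a referee model could be prompted to score an answer against a standard answer. The endpoint URL, model name, prompt wording, and score range are assumptions for illustration; they are not the platform's score_prompt template or referee API.

```python
# A minimal sketch of LLM-based grading: the referee model is prompted with the
# question, the standard (reference) answer, and the evaluated model's answer,
# and asked to return a score. Endpoint, model name, and prompt wording are
# illustrative assumptions, not the platform's score_prompt template.
import re
import requests  # assumes the referee model is reachable via an OpenAI-compatible API

REFEREE_URL = "https://example.com/v1/chat/completions"  # hypothetical referee endpoint

SCORING_PROMPT = (
    "You are a strict grader. Score the candidate answer from 1 to 5 for accuracy "
    "and relevance to the question, using the reference answer as the standard.\n"
    "Question: {question}\nReference answer: {reference}\nCandidate answer: {candidate}\n"
    "Reply with only the numeric score."
)

def judge(question: str, reference: str, candidate: str) -> int:
    """Ask the referee model for a 1-5 score and parse the number from its reply."""
    prompt = SCORING_PROMPT.format(question=question, reference=reference, candidate=candidate)
    resp = requests.post(
        REFEREE_URL,
        json={"model": "judge-model", "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    reply = resp.json()["choices"][0]["message"]["content"]
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 0

# Example call (requires a reachable referee service):
# print(judge("What is the capital of France?", "Paris", "The capital is Paris."))
```

In comparison (arena) mode the same idea applies, except the prompt presents two candidate answers and the referee returns win, lose, or tie instead of a numeric score.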
Creating a Manual Evaluation Job for a DeepSeek Model
To create a manual evaluation job for a DeepSeek model, perform the following steps:
- Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 3 My Spaces
- In the navigation pane, choose Evaluation Center > Evaluation task. Click Create manual task in the upper right corner.
- On the Create manual Evaluation task page, set parameters by referring to Table 3.
Table 3 Parameters for creating a manual evaluation job for a DeepSeek model

Service Selection
- Model Type: Select Large Language Models.
- Service Source: Two options are available: Deploying services and External services. A maximum of 10 models can be evaluated at a time.
  - Deploying services: Select a model deployed on ModelArts Studio for evaluation.
  - External services: Access external models through APIs for evaluation. If you select External services, enter the API name, API address, request body, and response body of the external model.
    - The request body can be in OpenAI, TGI, or custom format. The OpenAI format is the large model request format standardized by OpenAI; the TGI format is the large model request format introduced by the Hugging Face team.
    - The response body must be specified using JsonPath syntax, which extracts the required data from the JSON fields of the response.

Evaluation Configurations
- Evaluation Indicators: Customize the evaluation metrics and fill in the evaluation standards.
- Evaluation Dataset: Select the evaluation dataset.
- Storage Location of Evaluation Results: Path for storing the model evaluation results.

Basic Information
- Task Name: Enter the evaluation job name.
- Description: Enter the evaluation job description.
- After setting the parameters, click Create Now. The Evaluation Task > Manual evaluation page is displayed.
- When the status is To be evaluated, you can click Online Evaluation in the Operation column to go to the evaluation page.
- Score the model results in the evaluation area as prompted. After all data is evaluated, click Submit.
- On the Manual evaluation tab page, check that the status of the evaluation job is Completed. Click Assessment report in the Operation column to view the model evaluation result.