Creating a Third-Party Model Evaluation Job
Before creating a third-party model evaluation job, ensure that the operations in Creating a Third-Party Model Evaluation Dataset have been completed.
Creating a Rule-based Automatic Evaluation Job for a Third-Party Model
To create a rule-based automatic evaluation job for a third-party model, perform the following steps:
- Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 1 My Spaces
- In the navigation pane, choose Evaluation Center > Evaluation Task. Click Create automatic task in the upper right corner.
- On the Create automatic Evaluation task page, set parameters by referring to Table 1.
Table 1 Parameters for a rule-based automatic evaluation task of a third-party model

Service selection
- Assessment Type: Select Large Language Models.
- Service source: Currently, only external services can be used to call APIs for evaluation. A maximum of 10 models can be evaluated at a time.
  External services: Access external models through APIs for evaluation. When you select External services, enter the API name, API address, request body, and response body of the external model.
  - The request body can be in the OpenAI, TGI, or custom format. The OpenAI format is a large model request format developed and standardized by OpenAI. The TGI format is a large model request format launched by the Hugging Face team.
  - The response body must be entered based on JsonPath syntax, which is used to extract the required data from the JSON fields of the response. For details about JsonPath, see https://github.com/json-path/JsonPath.
  A sample OpenAI-format request body and JsonPath expression are sketched after this procedure.

Evaluation Configurations
- Evaluation Rules: Select Rule-based. Scoring is performed automatically based on rules, that is, by comparing the model's predictions with the labeled data using similarity or accuracy. This approach is suitable for standard multiple-choice questions or simple Q&A scenarios.
- Evaluation Dataset:
  - Preset evaluation dataset: Use a preset professional dataset for evaluation.
  - Custom evaluation dataset: Select the evaluation metrics (F1 score, accuracy, BLEU, and ROUGE) and upload a custom evaluation dataset.
- Storage location of evaluation results: Path for storing the model evaluation result.

Basic information
- Task Name: Enter the evaluation job name.
- Description: Enter the evaluation job description.
- After setting the parameters, click Create Now. The Evaluation Task > Automatic evaluation page is displayed.
- When the status is Completed, you can click Evaluation Report in the Operation column to view the report and details of the evaluation job on the Evaluation Report page.
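The external-service configuration above comes down to two pieces of information: a request body in one of the supported formats and a JsonPath expression that tells the platform where the generated text sits in the JSON response. The following Python sketch illustrates the idea with an OpenAI-format request body and the jsonpath-ng library; the API address, model name, and credential are hypothetical placeholders rather than values defined by this documentation.

```
# Illustrative sketch only: the API address, model name, and credential below
# are hypothetical placeholders, not values from ModelArts Studio.
import json
import urllib.request

from jsonpath_ng import parse  # pip install jsonpath-ng

API_ADDRESS = "https://example.com/v1/chat/completions"  # hypothetical external API address
API_KEY = "YOUR_API_KEY"                                  # hypothetical credential

# OpenAI-format request body (one of the supported request-body formats).
request_body = {
    "model": "external-llm",  # hypothetical model identifier
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.2,
}

req = urllib.request.Request(
    API_ADDRESS,
    data=json.dumps(request_body).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    response_body = json.load(resp)

# JsonPath expression that extracts the generated text from the JSON response.
# For an OpenAI-format response, the answer typically sits at this path.
answer_path = parse("$.choices[0].message.content")
matches = answer_path.find(response_body)
print(matches[0].value if matches else "<no match>")
```

A TGI-format request would instead carry an inputs field (for example, {"inputs": "...", "parameters": {...}}), and for a custom format the only requirement is that the JsonPath expression you register matches the structure of the response your service actually returns.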
Creating an LLM-based Automatic Evaluation Job for a Third-Party Model
To create an LLM-based automatic evaluation job for a third-party model, perform the following steps:
- Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 2 My Spaces
- In the navigation pane, choose Evaluation Center > Evaluation Task. Click Create automatic task in the upper right corner.
- On the Create automatic Evaluation task page, set parameters by referring to Table 2.
Table 2 Parameters for an LLM-based automatic evaluation task of a third-party model

Service selection
- Assessment Type: Select Large Language Models.
- Service source: Currently, only external services can be used to call APIs for evaluation. A maximum of 10 models can be evaluated at a time.
  External services: Access external models through APIs for evaluation. When you select External services, enter the API name, API address, request body, and response body of the external model.
  - The request body can be in the OpenAI, TGI, or custom format. The OpenAI format is a large model request format developed and standardized by OpenAI. The TGI format is a large model request format launched by the Hugging Face team.
  - The response body must be entered based on JsonPath syntax, which is used to extract the required data from the JSON fields of the response. For details about JsonPath, see https://github.com/json-path/JsonPath.

Evaluation Configurations
- Evaluation Rules: Select Based on large models. A more capable large model automatically scores the results generated by the evaluated models. This approach is suitable for open-ended or complex Q&A scenarios.
- Select mode:
  - Grading mode: The referee model automatically scores the model inference results based on the configured scoring criteria.
  - Comparison mode: The referee model compares the performance of two models on each question. The comparison result can be win, lose, or tie. In comparison mode, two services must be selected as the service source; by default, the first service is used as the benchmark model.
  A sketch of both modes is shown after this procedure.
- Evaluation Dataset: Select the dataset to be evaluated. Multi-turn Q&A scenarios support only LLM-based automatic evaluation; for them, select a multi-turn Q&A evaluation dataset.
- Storage location of evaluation results: Path for storing the model evaluation result.

Referee configuration
- Referee Model:
  - Deploying services: Select a model deployed on ModelArts Studio as the referee.
  - External services: Access an external model through its API. When you select External services, enter the API name, API address, request body, and response body of the external model. The request body can be in the OpenAI, TGI, or custom format, and the response body must be entered based on JsonPath syntax (see https://github.com/json-path/JsonPath).
- Scoring rules: Select a preset or custom scoring prompt template. To create a custom prompt template, click Add custom rules and then click newly built. In the displayed dialog box, set Prompt template name, avatar, Task description, Is contain question, Is contain reference answer, Score strategy, Evaluation metrics, and Notes, and click Save.

Basic information
- Task Name: Enter the evaluation job name.
- Description: Enter the evaluation job description.
- After setting the parameters, click Create Now. The Evaluation Task > Automatic evaluation page is displayed.
- When the status is Completed, you can click Evaluation Report in the Operation column to view the report and details of the evaluation job on the Evaluation Report page.
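To make the grading and comparison modes more concrete, the sketch below shows the general shape of an LLM-as-judge flow: a referee prompt is assembled from the question, the answer(s), and the scoring criteria, and the referee's reply is parsed into a score or a win/lose/tie verdict. The prompt templates and the call_referee() helper are hypothetical illustrations; the platform builds its own prompts from the scoring rules and prompt template you configure.

```
# Hypothetical sketch of LLM-as-judge scoring; not the platform's internal prompts.

GRADING_TEMPLATE = """You are a strict grader.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Score the model answer from 1 to 5 and reply with the number only."""

COMPARISON_TEMPLATE = """You are an impartial referee.
Question: {question}
Answer A (benchmark model): {answer_a}
Answer B (challenger model): {answer_b}
From model B's point of view, reply with exactly one word: win, lose, or tie."""


def call_referee(prompt: str) -> str:
    """Placeholder: send the prompt to the referee model's API and return its reply."""
    raise NotImplementedError("wire this to your referee model endpoint")


def grade(question: str, reference: str, candidate: str) -> int:
    """Grading mode: the referee scores one model's answer against the criteria."""
    reply = call_referee(GRADING_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate))
    return int(reply.strip())  # expects a bare "1".."5"


def compare(question: str, answer_a: str, answer_b: str) -> str:
    """Comparison mode: the referee returns win, lose, or tie for the challenger."""
    reply = call_referee(COMPARISON_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    return reply.strip().lower()  # "win", "lose", or "tie"
```

In comparison mode the first selected service acts as the benchmark by default, which is why the verdict here is read from the challenger's point of view.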
Creating a Manual Evaluation Job for a Third-Party Model
To create a manual evaluation job for a third-party model, perform the following steps:
- Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 3 My Spaces
- In the navigation pane, choose Evaluation Center > Evaluation Task. Click Create manual task in the upper right corner.
- On the Create manual Evaluation task page, set parameters by referring to Table 3.
Table 3 Parameters for creating a manual evaluation job for a third-party model

Service selection
- Assessment Type: Select Large Language Models.
- Service source: Currently, only external services can be used to call APIs for evaluation. A maximum of 10 models can be evaluated at a time.
  External services: Access external models through APIs for evaluation. When you select External services, enter the API name, API address, request body, and response body of the external model.
  - The request body can be in the OpenAI, TGI, or custom format. The OpenAI format is a large model request format developed and standardized by OpenAI. The TGI format is a large model request format launched by the Hugging Face team.
  - The response body must be entered based on JsonPath syntax, which is used to extract the required data from the JSON fields of the response. For details about JsonPath, see https://github.com/json-path/JsonPath.

Evaluation Configurations
- Evaluation Indicators: Customize the evaluation metrics and fill in the evaluation standards.
- Evaluation Dataset: Select the evaluation dataset.
- Storage location of evaluation results: Path for storing the model evaluation result.

Basic information
- Task Name: Enter the evaluation job name.
- Description: Enter the evaluation job description.
- After setting the parameters, click Create Now. The Evaluation Task > Manual evaluation page is displayed.
- When the status is To be evaluated, you can click Online Evaluation in the Operation column to go to the evaluation page.
- In the evaluation effect area, score the results as prompted. After all data is evaluated, click Submit.
- On the evaluation details page, enable Blind testing to hide the model names during evaluation.
- Click Doubt or Nullify to flag or invalidate a case. To cancel the operation, click Cancel Doubts or Cancel nullify.
- Click the Click to add a note area to add remarks.
- On the evaluation page, hold down the left mouse button to select the text to be marked, and click Mark to highlight it as key content.
Figure 4 Manual evaluation
- In the navigation pane, choose Evaluation Center > Evaluation task > Manual evaluation. Click Assessment report in the Operation column to view the model evaluation result.
After the evaluation is complete, go to the manual evaluation list page and click Manual Review to review the evaluation. After the review is complete, click Submit to submit the result.