Creating a Third-Party Model Evaluation Job
Before creating a third-party model evaluation job, ensure that the operations in Creating a Third-Party Model Evaluation Dataset have been completed.
Creating a Rule-based Automatic Evaluation Job for a Third-Party Model
To create a rule-based automatic evaluation job for a third-party model, perform the following steps:
- Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 1 My Spaces
- In the navigation pane, choose Evaluation Center > Evaluation Task. Click Create automatic task in the upper right corner.
- On the Create automatic Evaluation task page, set parameters by referring to Table 1.
Table 1 Parameters for a rule-based automatic evaluation task of a third-party model

Service selection
- Assessment Type: Select Large Language Models.
- Service source: Currently, only external services can be used to call APIs for evaluation. A maximum of 10 models can be evaluated at a time.
  External services: Access external models through APIs for evaluation. When you select External services, enter the API name, API address, request body, and response body of the external model.
  - The request body can be in the OpenAI, TGI, or custom format. The OpenAI format is a large model request format developed and standardized by OpenAI. The TGI format is a large model request format launched by the Hugging Face team.
  - The response body must be entered based on JsonPath syntax, which is used to extract the required data from the JSON fields of the response. For details about JsonPath, see https://github.com/json-path/JsonPath.
  A sample OpenAI-format request body and JsonPath expression are sketched after this procedure.

Evaluation Configurations
- Evaluation Rules: Select Rule-based. Scoring is performed automatically based on rules, that is, by comparing the model's predictions with the labeled data using similarity or accuracy. This approach is suitable for standard multiple-choice questions or simple Q&A scenarios.
- Evaluation Dataset:
  - Preset evaluation dataset: Use a preset professional dataset for evaluation.
  - Custom evaluation dataset: Select the evaluation metrics (F1 score, accuracy, BLEU, and ROUGE) and upload a custom evaluation dataset.
- Storage location of evaluation results: Path for storing the model evaluation result.

Basic information
- Task Name: Enter the evaluation job name.
- Description: Enter the evaluation job description.
- After setting the parameters, click Create Now. The Evaluation Task > Automatic evaluation page is displayed.
- When the status is Completed, you can click Evaluation Report in the Operation column to view the report and details of the evaluation job on the Evaluation Report page.
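The external-service configuration above comes down to two pieces of information: a request body in one of the supported formats and a JsonPath expression that tells the platform where the generated text sits in the JSON response. The following Python sketch illustrates the idea with an OpenAI-format request body and the jsonpath-ng library; the API address, model name, and credential are hypothetical placeholders rather than values defined by this documentation.

```
# Illustrative sketch only: the API address, model name, and credential below
# are hypothetical placeholders, not values from ModelArts Studio.
import json
import urllib.request

from jsonpath_ng import parse  # pip install jsonpath-ng

API_ADDRESS = "https://example.com/v1/chat/completions"  # hypothetical external API address
API_KEY = "YOUR_API_KEY"                                  # hypothetical credential

# OpenAI-format request body (one of the supported request-body formats).
request_body = {
    "model": "external-llm",  # hypothetical model identifier
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.2,
}

req = urllib.request.Request(
    API_ADDRESS,
    data=json.dumps(request_body).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    },
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    response_body = json.load(resp)

# JsonPath expression that extracts the generated text from the JSON response.
# For an OpenAI-format response, the answer typically sits at this path.
answer_path = parse("$.choices[0].message.content")
matches = answer_path.find(response_body)
print(matches[0].value if matches else "<no match>")
```

A TGI-format request would instead carry an inputs field (for example, {"inputs": "...", "parameters": {...}}), and for a custom format the only requirement is that the JsonPath expression you register matches the structure of the response your service actually returns.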
Creating an LLM-based Automatic Evaluation Job for a Third-Party Model
To create an LLM-based automatic evaluation job for a third-party model, perform the following steps:
- Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 2 My Spaces
- In the navigation pane, choose Evaluation Center > Evaluation Task. Click Create automatic task in the upper right corner.
- On the Create automatic Evaluation task page, set parameters by referring to Table 2.
Table 2 Parameters for an LLM-based automatic evaluation task of a third-party model

Service selection
- Assessment Type: Select Large Language Models.
- Service source: Currently, only external services can be used to call APIs for evaluation. A maximum of 10 models can be evaluated at a time.
  External services: Access external models through APIs for evaluation. When you select External services, enter the API name, API address, request body, and response body of the external model.
  - The request body can be in the OpenAI, TGI, or custom format. The OpenAI format is a large model request format developed and standardized by OpenAI. The TGI format is a large model request format launched by the Hugging Face team.
  - The response body must be entered based on JsonPath syntax, which is used to extract the required data from the JSON fields of the response. For details about JsonPath, see https://github.com/json-path/JsonPath.

Evaluation Configurations
- Evaluation Rules: Select Based on large models. A more capable large model automatically scores the results generated by the evaluated models. This approach is suitable for open-ended or complex Q&A scenarios.
- Select mode:
  - Grading mode: The referee model automatically scores the model inference results based on the configured scoring criteria.
  - Comparison mode: The referee model compares the performance of two models on each question. The comparison result can be win, lose, or tie. In comparison mode, two services must be selected as the service source; by default, the first service is used as the benchmark model.
  A sketch of both modes is shown after this procedure.
- Evaluation Dataset: Select the dataset to be evaluated. Multi-turn Q&A scenarios support only LLM-based automatic evaluation; for them, select a multi-turn Q&A evaluation dataset.
- Storage location of evaluation results: Path for storing the model evaluation result.

Referee configuration
- Referee Model:
  - Deploying services: Select a model deployed on ModelArts Studio as the referee.
  - External services: Access an external model through its API. When you select External services, enter the API name, API address, request body, and response body of the external model. The request body can be in the OpenAI, TGI, or custom format, and the response body must be entered based on JsonPath syntax (see https://github.com/json-path/JsonPath).
- Scoring rules: Select a preset or custom scoring prompt template. To create a custom prompt template, click Add custom rules and then click newly built. In the displayed dialog box, set Prompt template name, avatar, Task description, Is contain question, Is contain reference answer, Score strategy, Evaluation metrics, and Notes, and click Save.

Basic information
- Task Name: Enter the evaluation job name.
- Description: Enter the evaluation job description.
- After setting the parameters, click Create Now. The Evaluation Task > Automatic evaluation page is displayed.
- When the status is Completed, you can click Evaluation Report in the Operation column to view the report and details of the evaluation job on the Evaluation Report page.
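To make the grading and comparison modes more concrete, the sketch below shows the general shape of an LLM-as-judge flow: a referee prompt is assembled from the question, the answer(s), and the scoring criteria, and the referee's reply is parsed into a score or a win/lose/tie verdict. The prompt templates and the call_referee() helper are hypothetical illustrations; the platform builds its own prompts from the scoring rules and prompt template you configure.

```
# Hypothetical sketch of LLM-as-judge scoring; not the platform's internal prompts.

GRADING_TEMPLATE = """You are a strict grader.
Question: {question}
Reference answer: {reference}
Model answer: {candidate}
Score the model answer from 1 to 5 and reply with the number only."""

COMPARISON_TEMPLATE = """You are an impartial referee.
Question: {question}
Answer A (benchmark model): {answer_a}
Answer B (challenger model): {answer_b}
From model B's point of view, reply with exactly one word: win, lose, or tie."""


def call_referee(prompt: str) -> str:
    """Placeholder: send the prompt to the referee model's API and return its reply."""
    raise NotImplementedError("wire this to your referee model endpoint")


def grade(question: str, reference: str, candidate: str) -> int:
    """Grading mode: the referee scores one model's answer against the criteria."""
    reply = call_referee(GRADING_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate))
    return int(reply.strip())  # expects a bare "1".."5"


def compare(question: str, answer_a: str, answer_b: str) -> str:
    """Comparison mode: the referee returns win, lose, or tie for the challenger."""
    reply = call_referee(COMPARISON_TEMPLATE.format(
        question=question, answer_a=answer_a, answer_b=answer_b))
    return reply.strip().lower()  # "win", "lose", or "tie"
```

In comparison mode the first selected service acts as the benchmark by default, which is why the verdict here is read from the challenger's point of view.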
Creating a Manual Evaluation Job for a Third-Party Model
To create a manual evaluation job for a third-party model, perform the following steps:
- Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 3 My Spaces
- In the navigation pane, choose Evaluation Center > Evaluation Task. Click Create manual task in the upper right corner.
- On the Create manual Evaluation task page, set parameters by referring to Table 3.
Table 3 Parameters for creating a manual evaluation job for a third-party model

Service selection
- Assessment Type: Select Large Language Models.
- Service source: Currently, only external services can be used to call APIs for evaluation. A maximum of 10 models can be evaluated at a time.
  External services: Access external models through APIs for evaluation. When you select External services, enter the API name, API address, request body, and response body of the external model.
  - The request body can be in the OpenAI, TGI, or custom format. The OpenAI format is a large model request format developed and standardized by OpenAI. The TGI format is a large model request format launched by the Hugging Face team.
  - The response body must be entered based on JsonPath syntax, which is used to extract the required data from the JSON fields of the response. For details about JsonPath, see https://github.com/json-path/JsonPath.

Evaluation Configurations
- Evaluation Indicators: Customize the evaluation metrics and fill in the evaluation standards.
- Evaluation Dataset: Select the evaluation dataset.
- Storage location of evaluation results: Path for storing the model evaluation result.

Basic information
- Task Name: Enter the evaluation job name.
- Description: Enter the evaluation job description.
- After setting the parameters, click Create Now. The Evaluation Task > Manual evaluation page is displayed.
- When the status is To be evaluated, you can click Online Evaluation in the Operation column to go to the evaluation page.
- In the evaluation effect area, score the results as prompted. After all data is evaluated, click Submit.
- On the evaluation details page, enable Blind testing to hide the model names during evaluation.
- Click Doubt or Nullify to flag or invalidate a case. To cancel the operation, click Cancel Doubts or Cancel nullify.
- Click the Click to add a note area to add remarks.
- On the evaluation page, hold down the left mouse button to select the text to be marked, and click Mark to highlight it as key content.
Figure 4 Manual evaluation
- In the navigation pane, choose Evaluation Center > Evaluation task > Manual evaluation. Click Assessment report in the Operation column to view the model evaluation result.
After the evaluation is complete, go to the manual evaluation list page and click Manual Review to review the evaluation. After the review is complete, click Submit to submit the result.