Evaluating Text Datasets

Before you publish a text dataset, you can evaluate it in ModelArts Studio. Evaluation helps you find quality problems, confirm that the data meets your standards, and improve model performance.

If data evaluation is not required, skip this section and proceed with Publishing Text Datasets.

Creating Evaluation Standards for Text Datasets

ModelArts Studio provides preset evaluation standards for text datasets, covering multiple dimensions such as data accuracy, integrity, consistency, and format specifications. You can use these standards directly or create your own evaluation standards.

If you want to use the preset evaluation standards, skip this section and proceed with Creating a Text Dataset Evaluation Task.

To create evaluation standards for text datasets, perform the following steps:

Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 1 My Spaces
In the navigation pane, choose Data Engineering > Data Management > Data Evaluation. On the Manual Evaluation Standard tab page, the preset text dataset evaluation standard NLP Data Quality Standard V1.0 is displayed. You can click the name of the evaluation standard to view its evaluation items.
Figure 2 Preset evaluation standards for text datasets
On the Manual Evaluation Standard page, click Create custom standards, select Preset Standard, and set the evaluation standard name and description.
Edit evaluation items.
You can delete evaluation items or create custom evaluation items as required. When creating a custom evaluation item, ensure that the evaluation type, evaluation item, and evaluation item description are clear and unambiguous.
Click Complete Creation.
After evaluation standards are created, you can view, edit, and delete them on the Manual Evaluation Standard page.
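
The requirement that each custom evaluation item has a clear evaluation type, evaluation item, and description can be checked offline before you enter the items on the page. The following is a minimal sketch, not a ModelArts Studio API: the class EvaluationItem, the function validate_items, and the sample values are hypothetical names chosen for illustration, and the duplicate check is an assumption about what "unambiguous" means in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationItem:
    # Hypothetical container; the fields mirror the three attributes
    # named in the steps above, not any ModelArts Studio data model.
    evaluation_type: str
    name: str
    description: str


def validate_items(items):
    """Return a list of problems found; an empty list means no problems."""
    problems = []
    seen = set()
    for index, item in enumerate(items):
        # Every attribute must contain non-whitespace text.
        for field_name in ("evaluation_type", "name", "description"):
            if not getattr(item, field_name).strip():
                problems.append(f"item {index}: {field_name} is empty")
        # Assumption: two items with the same type and name are ambiguous.
        key = (item.evaluation_type.strip().lower(), item.name.strip().lower())
        if key in seen:
            problems.append(f"item {index}: duplicate of an earlier item")
        seen.add(key)
    return problems


if __name__ == "__main__":
    draft = [
        EvaluationItem("Accuracy", "Factual error", "The text states something false."),
        EvaluationItem("accuracy", "Factual error ", ""),
    ]
    for problem in validate_items(draft):
        print(problem)
```

Running the checker on a draft list before creating the standard makes it easier to spot empty descriptions or overlapping items that evaluators might interpret differently.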

Creating a Text Dataset Evaluation Task

Only processed datasets can be evaluated.

Before creating a text dataset evaluation task, generate a processed dataset by referring to Processing Text Datasets.

To create a text dataset evaluation task, perform the following steps:

Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
Figure 3 My Spaces
In the navigation pane, choose Data Engineering > Data Management > Data Evaluation. On the displayed page, click Create Evaluation Task in the upper right corner.
Select the dataset to be evaluated and set the number of samples.
Click Next Step and select the evaluation standard. Click Next Step and set the evaluator. Click Next Step and enter the task name.
Click Complete Creation to go to the Data Evaluation page. After the evaluation task is created, its status is Created.
Click Evaluate in the Operation column.
On the evaluation page, label the problems of the current data by referring to the evaluation items. If the data meets the requirements, click Passed. If the data does not meet the requirements, click Not passed.
For a text dataset, you can right-click the question content and choose the corresponding problem type from the shortcut menu as shown in Figure 4.

Figure 4 Marking dataset problems
After all data is evaluated, check that the evaluation progress is 100% on the Manual Evaluation page.
Click Report in the Operation column to view the dataset quality evaluation report.
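
The workflow above (sample a number of records, mark each as Passed or Not passed, tag problem types, then read a report) can be illustrated with a small offline sketch. This is not how ModelArts Studio samples data or builds its report; the functions draw_sample and summarize, the seed parameter, and the report fields are assumptions made for illustration only.

```python
import random
from collections import Counter


def draw_sample(records, sample_size, seed=0):
    """Pick up to sample_size records at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(records, min(sample_size, len(records)))


def summarize(verdicts):
    """Aggregate manual verdicts.

    verdicts is a list of (passed, problem_types) pairs, where passed is a
    bool and problem_types is a list of labels for records that did not pass.
    """
    total = len(verdicts)
    passed = sum(1 for ok, _ in verdicts if ok)
    # Count problem labels only on records marked as not passed.
    problems = Counter(p for ok, types in verdicts if not ok for p in types)
    return {
        "evaluated": total,
        "passed": passed,
        "pass_rate": passed / total if total else 0.0,
        "problem_counts": dict(problems),
    }


if __name__ == "__main__":
    sample = draw_sample([f"record-{i}" for i in range(100)], 4)
    # Hypothetical verdicts an evaluator might assign to the four samples.
    verdicts = [
        (True, []),
        (False, ["Format error"]),
        (False, ["Format error", "Incomplete"]),
        (True, []),
    ]
    print(sample)
    print(summarize(verdicts))
```

A summary of this kind shows why the sample size matters: with only a few samples, a single Not passed verdict changes the pass rate considerably.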