Dataset Processing Scenarios

Introduction to Data Processing

ModelArts Studio provides a data processing function that covers data processing, data synthesis, data labeling, and data combination. As the core of data engineering, it ensures that the original data meets service requirements and model training standards.

Data processing
Use dedicated processing operators to preprocess data so that it meets model training standards and service requirements. Each dataset type has operators designed to remove noise and redundant information, which enhances data quality. You can also create custom operators to process data flexibly for specific service scenarios and model requirements. This further streamlines data processing and improves the accuracy and robustness of models.
Data synthesis
Use a preset or custom data instruction to process the original data and generate new data over a specified number of epochs. This extends the dataset to some extent, increasing data diversity and improving the generalization capability of the trained model.
Data labeling
Add accurate labels to unlabeled datasets to ensure high-quality data required for model training. The platform supports both manual annotation and AI pre-annotation. You can choose an appropriate annotation method based on your needs. The quality of data labeling directly impacts the training effectiveness and accuracy of the model.
Data combination
Data combination merges multiple datasets into a single processed dataset based on a specific ratio. A proper ratio ensures the diversity, balance, and representativeness of the resulting dataset and avoids issues caused by uneven data distribution.
Using large models for data processing, synthesis, and labeling
Use large models to assist in data processing, synthesis, and labeling, making these tasks more intelligent. The platform allows you to register models deployed on ModelArts and third-party services. You can select a suitable large model as required.
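The ratio-based data combination described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the platform's implementation: the function `combine_by_ratio`, the dataset names, and the choice to oversample with replacement when a dataset is too small are all assumptions made for this example.

```python
import random

def combine_by_ratio(datasets, ratios, total, seed=0):
    """Draw about `total` samples from several datasets according to `ratios`.

    datasets: dict mapping a (hypothetical) dataset name to a list of samples.
    ratios:   dict mapping the same names to non-negative weights.
    Per-dataset quotas are rounded, so the result can differ slightly
    from `total` when the weights do not divide it evenly.
    """
    rng = random.Random(seed)
    weight_sum = sum(ratios.values())
    combined = []
    for name, samples in datasets.items():
        # Target number of samples this dataset contributes.
        quota = round(total * ratios[name] / weight_sum)
        if quota <= len(samples):
            picked = rng.sample(samples, quota)      # without replacement
        else:
            picked = rng.choices(samples, k=quota)   # oversample (assumption)
        combined.extend(picked)
    rng.shuffle(combined)  # avoid long runs from a single source
    return combined

# Hypothetical inputs: a large general dataset and a small domain dataset.
general = [f"general-{i}" for i in range(100)]
domain = [f"domain-{i}" for i in range(20)]
mixed = combine_by_ratio(
    {"general": general, "domain": domain},
    {"general": 3, "domain": 1},
    total=40,
)
```

With a 3:1 ratio and a target of 40 samples, the sketch draws 30 general samples and 10 domain samples, so neither source dominates beyond the ratio you chose.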

Through data processing, the platform removes noisy data and standardizes data formats, which improves the overall quality of datasets. Processing is tailored to data types and service scenarios to provide high-quality input for model training and improve model performance.
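As a rough illustration of what noise removal and format standardization can look like for text data, the following Python sketch normalizes records, drops very short ones, and removes exact duplicates. It does not use any ModelArts Studio API. The function names `normalize_text` and `clean_records` and the `min_chars` threshold are hypothetical and chosen only for this example.

```python
import re
import unicodedata

def normalize_text(text):
    """Standardize format: Unicode NFKC, strip HTML-like tags, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"<[^>]+>", " ", text)      # remove tag-like residue
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

def clean_records(records, min_chars=5):
    """Return normalized records without too-short entries or exact duplicates."""
    seen = set()
    cleaned = []
    for raw in records:
        text = normalize_text(raw)
        if len(text) < min_chars:   # treat very short records as noise (assumption)
            continue
        if text in seen:            # redundant information: exact duplicate
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

raw = ["<p>Hello   world</p>", "Hello world", "ok", "  Data\tprocessing  matters "]
result = clean_records(raw)
print(result)  # ['Hello world', 'Data processing matters']
```

Normalizing before deduplicating matters here: the first two records differ only in markup and spacing, so they are recognized as duplicates only after normalization.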

Significance of Data Processing

Data processing plays an important role in large model development. It provides the following benefits:

Improved data quality
Raw data usually contains noise, missing values, or inconsistencies, which directly degrade model training results. Data processing removes invalid information and fills in missing data to ensure accuracy and consistency, thereby improving data quality and providing reliable input for model training.
Improved diversity and generalization capabilities of datasets
When the data volume is insufficient or samples are unbalanced, data synthesis can generate new data and expand the scale and diversity of datasets. By increasing data diversity, the generalization capabilities of the model in various scenarios can be improved, and the adaptability of the model to unknown data can be enhanced.
Enhanced effectiveness of model training
High-quality data is the basis for model training. Data processing applies targeted optimization based on data types and business requirements so that data better complies with training standards, which improves training efficiency and model accuracy.
Ensured alignment with business requirements
Different business scenarios and model applications place different requirements on data. Data processing customizes data for specific business requirements so that it fits the application scenario. This improves the alignment between data and the model, as well as the accuracy of both the model and business decision-making.
Higher efficiency
With the automated data processing function provided by the platform, you can efficiently preprocess a large amount of data. This reduces manual intervention, improves data processing consistency and efficiency, and ensures smooth operation of the entire data engineering process.
Ensured data quality and adaptation
Data combination based on a specified ratio ensures that datasets meet the high requirements of large model training. These requirements cover not only data scale, but also data quality, balance, and representativeness. Meeting them improves the accuracy and robustness of the model.
Improved data diversity and representativeness
You can combine multiple datasets based on a specific ratio to ensure the diversity and representativeness of datasets in different task scenarios. This avoids excessive bias towards a certain type of data, ensures that the model can learn multiple features, and improves the adaptability to various situations.

In general, the data processing function not only improves data handling efficiency, but also supports efficient model training through better data quality and targeted processing. With data processing, you can quickly build high-quality datasets and promote the successful development of large models.

Dataset Types That Support Data Processing

Table 1 lists the dataset types that support data processing.

**Table 1** Dataset types that support data processing
Dataset Modality	Dataset Type	Data Processing	Data Synthesis	Data Labeling	Data Combination
Text	Document	√	-	-	-
	Web page	√	-	-	-
	Pre-trained text	√	√	-	√
	Single-turn Q&A	√	√	√	√
	Single-turn Q&A (with a system persona)	√	√	√	√
	Multi-turn Q&A	√	-	√	√
	Multi-turn Q&A (with a system persona)	√	-	√	√
	Q&A sorting	√	-	√	√
	Direct Preference Optimization (DPO)	-	-	-	√
	DPO (with a system persona)	-	-	-	√
Other	Other	√ (Only custom operators can be used for data processing.)	-	-	-
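If you need to check these capabilities in a script, Table 1 can be transcribed into a simple lookup. The following Python sketch is only a convenience representation of the table. The names `SUPPORT` and `supports` and the operation keys (`processing`, `synthesis`, `labeling`, `combination`) are assumptions for this example and are not identifiers exposed by the platform.

```python
# Operations shared by several dataset types in Table 1.
ALL_OPS = {"processing", "synthesis", "labeling", "combination"}
NO_SYNTHESIS = {"processing", "labeling", "combination"}

# Transcription of Table 1: dataset type -> supported operations.
SUPPORT = {
    "Document": {"processing"},
    "Web page": {"processing"},
    "Pre-trained text": {"processing", "synthesis", "combination"},
    "Single-turn Q&A": ALL_OPS,
    "Single-turn Q&A (with a system persona)": ALL_OPS,
    "Multi-turn Q&A": NO_SYNTHESIS,
    "Multi-turn Q&A (with a system persona)": NO_SYNTHESIS,
    "Q&A sorting": NO_SYNTHESIS,
    "Direct Preference Optimization (DPO)": {"combination"},
    "DPO (with a system persona)": {"combination"},
    "Other": {"processing"},  # only custom operators can be used
}

def supports(dataset_type, operation):
    """Return True if Table 1 marks `operation` as supported for `dataset_type`."""
    return operation in SUPPORT.get(dataset_type, set())

print(supports("Multi-turn Q&A", "synthesis"))      # False
print(supports("Pre-trained text", "combination"))  # True
```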