Updated on 2025-07-28 GMT+08:00

Dataset Processing Scenarios

Introduction to Data Processing

ModelArts Studio provides a data processing function that covers data processing, data synthesis, data labeling, and data combination. As the core of data engineering, this function ensures that the original data meets service requirements and model training standards.

  • Data processing

    Use dedicated processing operators to preprocess data so that it meets model training standards and service requirements. Each dataset type has operators designed to remove noise and redundant information, which improves data quality. You can also create custom operators to process data flexibly for specific service scenarios and model requirements. This further streamlines the processing workflow and improves the accuracy and robustness of models.

  • Data synthesis

    Process the original data using either a preset or a custom data instruction, and generate new data based on a specified number of epochs. This extends the dataset to some extent, increasing its diversity and improving the generalization capability of the trained model.

  • Data labeling

    Add accurate labels to unlabeled datasets to provide the high-quality data that model training requires. The platform supports both manual annotation and AI pre-annotation, so you can choose the annotation method that fits your needs. Labeling quality directly affects the training effectiveness and accuracy of the model.

  • Data combination

    Data combination merges multiple datasets into a single processed dataset based on a specified ratio. A proper ratio ensures the diversity, balance, and representativeness of the dataset and avoids issues caused by uneven data distribution.

Through data processing, the platform can effectively remove noisy data and standardize data formats, which improves the overall quality of datasets. Processing is optimized for each data type and service scenario to provide high-quality input for model training and improve model performance.
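The noise removal and format standardization described above can be illustrated with a minimal sketch. It is not a ModelArts Studio operator or API. The function names `clean_text` and `process_records`, the `min_chars` threshold, and the specific cleaning rules (Unicode normalization, tag stripping, whitespace collapsing, exact-duplicate removal) are assumptions chosen for illustration only.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical cleaning step: normalize Unicode, strip tag-like noise, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"<[^>]+>", " ", text)  # replace HTML-like tags with a space
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()


def process_records(records, min_chars=5):
    """Clean each record, drop very short records, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for raw in records:
        text = clean_text(raw)
        if len(text) < min_chars:
            continue  # too short to be useful (assumed threshold)
        if text in seen:
            continue  # redundant record
        seen.add(text)
        cleaned.append(text)
    return cleaned


sample = ["<p>Hello   world</p>", "Hello world", "ok", "  Second\tsample text "]
print(process_records(sample))
```

In this sketch the first two records collapse into one after cleaning, and the record "ok" is dropped as too short. A real custom operator would follow the platform's own operator interface, which this page does not describe.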

Significance of Data Processing

Data processing plays an important role in large model development. It provides the following benefits:

  • Improved data quality

    Raw data usually contains noise, missing values, or inconsistencies, which directly affect model training. Data processing removes invalid information and fills in missing data to ensure accuracy and consistency, thereby improving data quality and providing reliable input for model training.

  • Improved diversity and generalization capabilities of datasets

    When the data volume is insufficient or samples are unbalanced, data synthesis can generate new data to expand the scale and diversity of datasets. Greater data diversity improves the generalization capabilities of the model across scenarios and its adaptability to unseen data.

  • Enhanced effectiveness of model training

    High-quality data is the basis for model training. Data processing applies targeted optimization based on data types and business requirements so that data better complies with training standards, which improves training efficiency and model accuracy.

  • Ensured alignment with business requirements

    Different business scenarios and model applications place different requirements on data. Data processing customizes data for specific business requirements so that it fits the application scenario. This improves the alignment between data and the model, and in turn the accuracy of both the model and business decision-making.

  • Higher efficiency

    With the automated data processing function provided by the platform, you can efficiently preprocess a large amount of data. This reduces manual intervention, improves data processing consistency and efficiency, and ensures smooth operation of the entire data engineering process.

  • Ensured data quality and adaptation

    Data combination based on a specified ratio ensures that datasets meet the high requirements of large model training. These requirements cover not only data scale, but also data quality, balance, and representativeness. Ensuring data balance and diversity in this way improves the accuracy and robustness of the model.

  • Improved data diversity and representativeness

    You can combine multiple datasets based on a specific ratio to ensure the diversity and representativeness of datasets in different task scenarios. This avoids excessive bias towards a certain type of data, ensures that the model can learn multiple features, and improves its adaptability to various situations.

In general, this function not only improves data handling efficiency, but also supports efficient model training through better data quality and targeted processing. With data processing, you can quickly build high-quality datasets and promote the successful development of large models.
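Ratio-based data combination, as described in the benefits above, can be sketched as follows. This is only an illustration under stated assumptions, not the platform's implementation. The function `combine_by_ratio`, its parameters, and the sampling strategy (random sampling without replacement, followed by a shuffle) are hypothetical.

```python
import random


def combine_by_ratio(datasets, ratios, total, seed=0):
    """Hypothetical helper: sample from each named dataset so the result follows the given ratios."""
    if set(datasets) != set(ratios):
        raise ValueError("datasets and ratios must have the same keys")
    weight_sum = sum(ratios.values())
    if weight_sum <= 0:
        raise ValueError("ratios must sum to a positive value")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    combined = []
    for name, samples in datasets.items():
        # Rounded share of the requested total; rounding may shift the final size slightly.
        quota = round(total * ratios[name] / weight_sum)
        # Sampling is without replacement, so the quota is capped by the dataset size.
        quota = min(quota, len(samples))
        combined.extend(rng.sample(samples, quota))
    rng.shuffle(combined)  # interleave the sources
    return combined


datasets = {
    "qa": [f"qa-{i}" for i in range(100)],
    "docs": [f"doc-{i}" for i in range(50)],
}
mixed = combine_by_ratio(datasets, {"qa": 3, "docs": 1}, total=40)
print(len(mixed), sum(s.startswith("qa-") for s in mixed))
```

With a 3:1 ratio and a target of 40 samples, the sketch draws 30 items from the first dataset and 10 from the second. Capping each quota at the dataset size means a small source can leave the result below the requested total. Whether to oversample such a source instead is a design choice this page does not specify.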

Dataset Types That Support Data Processing

Table 1 lists the dataset types that support data processing.

Table 1 Dataset types that support data processing

Dataset Modality

Dataset Type

Data Processing

Data Synthesis

Data Labeling

Data Combination

Text

Documents

-

-

Web pages

-

-

Pre-trained text

-

Single-turn Q&A

Single-turn Q&A (with a system persona)

Multi-turn Q&A

-

Multi-turn Q&A (with a system persona)

-

Q&A ranking

-

Direct Preference Optimization (DPO)

-

-

-

DPO (with a system persona)

-

-

-

Image

Image

-

Image + Caption

-

Image + QA Pair

-

Object detection

-

-

-

-

Image classification

-

-

-

-

Anomaly detection

-

-

-

-

Semantic segmentation

-

-

-

-

Pose estimation

-

-

-

-

Instance segmentation

-

-

-

-

Change detection

-

-

-

-

Video

Videos

-

-

Video + Annotation

-

-

-

-

Video content comprehension

-

-

Event detection

-

-

-

-

Video classification

-

-

-

-

Audio

Audio Only

-

-

Audio + Annotation

-

-

-

-

Weather

Weather Data

-

-

-

Prediction

Time series (classification)

-

-

-

Time series (regression)

-

-

-

-

Structured (classification)

-

-

-

Structured (regression)

-

-

-

-

Other

Other

√ (Only custom operators can be used for data processing.)

-

-

-