Updated on 2025-07-28 GMT+08:00

Process of Using Data Engineering

High-quality data is the foundation for the continuous iteration and optimization of large models. Data quality directly determines a model's performance, generalization capability, and adaptability to different application scenarios. Only through systematic data preparation and processing can valuable insights be extracted to better support model training. Therefore, data acquisition, cleaning, labeling, evaluation, and publishing are indispensable steps in data development.

For details about the data engineering process, see Figure 1 and Table 1.

Figure 1 Dataset construction flowchart
Table 1 Dataset construction process

| Procedure | Step | Description |
| --- | --- | --- |
| Importing data to the Pangu platform | Creating an import task | Import data stored in OBS or on your local machine into the platform for centralized management, facilitating subsequent processing or publishing. |
| Processing datasets | Processing datasets | Use dedicated processing operators to preprocess data so that it meets model training standards and service requirements. Each dataset type uses operators specially designed to remove noise and redundant information, enhancing data quality. |
| Processing datasets | Generating a synthetic dataset | Process the original data with a preset or custom data instruction and generate new data for a specified number of epochs. This extends the dataset and enhances the diversity and generalization capability of the trained model. |
| Processing datasets | Labeling datasets | Add accurate labels to unlabeled datasets to ensure the high-quality data required for model training. The platform supports both manual annotation and AI pre-annotation; choose the method that fits your needs. Labeling quality directly affects the training effectiveness and accuracy of the model. |
| Processing datasets | Combining datasets based on a specific ratio | Combine multiple datasets based on a specific ratio to generate a processed dataset. A proper ratio ensures the diversity, balance, and representativeness of the data and avoids issues caused by uneven data distribution. |
| Publishing datasets | Evaluating datasets | The platform offers predefined evaluation standards for multiple data types. Choose a predefined standard or customize your own to precisely improve data quality, ensure data meets high standards, and enhance model performance. |
| Publishing datasets | Publishing datasets | Publish a dataset in a specific format as a published dataset for subsequent model training. Datasets can be published in standard format (the platform default; such datasets can be published as assets but cannot be used to develop Pangu models) or Pangu format (required when training a Pangu model; currently only text and image datasets can be published in Pangu format). |
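To make the noise- and redundancy-removal step concrete, the following is a minimal sketch of a text-cleaning pass in Python. It is illustrative only: the platform's actual processing operators are configurable and type-specific, and the rules here (whitespace normalization, empty-record removal, exact deduplication) are assumptions standing in for them.

```python
import re

def clean_text_records(records):
    """Minimal text-cleaning pipeline: normalize whitespace,
    drop empty records, and remove exact duplicates.
    Illustrative only -- not the platform's operator implementation."""
    seen = set()
    cleaned = []
    for text in records:
        text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
        if not text:                              # drop empty records
            continue
        if text in seen:                          # drop exact duplicates
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

out = clean_text_records(["  hello   world ", "hello world", "", "new  sample"])
# out == ["hello world", "new sample"]
```

Real cleaning operators typically add fuzzy deduplication, language filtering, and PII removal on top of rules like these.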
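The synthetic-data step described above (apply a data instruction to the original data for a specified number of epochs) can be sketched as a simple loop. The `generate` callable below is a hypothetical stand-in for the model call the platform performs; only the loop structure reflects the documented behavior.

```python
def synthesize(records, instruction, epochs, generate):
    """Apply a data instruction to each original record for `epochs`
    passes, collecting newly generated samples. `generate` is any
    callable (instruction, record, epoch) -> new sample; it stands in
    for the platform's model-backed generation, which is not public."""
    synthetic = []
    for epoch in range(epochs):
        for record in records:
            synthetic.append(generate(instruction, record, epoch))
    return synthetic

# Toy generator: tags each record per epoch instead of calling a model.
new_data = synthesize(["q1", "q2"], "rewrite as a question", 2,
                      lambda ins, rec, ep: f"{rec}#v{ep}")
# 2 records x 2 epochs -> 4 synthetic samples
```

Each additional epoch revisits the same originals, so diversity comes from the instruction and the generator's variability, not from new source data.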
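Ratio-based dataset combination, as described in the table, amounts to weighted sampling from each source. The sketch below assumes in-memory lists and relative weights; the platform's combination step operates on managed datasets, so names and behavior here are illustrative assumptions.

```python
import random

def combine_datasets(datasets, ratios, total, seed=42):
    """Sample from each dataset in proportion to its ratio and shuffle
    the result. `ratios` are relative weights normalized internally.
    Illustrative sketch, not the platform's combination operator."""
    rng = random.Random(seed)
    weight_sum = sum(ratios)
    combined = []
    for data, ratio in zip(datasets, ratios):
        n = round(total * ratio / weight_sum)
        if n <= len(data):
            combined.extend(rng.sample(data, n))   # without replacement
        else:
            combined.extend(rng.choices(data, k=n))  # source too small: with replacement
    rng.shuffle(combined)  # avoid ordering bias from concatenation
    return combined

# A 3:1 mix of two sources into 40 samples -> 30 "a" and 10 "b".
mix = combine_datasets([["a"] * 100, ["b"] * 100], ratios=[3, 1], total=40)
```

Keeping the ratio explicit makes it easy to rebalance underrepresented sources without touching the datasets themselves.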