Updated on 2025-07-28 GMT+08:00

Introduction to Data Engineering

What Is Data Engineering?

Data engineering is a one-stop data processing and management feature provided by ModelArts Studio Large Model Development Platform. It covers data acquisition, processing, and publishing so that data can be used efficiently and accurately for large model training. By improving data quality and processing efficiency, it lays a solid data foundation for large model development.

Data engineering provides the following functions:

  • Data acquisition: Data acquisition is the first step of data engineering. Data from different sources and in different formats can be imported to the platform to generate an original dataset.
    • Data can be imported through OBS.
    • The following data types are supported: Text, Image, Video, Audio, Weather, and Other.
    • Custom format: You can flexibly upload data in custom formats based on service requirements.

    With these functions, you can easily import a large amount of data to the platform to prepare for subsequent operations such as data processing and model training.

  • Data processing: The platform provides data processing, data synthesis, data labeling, and data combination to ensure that raw data meets various service requirements and model training standards. A processed dataset can be generated.
    • Data processing: Data processing aims to preprocess data by using dataset processing operators. Dedicated processing operators are designed for different types of datasets to ensure that data meets model training standards and service requirements.
    • Data synthesis: Data synthesis processes the original data using either a preset or custom data instruction and generates new data based on a specified number of epochs.
    • Data labeling: Data labeling aims to add accurate labels to unlabeled datasets. The quality of labeled data directly affects the training effect and accuracy of models. The platform supports manual labeling and AI pre-labeling for different datasets.

      Image captions and video captions support AI pre-labeling.

    • Data combination: Data combination combines multiple datasets into a processed dataset based on a specific ratio. A proper ratio ensures the diversity, balance, and representativeness of datasets.

    Through data processing, the platform can effectively remove noisy data and standardize data formats, helping improve the overall quality of datasets.

  • Data publishing: The platform allows you to publish datasets in different modalities and formats, generating published datasets.
    • Data publishing refers to publishing a dataset in a specific format as a published dataset for subsequent model training operations.

      The following dataset formats are supported: standard format and Pangu format (applicable to Pangu model training). Currently, only text and image datasets can be published in Pangu format.

  • Data management: The platform provides dataset management and data evaluation. It manages datasets of different types, uses data quality evaluation to ensure that data meets the diversity, balance, and representativeness requirements of large model training, and promotes efficient data circulation and application.
    • Datasets: The platform manages datasets of different types, such as original datasets, processed datasets, and published datasets.
    • Data evaluation: Data evaluation checks the quality of datasets and evaluates multiple dimensions of data based on evaluation standards to detect and solve potential problems.

In addition to data acquisition, data processing, and data publishing, the platform supports one-stop management of original datasets, processed datasets, published datasets, and data synthesis instructions. When building large-scale datasets, the data engineering capabilities of ModelArts Studio offer significant flexibility and efficiency, enabling seamless collaboration in data processing workflows and rapid adaptation to evolving service and technical requirements.
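
As an illustration of the ratio-based data combination described above, the following is a minimal Python sketch, not the platform's implementation. The function name combine_by_ratio, the dataset names, and the 3:1 ratio are hypothetical. The sketch assumes each dataset is an in-memory list of samples and that samples are drawn without replacement.

```python
import random


def combine_by_ratio(datasets, ratios, total, seed=0):
    """Draw samples from each dataset in proportion to its ratio, then shuffle.

    datasets: dict mapping a dataset name to a list of samples
    ratios:   dict mapping the same names to relative weights, e.g. 3 and 1
    total:    target size of the combined dataset (rounding may shift it slightly)
    """
    if set(datasets) != set(ratios):
        raise ValueError("datasets and ratios must have the same keys")
    weight_sum = sum(ratios.values())
    if weight_sum <= 0:
        raise ValueError("ratios must sum to a positive number")

    rng = random.Random(seed)  # fixed seed so the combination is reproducible
    combined = []
    for name, samples in datasets.items():
        quota = round(total * ratios[name] / weight_sum)
        if quota > len(samples):
            raise ValueError(
                f"dataset {name!r} has {len(samples)} samples, but {quota} are needed"
            )
        combined.extend(rng.sample(samples, quota))  # sample without replacement
    rng.shuffle(combined)  # interleave the sources
    return combined


# Hypothetical example: mix Q&A and document samples at a 3:1 ratio.
datasets = {
    "qa": [f"qa-{i}" for i in range(100)],
    "docs": [f"doc-{i}" for i in range(100)],
}
mixed = combine_by_ratio(datasets, {"qa": 3, "docs": 1}, total=40)
```

With these inputs, 30 of the 40 combined samples come from the "qa" list and 10 from the "docs" list. Raising an error when a source is too small is a design choice of this sketch; a real workflow might instead repeat samples or lower the target size.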

Data Types Supported by the Platform

Table 1 lists the data types supported by ModelArts Studio. For details about the data format requirements of each type, see Dataset Format Requirements.

Table 1 Data types supported by the platform

Data Type

Content

Supported File Format

Text

Document

txt, mobi, epub, docx, and pdf

Web page

html

Pre-trained text

jsonl

Single-turn Q&A

jsonl and csv

Single-turn Q&A (with a system persona)

jsonl and csv

Multi-turn Q&A

jsonl

Multi-turn Q&A (with a system persona)

jsonl

Q&A ranking

jsonl and csv

Direct Preference Optimization (DPO)

jsonl

DPO (with a system persona)

jsonl

Image

Image Only

jpg, jpeg, png, bmp, and tar

Image + Caption

  • Supported image formats: JPG, JPEG, PNG, and BMP. All images must be saved as TAR packages.
  • Captions can be in JSONL format. JSONL files support only UTF-8 encoding.

Image + QA Pair

  • Supported image formats: JPG, JPEG, PNG, and BMP. All images must be saved as TAR packages.
  • QA pair format: JSONL. JSONL files support only UTF-8 encoding.

Object detection

  • Supported image formats: JPG, JPEG, PNG, BMP, TIF, and TIFF
  • Annotation file format: XML

Image classification

  • Supported image formats: JPG, JPEG, PNG, BMP, TIF, and TIFF
  • Annotation file format: TXT

Anomaly detection

  • Supported image formats: JPG, JPEG, PNG, and BMP
  • Annotation file format: TXT

Semantic segmentation

  • Image + XML
    • Supported image formats: JPG, JPEG, PNG, and BMP
    • Annotation file format: XML
  • Original image + annotated image + JSON
    • Supported image formats: JPG, JPEG, PNG, and BMP
    • Annotation file format: annotated image + JSON
  • Original image + annotated image + TXT
    • Supported image formats: JPG, JPEG, PNG, and BMP
    • Annotation file format: annotated image + TXT
  • Original image + annotated image: The original and annotated images can be in JPG, JPEG, PNG, or BMP format.
  • Image + PNG

Pose estimation

  • Image + JSON
    • Supported image formats: JPG, JPEG, PNG, and BMP
    • Annotation file format: JSON
  • Image + XML
    • Supported image formats: JPG, JPEG, PNG, and BMP
    • Annotation file format: XML. Each image corresponds to an annotation file.

Instance segmentation

  • Supported image formats: JPG, JPEG, PNG, and BMP
  • Annotation file format: XML

Change detection

  • Supported image formats: JPG, JPEG, and BMP
  • Annotation file format: PNG

Video

Video

MP4 and AVI

Video + Annotation

  • Supported video formats: MP4 and AVI
  • Annotation file format: JSONL. JSONL files support only UTF-8 encoding.

Video classification

  • Supported video formats: MP4 and AVI
  • Annotation file format: TXT. Each video corresponds to an annotation file.

Event detection

  • Supported video formats: MP4 and AVI. Each video must be at least 128 seconds long, with a frame rate of at least 10 FPS.
  • Annotation file format: JSON. A video can correspond to one or multiple annotation files.

Audio

Audio Only

mp3, flac, wav, opus, aac, and m4a. Audio files can be stored in multiple folders, and each folder can contain audio files in different formats.

Audio + Annotation

  • Supported audio formats: mp3, flac, wav, opus, aac, and m4a
  • Annotation file format: JSONL

    Note: JSONL files support only UTF-8 encoding.

Weather

Weather Data

nc, cdf, netcdf, gr, gr1, grb, grib, grb1, grib1, gr2, grb2, and grib2

Other

Customization

You can customize dataset types based on specific scenarios.
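
Before importing data, it can help to check files locally against the format requirements in Table 1. The following minimal Python sketch is one way to do that and is not a platform API. The names ALLOWED_EXTENSIONS, has_allowed_extension, and check_jsonl_bytes are hypothetical, and the dictionary keys are illustrative labels for a few rows of the table. Matching extensions case-insensitively and skipping blank lines are assumptions of this sketch; the platform's own validation may differ.

```python
import json
from pathlib import Path

# Extensions copied from Table 1 for a few content types.
# The keys are illustrative labels, not platform identifiers.
ALLOWED_EXTENSIONS = {
    "document": {".txt", ".mobi", ".epub", ".docx", ".pdf"},
    "web_page": {".html"},
    "single_turn_qa": {".jsonl", ".csv"},
    "video": {".mp4", ".avi"},
    "audio": {".mp3", ".flac", ".wav", ".opus", ".aac", ".m4a"},
}


def has_allowed_extension(filename, content_type):
    """Return True if the file extension is listed for the content type."""
    # Assumption: extensions are compared case-insensitively.
    return Path(filename).suffix.lower() in ALLOWED_EXTENSIONS[content_type]


def check_jsonl_bytes(raw):
    """Return a list of problems found in the raw bytes of a JSONL file.

    An empty list means the bytes decode as UTF-8 and every non-blank
    line parses as JSON.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"not valid UTF-8: {exc}"]

    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        try:
            json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: {exc.msg}")
    return problems
```

For a file on disk, the check could be run as check_jsonl_bytes(Path("train.jsonl").read_bytes()), where train.jsonl is a placeholder file name. The sketch only checks encoding and JSON syntax; it does not check the fields required by each dataset type, which are described in Dataset Format Requirements.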

Operations Supported by Each Type of Data

Table 2 lists the data engineering operations supported by each type of data.

Table 2 Operations supported by each type of data

Data Type

Data Acquisition

Data Processing

Data Synthesis

Data Labeling

Data Combination

Data Evaluation

Data Publishing

Text

Image

-

Audio

-

-

-

Weather

-

-

-

-

Other

-

-

-

-

-