Introduction to Data Engineering
What Is Data Engineering?
Data engineering is a one-stop data processing and management feature provided by ModelArts Studio Large Model Development Platform. Its goal is to facilitate the efficient and accurate utilization of data for large-scale model training through data acquisition, processing, and publishing. This feature aids in efficient data management and processing, enhances data quality and processing efficiency, and lays a robust data foundation for large model development.
Data engineering provides the following functions:
- Data acquisition: Data acquisition is the first step of data engineering. Data from different sources and in different formats can be imported to the platform to generate an original dataset.
  - Data can be imported through OBS.
  - Supported data types: text and other.
  - Custom format: You can flexibly upload data in custom formats based on service requirements.

  With these functions, you can easily import a large amount of data to the platform to prepare for subsequent operations such as data processing and model training.
- Data processing: The platform provides data processing, data synthesis, data labeling, and data combination to ensure that raw data meets various service requirements and model training standards, generating a processed dataset.
  - Data processing: Data processing preprocesses data using dataset processing operators. Dedicated operators are designed for different types of datasets to ensure that data meets model training standards and service requirements. For example, operators can remove redundant characters from text or adjust image sizes to meet the format requirements of Pangu model training.
  - Data synthesis: Data synthesis uses a preset or custom data instruction to process the original data and generate new data, for example, generating a single turn of Q&A from a given text (with a persona). When the data volume is insufficient, these new samples help the model learn and generalize better.
  - Data labeling: Data labeling adds accurate labels to unlabeled datasets, helping the model understand the relationship between inputs and outputs so that it can learn and predict effectively. The quality of labeled data directly affects the training effect and accuracy of models.
  - Data combination: Data combination combines multiple datasets into one processed dataset based on a specific ratio. A proper ratio ensures the diversity, balance, and representativeness of datasets.

  Through data processing, the platform can effectively remove noise data and standardize data formats, improving the overall quality of datasets.
- Data publishing: The platform allows you to publish datasets in different modalities and formats, generating published datasets.
  - Data publishing refers to publishing a dataset in a specific format as a published dataset for subsequent model training operations. The following dataset formats are supported: standard format and Pangu format (applicable to Pangu model training). Currently, only text and image datasets can be published in Pangu format.
- Data management: The platform provides dataset management and data evaluation to manage datasets of different types, uses data quality evaluation to ensure that data meets the diversity, balance, and representativeness requirements of large model training, and promotes efficient data circulation and application.
  - Datasets: The platform manages datasets of different types, such as original datasets, processed datasets, and published datasets.
  - Data evaluation: Data evaluation checks the quality of datasets and evaluates multiple dimensions of data against evaluation standards to detect and resolve potential problems.
In addition to data acquisition, data processing, and data publishing, the platform supports one-stop management of original datasets, processed datasets, published datasets, and data synthesis instructions. When building large-scale datasets, the data engineering capabilities of ModelArts Studio offer significant flexibility and efficiency, enabling seamless collaboration in data processing workflows and rapid adaptation to evolving service and technical requirements.
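To illustrate the data combination step described above, the following is a minimal sketch of ratio-based dataset mixing. The `combine_datasets` helper and its parameters are hypothetical illustrations, not the platform's API; the platform performs this step for you.

```python
import random

def combine_datasets(datasets, ratios, total, seed=0):
    """Sample records from several datasets according to a mixing ratio.

    datasets: dict mapping a name to a list of records
    ratios:   dict mapping the same names to relative weights
    total:    number of records in the combined dataset
    """
    rng = random.Random(seed)
    weight_sum = sum(ratios.values())
    combined = []
    for name, records in datasets.items():
        quota = round(total * ratios[name] / weight_sum)
        if quota <= len(records):
            combined.extend(rng.sample(records, quota))
        else:
            # Sample with replacement if a source is smaller than its quota.
            combined.extend(rng.choices(records, k=quota))
    rng.shuffle(combined)  # interleave the sources for training
    return combined

qa = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(100)]
docs = [{"text": f"doc{i}"} for i in range(50)]
# Mix Q&A and document records at a 7:3 ratio.
mix = combine_datasets({"qa": qa, "docs": docs}, {"qa": 0.7, "docs": 0.3}, total=10)
print(len(mix))  # 10
```

A ratio like 7:3 here is purely illustrative; a proper ratio depends on the diversity and balance requirements of the target model.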
Data Types Supported by the Platform
Table 1 lists the data types supported by ModelArts Studio. For details about the data format requirements of each type, see Dataset Format Requirements.
| Data Type | Content | Supported File Formats |
|---|---|---|
| Text | Document | txt, mobi, epub, docx, and pdf |
| Text | Web page | html |
| Text | Pre-trained text | jsonl |
| Text | Single-turn Q&A | jsonl and csv |
| Text | Single-turn Q&A (with a system persona) | jsonl and csv |
| Text | Multi-turn Q&A | jsonl |
| Text | Multi-turn Q&A (with a system persona) | jsonl |
| Text | Q&A sorting | jsonl and csv |
| Text | Direct Preference Optimization (DPO) | jsonl |
| Text | DPO (with a system persona) | jsonl |
| Other | Customization | You can customize dataset types based on specific scenarios. |
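Many of the text dataset types in Table 1 use the JSON Lines (jsonl) format, in which each line is one standalone JSON object. The record fields below (`question`, `answer`) are hypothetical; the actual schema for each dataset type is defined in Dataset Format Requirements.

```python
import io
import json

# Hypothetical single-turn Q&A records; real field names are set by
# the platform's Dataset Format Requirements.
records = [
    {"question": "What is data engineering?",
     "answer": "A one-stop data processing and management feature."},
    {"question": "Which file formats does single-turn Q&A support?",
     "answer": "jsonl and csv."},
]

# Write JSON Lines: one JSON object per line (here to an in-memory buffer).
buf = io.StringIO()
for rec in records:
    buf.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read it back line by line.
loaded = [json.loads(line) for line in buf.getvalue().splitlines()]
print(loaded == records)  # True
```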
Operations Supported by Each Type of Data
Table 2 lists the data engineering operations supported by each type of data.