Data Engineering

ModelArts Studio provides data engineering functions that cover the key stages of dataset preparation, including data acquisition, processing, labeling, evaluation, and publishing. These functions help you build high-quality training datasets for AI applications. The functions are as follows:

Figure 1 Data processes

**Table 1** Operations supported by each type of data

| Data Type | Data Acquisition | Data Processing | Data Synthesis | Data Labeling | Data Combination | Data Evaluation | Data Publishing |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Text | √ | √ | √ | √ | √ | √ | √ |
| Other | √ | √ | - | - | - | - | √ |

Data acquisition: You can import text data and other user-defined types of data into ModelArts Studio. The platform supports flexible data ingestion and multiple file formats.
Data processing: For text data, ModelArts Studio provides data extraction, filtering, conversion, tagging, and scoring. The platform provides dedicated cleaning operators for each type of dataset and allows you to create custom operators for specific cleaning requirements. You can adjust the operator sequence and customize cleaning templates to make cleaning more efficient, process data at scale, and ensure that the generated datasets meet training standards.
Data synthesis: The platform allows you to use preset or custom data instructions to process pre-trained text, single-turn Q&A, and single-turn Q&A (with a system persona) datasets, and generate new data based on a specified number of epochs. Data synthesis generates a large volume of high-quality training data, which can be used for pre-training large models to enhance their generalization and performance.
Data labeling: The platform allows you to label or re-label data to improve annotation quality. You can choose from labeling options tailored to different datasets, and the platform supports labeling, review, and the transfer of labeling tasks. For text datasets, the platform also offers AI pre-labeling based on the Pangu models, which reduces the workload and cost of manual labeling and improves labeling efficiency.
Data proportioning: The platform allows you to adjust the data proportions in text datasets. You can select multiple datasets and adjust the proportions of data from different sources or types to optimize model training. The purpose of data proportioning is to ensure that the model learns more thoroughly from diverse data, which improves its generalization and performance.
Data publishing: You can publish a processed dataset in several formats, including the standard format and the Pangu format. Text datasets in particular can be converted into the Pangu format for training Pangu models.
Data management: The platform supports full-link lineage tracing. To view the operations performed on a dataset, click the dataset name and open the Data Lineage tab page. Lineage tracing helps you analyze the impact of a dataset in both the forward and backward directions, trace data sources, identify issues quickly, and improve data O&M and governance efficiency. The platform also offers a labeling system that supports data classification by industry sector and by security level based on industry standards, together with built-in scenario classification labels. This facilitates data classification, data quality control, and data asset management.
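
To make the idea of an adjustable operator sequence in data processing more concrete, the following Python sketch models a cleaning template as an ordered list of operators. It is only an illustration: the `Operator` type, `strip_whitespace`, `drop_short`, and `run_pipeline` are hypothetical names and do not represent the ModelArts Studio operator API.

```python
from typing import Callable, Iterable, Optional

# Hypothetical operator type (an assumption, not a platform API): takes one
# text record and returns the processed record, or None to drop the record.
Operator = Callable[[str], Optional[str]]


def strip_whitespace(text: str) -> Optional[str]:
    """Conversion operator: collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def drop_short(min_chars: int) -> Operator:
    """Build a custom filtering operator that drops records shorter than min_chars."""
    def op(text: str) -> Optional[str]:
        return text if len(text) >= min_chars else None
    return op


def run_pipeline(records: Iterable[str], operators: list[Operator]) -> list[str]:
    """Apply the operators in order; stop processing a record once it is dropped."""
    cleaned = []
    for record in records:
        current: Optional[str] = record
        for op in operators:
            current = op(current)
            if current is None:
                break
        if current is not None:
            cleaned.append(current)
    return cleaned


# A "cleaning template" here is simply a reusable, ordered operator list.
template = [strip_whitespace, drop_short(5)]
result = run_pipeline(["  hello   world ", " hi ", "data   engineering"], template)
print(result)  # ['hello world', 'data engineering']
```

The order of operators matters in this sketch: a record such as `"  a  b  "` passes `drop_short(5)` if the filter runs before whitespace is collapsed, but is dropped if the filter runs afterward.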
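
Data proportioning can be pictured as sampling from several source datasets according to target shares. The sketch below assumes in-memory lists of text records and a hypothetical `mix_datasets` helper; the platform's actual proportioning logic is not described in this topic and may differ.

```python
import math
import random


def mix_datasets(
    sources: dict[str, list[str]],
    proportions: dict[str, float],
    total: int,
    seed: int = 0,
) -> list[str]:
    """Draw roughly `total` records so that each source contributes its share.

    Hypothetical helper for illustration only. Per-source counts are rounded,
    so the result can differ from `total` by a few records.
    """
    if not math.isclose(sum(proportions.values()), 1.0):
        raise ValueError("proportions must sum to 1.0")

    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed: list[str] = []
    for name, share in proportions.items():
        pool = sources[name]
        count = round(total * share)
        if count <= len(pool):
            # Enough records: sample without replacement.
            mixed.extend(rng.sample(pool, count))
        else:
            # Source is too small: sample with replacement (upsampling).
            mixed.extend(rng.choices(pool, k=count))
    rng.shuffle(mixed)
    return mixed


# Example with made-up dataset names: 70% news text, 30% code text.
sources = {
    "news": [f"news-{i}" for i in range(20)],
    "code": [f"code-{i}" for i in range(20)],
}
mixed = mix_datasets(sources, {"news": 0.7, "code": 0.3}, total=10)
print(len(mixed), sum(r.startswith("news") for r in mixed))  # 10 7
```

Sampling with replacement when a source is too small is one possible design choice; capping the count at the size of the source would be an equally reasonable alternative.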
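
Forward and backward lineage analysis can be understood as graph traversal over the operations that link datasets. The following sketch assumes a simple in-memory graph; the `LineageGraph` class and the dataset names are hypothetical and are not part of the platform.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Minimal lineage store: each edge is (source dataset, operation, target dataset)."""

    def __init__(self) -> None:
        self._children: dict[str, set[tuple[str, str]]] = defaultdict(set)
        self._parents: dict[str, set[tuple[str, str]]] = defaultdict(set)

    def record(self, source: str, operation: str, target: str) -> None:
        """Register that `operation` on `source` produced `target`."""
        self._children[source].add((operation, target))
        self._parents[target].add((operation, source))

    @staticmethod
    def _walk(start: str, edges: dict[str, set[tuple[str, str]]]) -> set[str]:
        """Breadth-first traversal that collects every dataset reachable from `start`."""
        seen: set[str] = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for _operation, neighbor in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    def downstream(self, dataset: str) -> set[str]:
        """Forward analysis: datasets derived from `dataset`."""
        return self._walk(dataset, self._children)

    def upstream(self, dataset: str) -> set[str]:
        """Backward analysis: datasets that `dataset` was derived from."""
        return self._walk(dataset, self._parents)


graph = LineageGraph()
graph.record("raw_text", "processing", "cleaned_text")
graph.record("cleaned_text", "labeling", "labeled_text")
graph.record("labeled_text", "publishing", "published_text")

print(sorted(graph.downstream("cleaned_text")))  # ['labeled_text', 'published_text']
print(sorted(graph.upstream("labeled_text")))    # ['cleaned_text', 'raw_text']
```

In this model, a quality issue found in `cleaned_text` points forward to every dataset that may be affected and backward to the sources that may have introduced it.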

Together, these features enable you to efficiently create high-quality training datasets for AI research and development and, through end-to-end data processing and management, to explore the relationships between data and model performance. This provides a solid data foundation for model training and continuous optimization, improving both the efficiency of AI application development and the reliability of outcomes.
