Introduction to Data Engineering
What Is Data Engineering?
Data engineering is a one-stop data processing and management feature of the ModelArts Studio Large Model Development Platform. It covers data acquisition, processing, and publishing so that data can be used efficiently and accurately for large model training. By improving data quality and processing efficiency, it lays a solid data foundation for large model development.
Data engineering provides the following functions:
- Data acquisition: Data acquisition is the first step of data engineering. You can import data from different sources and in different formats to the platform to generate an original dataset.
  - Data can be imported through OBS.
  - The following data types are supported: Text, Image, Video, Audio, Weather, and Other.
  - Custom format: You can upload data in custom formats based on service requirements.
  With these functions, you can import large amounts of data to the platform in preparation for subsequent operations such as data processing and model training.
- Data processing: The platform provides data processing, data synthesis, data labeling, and data combination to ensure that raw data meets service requirements and model training standards. These operations generate a processed dataset.
  - Data processing: Preprocesses data by using dataset processing operators. Dedicated operators are provided for each type of dataset to ensure that data meets model training standards and service requirements.
  - Data synthesis: Processes the original data by using a preset or custom data instruction and generates new data over a specified number of epochs.
  - Data labeling: Adds accurate labels to unlabeled datasets. The quality of labeled data directly affects model training results and accuracy. The platform supports manual labeling and AI pre-labeling for different datasets.
  - Data combination: Combines multiple datasets into a processed dataset based on a specific ratio. A proper ratio ensures the diversity, balance, and representativeness of the resulting dataset.
  Through data processing, the platform removes noisy data and standardizes data formats, which helps improve the overall quality of datasets.
- Data publishing: The platform allows you to publish datasets in different modalities and formats, generating published datasets.
  - Data publishing refers to publishing a dataset in a specific format as a published dataset for subsequent model training operations.
  The following dataset formats are supported: standard format and Pangu format (applicable to Pangu model training). Currently, only text and image datasets can be published in Pangu format.
- Data management: The platform provides dataset management and data evaluation. Together they manage datasets of different types, use data quality evaluation to ensure that data meets the diversity, balance, and representativeness requirements of large model training, and promote efficient data circulation and application.
  - Datasets: The platform manages datasets of different types, such as original datasets, processed datasets, and published datasets.
  - Data evaluation: Checks the quality of datasets across multiple dimensions based on evaluation standards to detect and resolve potential problems.
In addition to data acquisition, data processing, and data publishing, the platform supports one-stop management of original datasets, processed datasets, published datasets, and data synthesis instructions. When building large-scale datasets, the data engineering capabilities of ModelArts Studio offer significant flexibility and efficiency, enabling seamless collaboration in data processing workflows and rapid adaptation to evolving service and technical requirements.
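The data combination step described above (merging several datasets into one based on a specific ratio) can be illustrated with a minimal Python sketch. This is not the platform's implementation or API: the function name `combine_by_ratio`, its parameters, and the sampling strategy (random sampling without replacement, quotas rounded down and capped at the size of each source) are assumptions made purely for illustration.

```python
import random


def combine_by_ratio(datasets, ratios, total, seed=0):
    """Sample about `total` records from several datasets according to `ratios`.

    Hypothetical helper for illustration only; not a ModelArts Studio API.
    """
    if set(datasets) != set(ratios):
        raise ValueError("datasets and ratios must have the same keys")
    weight_sum = sum(ratios.values())
    if weight_sum <= 0:
        raise ValueError("ratios must sum to a positive value")

    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    combined = []
    for name, records in datasets.items():
        # Share of the target size assigned to this source, rounded down.
        quota = int(total * ratios[name] / weight_sum)
        # Never request more records than the source actually holds.
        quota = min(quota, len(records))
        combined.extend(rng.sample(records, quota))
    rng.shuffle(combined)  # interleave records from the different sources
    return combined


# Example: mix two toy datasets at a 3:1 ratio.
datasets = {
    "qa": [f"qa-{i}" for i in range(100)],
    "docs": [f"doc-{i}" for i in range(100)],
}
mixed = combine_by_ratio(datasets, {"qa": 3, "docs": 1}, total=40)
```

Because quotas are rounded down and capped, the result can contain fewer than `total` records when a source is too small or the ratios do not divide the target evenly.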
Data Types Supported by the Platform
Table 1 lists the data types supported by ModelArts Studio. For details about the data format requirements of each type, see Dataset Format Requirements.
| Data Type | Content | Supported File Format |
|---|---|---|
| Text | Document | txt, mobi, epub, docx, and pdf |
| Text | Web page | html |
| Text | Pre-trained text | jsonl |
| Text | Single-turn Q&A | jsonl and csv |
| Text | Single-turn Q&A (with a system persona) | jsonl and csv |
| Text | Multi-turn Q&A | jsonl |
| Text | Multi-turn Q&A (with a system persona) | jsonl |
| Text | Q&A ranking | jsonl and csv |
| Text | Direct Preference Optimization (DPO) | jsonl |
| Text | DPO (with a system persona) | jsonl |
| Image | Image Only | jpg, jpeg, png, bmp, and tar |
| Image | Image + Caption | |
| Image | Image + QA Pair | |
| Image | Object detection | |
| Image | Image classification | |
| Image | Anomaly detection | |
| Image | Semantic segmentation | |
| Image | Pose estimation | |
| Image | Instance segmentation | |
| Image | Change detection | |
| Video | Video | MP4 and AVI |
| Video | Video + Annotation | |
| Video | Video classification | Supported file format: video + TXT. Video format: MP4 and AVI. Annotation file format: TXT. Each video corresponds to an annotation file. |
| Video | Event detection | |
| Audio | Audio Only | mp3, flac, wav, opus, aac, and m4a. All audio files can be stored in multiple folders. Each folder can contain audio files in different formats. |
| Audio | Audio + Annotation | |
| Weather | Weather Data | nc, cdf, netcdf, gr, gr1, grb, grib, grb1, grib1, gr2, grb2, and grib2 |
| Other | Customization | You can customize dataset types based on specific scenarios. |
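Before importing files, it can help to check their extensions against the table above. The following Python sketch shows one way to do that locally. It is not part of the platform: the names `SUPPORTED_EXTENSIONS` and `is_supported` are hypothetical, only some rows that list explicit extensions are included, and case-insensitive matching is an assumption (the table mixes upper and lower case).

```python
from pathlib import Path

# Extensions copied from the table above. Only some rows that list explicit
# file extensions are included; this mapping is an illustrative assumption,
# not a platform API.
SUPPORTED_EXTENSIONS = {
    ("Text", "Document"): {"txt", "mobi", "epub", "docx", "pdf"},
    ("Text", "Web page"): {"html"},
    ("Text", "Pre-trained text"): {"jsonl"},
    ("Text", "Single-turn Q&A"): {"jsonl", "csv"},
    ("Text", "Multi-turn Q&A"): {"jsonl"},
    ("Image", "Image Only"): {"jpg", "jpeg", "png", "bmp", "tar"},
    ("Video", "Video"): {"mp4", "avi"},
    ("Audio", "Audio Only"): {"mp3", "flac", "wav", "opus", "aac", "m4a"},
}


def is_supported(filename, data_type, content):
    """Return True if the file extension is listed for the given table row."""
    key = (data_type, content)
    if key not in SUPPORTED_EXTENSIONS:
        raise KeyError(f"no extension list recorded for {key!r}")
    # Assumption: extensions are compared case-insensitively.
    extension = Path(filename).suffix.lower().lstrip(".")
    return extension in SUPPORTED_EXTENSIONS[key]


# Example checks against three rows of the table.
pdf_ok = is_supported("report.pdf", "Text", "Document")
pdf_as_web_page = is_supported("report.pdf", "Text", "Web page")
video_ok = is_supported("clip.MP4", "Video", "Video")
```

A check like this only inspects file names. The content requirements for each type are described in Dataset Format Requirements.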
Operations Supported by Each Type of Data
Table 2 lists the data engineering operations supported by each type of data.