Data Engineering
ModelArts Studio provides comprehensive data engineering functions. These functions cover key stages such as data acquisition, processing, labeling, evaluation, and publishing, enabling you to efficiently build high-quality training datasets and facilitate the successful implementation of AI applications. The functions are as follows:
- Data acquisition: You can easily import various types of data into ModelArts Studio, including text, image, video, and weather data. ModelArts Studio supports flexible data ingestion and multiple file formats to cater to the needs of different service scenarios.
- Data processing: ModelArts Studio provides powerful data processing functions, including data extraction, filtering, conversion, tagging, and scoring for text, video, image, and weather data. The platform provides dedicated cleaning operators for different types of datasets and allows you to create custom operators to meet personalized data cleaning requirements. These functions ensure the generation of high-quality training data to meet both business and model training needs. You can flexibly adjust the operator sequence and customize cleaning templates to enhance data cleaning efficiency, support large-scale data processing, and ensure that generated datasets meet training standards.
- Data synthesis: The platform allows you to use preset or custom data instructions to process pre-trained text, single-turn Q&A, and single-turn Q&A (with a system persona) datasets, and generate new data based on a specified number of epochs. Data synthesis generates a large volume of high-quality training data, which can be used for pre-training large models to enhance their generalization and performance.
- Data labeling: The platform allows you to label or re-label data to improve the quality of dataset annotations. You can flexibly choose from various labeling options tailored to different datasets, and facilitate labeling, review, and labeling task transfer. Additionally, the platform offers AI pre-labeling capabilities for both text and image datasets. Leveraging the intelligent capabilities of the Pangu models, this feature significantly reduces the workload and costs associated with manual labeling, thereby greatly improving overall labeling efficiency.
- Data evaluation: The platform evaluates the quality of processed data in multiple formats, such as text, images, and videos. It provides preset basic evaluation criteria that you can either adopt directly or customize to meet personalized data quality requirements. Detailed quality evaluation reports are then generated, enabling you to verify the accuracy, integrity, and consistency of your data. This helps ensure data quality before model training and supports the reliability and stability of models in real-world applications.
- Data proportioning: The platform allows you to flexibly adjust the data proportions in text or image datasets. You can select multiple datasets and adjust the proportions of data from different sources or types to optimize the model training process. The purpose of data proportioning is to ensure that the model can more thoroughly learn and understand diverse data, thereby enhancing its generalization capability and performance.
- Data publishing: The platform allows you to publish a processed dataset in a variety of formats, including standard and Pangu formats. For text and image datasets in particular, the platform can convert the data into Pangu format so that it can be used to train Pangu models, providing efficient data support for subsequent model training.
- Data management: The platform supports full-link lineage tracing. You can click a dataset name to view the operations performed on the dataset on the Data Lineage tab page. Full-link lineage tracing helps you analyze the impact of datasets in both forward and backward directions, quickly identify issues, and improve data O&M and governance efficiency. This also helps you better trace data sources. Moreover, the platform offers a comprehensive labeling system that supports data classification based on industry standards for both industry sectors and security levels, as well as built-in scenario classification labels. This facilitates data classification, data quality control, and data asset management, thereby enhancing the efficiency and effectiveness of data governance.
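To illustrate why the order of cleaning operators described under data processing matters, the following is a minimal conceptual sketch in Python. It is not the ModelArts Studio operator interface: the `Operator` type, the two example operators, and `run_pipeline` are hypothetical names introduced only for illustration.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical operator shape (an assumption, not a ModelArts Studio API):
# an operator takes one text record and returns the cleaned record,
# or None to drop the record from the dataset.
Operator = Callable[[str], Optional[str]]


def strip_whitespace(record: str) -> Optional[str]:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(record.split())


def drop_short(record: str, min_chars: int = 10) -> Optional[str]:
    """Drop records shorter than min_chars characters."""
    return record if len(record) >= min_chars else None


def apply_operators(record: str, operators: List[Operator]) -> Optional[str]:
    """Apply operators in order; stop as soon as one drops the record."""
    for op in operators:
        result = op(record)
        if result is None:
            return None
        record = result
    return record


def run_pipeline(records: Iterable[str], operators: List[Operator]) -> List[str]:
    """Run every record through the operator sequence and keep the survivors."""
    cleaned = []
    for record in records:
        result = apply_operators(record, operators)
        if result is not None:
            cleaned.append(result)
    return cleaned


raw = ["  hello   world, this is text  ", "short", "another   usable   sample"]
result = run_pipeline(raw, [strip_whitespace, drop_short])
```

In this sketch, running `strip_whitespace` before `drop_short` means the length filter sees the normalized text. Reversing the two operators can keep records that consist mostly of padding, which is one reason an adjustable operator sequence is useful.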
By integrating these features, data engineering not only enables you to efficiently create high-quality training datasets for AI research and development but also explores the intrinsic relationships between data and model performance through end-to-end data processing and management. This provides a robust data foundation for model training and application, promoting precise model training and continuous optimization, and ultimately improving the efficiency of AI application development and the reliability of outcomes.
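As a concrete illustration of the data proportioning function described earlier, the following minimal Python sketch mixes records from several source datasets according to target proportions. It is a conceptual example under stated assumptions, not how the platform implements proportioning: the function `mix_by_proportion`, its parameters, and the sample data are all hypothetical.

```python
import random
from typing import Dict, List


def mix_by_proportion(
    sources: Dict[str, List[str]],
    proportions: Dict[str, float],
    total: int,
    seed: int = 0,
) -> List[str]:
    """Sample records from each source so the mix follows the given proportions.

    Hypothetical helper for illustration only. Per-source counts are rounded,
    so for some inputs the final size can differ slightly from `total`.
    """
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1.0")

    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed: List[str] = []
    for name, share in proportions.items():
        count = round(total * share)
        pool = sources[name]
        if count > len(pool):
            raise ValueError(f"source {name!r} has only {len(pool)} records")
        # Sample without replacement so no record is duplicated.
        mixed.extend(rng.sample(pool, count))

    rng.shuffle(mixed)  # interleave records from the different sources
    return mixed


# Illustrative data: two small source datasets mixed 70% / 30%.
sources = {
    "news": [f"news-{i}" for i in range(20)],
    "code": [f"code-{i}" for i in range(20)],
}
mixed = mix_by_proportion(sources, {"news": 0.7, "code": 0.3}, total=10)
```

The sketch samples without replacement and raises an error when a source is too small for its share. Oversampling a small source instead is another possible design choice, at the cost of duplicated records.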