Updated on 2025-07-28 GMT+08:00

Dataset Publishing Scenarios

Introduction to Dataset Publishing

ModelArts Studio provides data evaluation and data publishing functions. Data quality evaluation helps ensure that data meets the diversity, balance, and representativeness requirements of large model training, and data publishing makes the evaluated data available for efficient circulation and use.

Data publishing involves more than exporting data in a suitable format. It also includes evaluating the dataset against task requirements to ensure that the dataset meets model training standards in terms of scale, quality, and content.

  • Data evaluation

    The platform offers predefined evaluation standards for multiple types of data. You can choose from these predefined standards or customize evaluation standards as needed to precisely improve data quality, ensure that data meets high standards, and enhance model performance.

  • Data publishing

    Data publishing converts a dataset into a specific format and releases it as a published dataset for subsequent model training. Two formats are supported: standard format and Pangu format (for Pangu model training). Currently, only text and image datasets can be published in Pangu format.

Together, these functions help you manage and publish datasets systematically so that dataset quality meets the requirements of large model training, which in turn improves subsequent training results.

Significance of Data Publishing

Data publishing allows you to convert data into different formats and to evaluate the dataset against task requirements to ensure that the data meets training standards in terms of scale, quality, and content. Specifically, data publishing provides the following benefits:

  • Support for multiple formats

    For text and image datasets, the platform supports multiple data formats, including standard format and Pangu format, to meet different training requirements. You can convert datasets into a required format to ensure that the data is compatible with specific models (such as Pangu models) and to optimize the training effect.

  • Improved training efficiency

    Publishing datasets that comply with standards improves data processing efficiency, reduces the need for later adjustments, and lets you move on to the model training phase sooner.

Dataset publishing is a key step in data engineering to ensure that datasets meet model training requirements. With the data publishing function provided by the platform, you can flexibly select the format of published data based on specific task requirements to ensure data compatibility and consistency, laying a solid foundation for subsequent model training and application deployment.

Dataset Types That Support Data Publishing

Table 1 lists the dataset types that support data publishing.

Table 1 Dataset types that support data publishing

Data Type     Data Evaluation   Data Publishing
Text          √                 √
Image         √                 √
Video         √                 √
Weather       -                 √
Prediction    -                 √
Other         -                 √

ModelArts Studio allows you to publish text and image datasets in either of the following formats:

  • Standard format: applies to a wide range of data application scenarios and meets most standard requirements of model training. Datasets in this format are published to assets, but are invisible to downstream model development personnel.
  • Pangu format: a format designed for Pangu model training to ensure the compatibility and consistency of datasets in Pangu model training. Datasets in this format will be used for model development on ModelArts Studio.

Datasets other than text and image datasets can only be published in standard format.
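
The format rule above can be expressed as a small lookup. The following Python sketch is illustrative only: the names PublishFormat, DATASET_TYPES, PANGU_CAPABLE_TYPES, allowed_publish_formats, and can_publish are hypothetical and are not part of any ModelArts Studio API. It only encodes the rule stated in this section, namely that every dataset type in Table 1 can be published in standard format and that only text and image datasets can also be published in Pangu format.

```python
from enum import Enum
from typing import Set


class PublishFormat(Enum):
    """Publishing formats named in this section (hypothetical identifiers)."""
    STANDARD = "standard"
    PANGU = "pangu"


# Dataset types listed in Table 1. The lowercase identifiers are an assumption
# made for this sketch, not names used by the platform.
DATASET_TYPES = {"text", "image", "video", "weather", "prediction", "other"}

# Per this section, only text and image datasets can currently be published
# in Pangu format.
PANGU_CAPABLE_TYPES = {"text", "image"}


def allowed_publish_formats(dataset_type: str) -> Set[PublishFormat]:
    """Return the formats in which a dataset of the given type can be published."""
    kind = dataset_type.strip().lower()
    if kind not in DATASET_TYPES:
        raise ValueError(f"Unknown dataset type: {dataset_type!r}")
    formats = {PublishFormat.STANDARD}  # every listed type supports standard format
    if kind in PANGU_CAPABLE_TYPES:
        formats.add(PublishFormat.PANGU)
    return formats


def can_publish(dataset_type: str, fmt: PublishFormat) -> bool:
    """Check whether a dataset type can be published in the requested format."""
    return fmt in allowed_publish_formats(dataset_type)
```

For example, can_publish("image", PublishFormat.PANGU) is true, while can_publish("video", PublishFormat.PANGU) is false because video datasets can only be published in standard format.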