Updated on 2025-07-02 GMT+08:00

Publishing Image Datasets

Data publishing converts a dataset into a specific format and releases it as a published dataset that can be used in subsequent model training.

Image datasets can be published in the following formats:

  • Standard format: The default dataset format on the platform, as shown in Figure 1. Datasets in this format can be published to assets but are not visible to downstream model development.
    Figure 1 Example of an image dataset in Standard format
  • Pangu format: To train a Pangu model, publish the dataset in Pangu format, as shown in Figure 2. The dataset will be used for model development on the ModelArts Studio Large Model Development Platform.
    Figure 2 Example of an image dataset in Pangu format

Creating an Image Dataset Publishing Task

To create an image dataset publishing task, perform the following steps:

  1. Log in to the ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
    Figure 3 My Spaces
  2. In the navigation pane, choose Data Engineering > Data Publishing > Publish Task. On the displayed page, click Create Data Publish Task in the upper right corner.
  3. On the Create Data Publish Task page, select a dataset modality, for example, Image > Image + Caption.
    Figure 4 Selecting dataset modality
  4. Select a dataset and click Next.
  5. In the Basic Configuration area, select the data usage, dataset visibility, and application scenario.

    Data engineering can interconnect with Pangu models. So that these models can be trained correctly on the datasets, the platform supports publishing datasets in different formats.

    Currently, the Standard and Pangu formats are supported.
    • Standard format: original format supported by the data project function. Datasets in this format can be published to assets but are not visible to downstream model development.
    • Pangu format: data format required for Pangu model training. The dataset will be used for model development on ModelArts Studio.

    If the dataset will be used to train Pangu models, set Format Configuration to Pangu format.

  6. Select Dataset splitting if required. If it is selected, set the ratio of training data to validation data, as shown in Figure 5.
    Figure 5 Dataset splitting
  7. Enter the dataset name and description, set extended information, and click OK to publish the dataset.

    When the task status is Succeeded, the dataset has been published. To view it, choose Data Engineering > Data Publishing > Datasets in the navigation pane and click the Published Dataset tab.

  8. On the Publish Task page, click the task ID to view the task details, as shown in Figure 6. The Data Publish Detail page has two tabs: Basic Info and Log Management. The Basic Info tab contains the Job Detail Info, Configuration Data, Data Source, and Generate Dataset areas. The Generate Dataset table displays the number of records in the training set and in the validation set, as shown in Figure 7.
    Figure 6 Viewing task details
    Figure 7 Example of the Generate Dataset table
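
The dataset splitting option in step 6 divides the published records between a training set and a validation set according to the ratio you set. The following Python sketch illustrates the general idea only. It is not the platform's implementation: the split_dataset function, the shuffling step, the rounding rule, and the sample image + caption records are all assumptions made for illustration.

```python
import random


def split_dataset(records, train_ratio=0.8, seed=0):
    """Shuffle records and split them into (training, validation) lists.

    Hypothetical helper for illustration; the platform's actual splitting
    logic is not described in this document.
    """
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be between 0 and 1 (exclusive)")
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatability
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Made-up image + caption records, used only to demonstrate the split.
records = [
    {"image": f"img_{i:03d}.jpg", "caption": f"caption {i}"}
    for i in range(10)
]

train, val = split_dataset(records, train_ratio=0.8)
print(len(train), len(val))
```

With a ratio of 0.8 and 10 records, this sketch yields 8 training records and 2 validation records. After publishing, the actual counts produced by the platform can be checked in the Generate Dataset table.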