Updated on 2025-07-02 GMT+08:00

Publishing Text Datasets

Data publishing converts a dataset into a specific format and releases it as a published dataset that can be used for subsequent model training.

Text datasets can be published in the following formats:

  • Standard format: original format supported by the data project function.
    The following is an example of the standard format, where context and target are key-value pairs.
    {"context": "Hello, please introduce yourself.","target": "I am a Pangu model."}
  • Pangu format: To train a Pangu model, publish the dataset in Pangu format.
    The following is an example of the Pangu format, where context and target are key-value pairs. Different from the standard format, context is an array.
    {"context": ["Hello, please introduce yourself."],"target": "I am a Pangu model."}
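
The difference between the two formats can be illustrated with a short Python sketch. It is not part of the platform and only mirrors the description above: it assumes each record is a single JSON object, and that a single-turn context becomes a one-element array in Pangu format. The helper names to_pangu_record and convert_lines are hypothetical. How multi-turn context is laid out in the array is not specified here, so the sketch leaves existing arrays unchanged.

```python
import json


def to_pangu_record(record: dict) -> dict:
    """Return a copy of a standard-format record with context as an array.

    Assumption: a single-turn context string maps to a one-element list.
    """
    context = record["context"]
    if isinstance(context, str):
        context = [context]
    # A context that is already a list is passed through unchanged.
    return {"context": context, "target": record["target"]}


def convert_lines(lines):
    """Convert an iterable of JSON strings (one record each), skipping blanks."""
    converted = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = to_pangu_record(json.loads(line))
        # ensure_ascii=False keeps non-ASCII text readable in the output.
        converted.append(json.dumps(record, ensure_ascii=False))
    return converted


standard = '{"context": "Hello, please introduce yourself.","target": "I am a Pangu model."}'
print(convert_lines([standard])[0])
```

On the platform itself, the format is chosen when the publishing task is created, so no manual conversion is required; the sketch only shows what changes in the record.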

Creating a Text Dataset Publishing Task

To create a text dataset publishing task, perform the following steps:

  1. Log in to ModelArts Studio Large Model Development Platform. In the My Spaces area, click the required workspace.
    Figure 1 My Spaces
  2. In the navigation pane, choose Data Engineering > Data Publishing > Publish Task. On the displayed page, click Create Data Publish Task in the upper right corner.
  3. On the Create Data Publish Task page, select a dataset modality, for example, Text > Pre-trained Text.
    Figure 2 Selecting dataset modality
  4. Select a dataset and click Next.
  5. In the Basic Configuration area, select the data usage, dataset visibility, and application scenario.

    Data engineering supports interconnection with Pangu models. So that these models can be trained on the datasets correctly, the platform supports publishing datasets in different formats.

    Currently, the standard and Pangu formats are supported.
    • Standard format: original format supported by the data project function. Datasets in this format can be published to assets but are not visible to downstream model development. This format is used when Application Scenario is set to Other.
    • Pangu format: data format required for Pangu model training. The dataset will be used for model development on ModelArts Studio. This format is used when Application Scenario is set to STUDIO Model Training.

    If the dataset will be used to train Pangu models, set Select Format to Pangu format.

  6. Enter the dataset name and description, set extended information, and click Next.
  7. On the Task Configuration page, configure resources.
    Expand Resource Allocation to configure task resources. You can also customize parameters. Click Add Parameters and enter the parameter name and value.
    Figure 3 Resource Allocation
  8. After the task is configured, click OK to publish the dataset. A task status of Succeeded indicates that the data publishing task has completed. To view the published dataset, choose Data Engineering > Data Publishing > Datasets in the navigation pane and click the Published Dataset tab.
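
The format rule described in step 5 can be summarized as a small lookup. The following Python sketch is illustrative only and is not a platform API: the function name format_for_scenario is hypothetical, and the two scenario strings are taken from the option names shown in the step.

```python
# Mapping from the Application Scenario option to the publishing format,
# as described in step 5. Other scenarios are not covered by this page.
FORMAT_BY_SCENARIO = {
    "Other": "Standard format",
    "STUDIO Model Training": "Pangu format",
}


def format_for_scenario(scenario: str) -> str:
    """Return the publishing format used for a given application scenario."""
    try:
        return FORMAT_BY_SCENARIO[scenario]
    except KeyError:
        raise ValueError(f"Unknown application scenario: {scenario!r}") from None


print(format_for_scenario("STUDIO Model Training"))
```

A dataset published with the Other scenario uses the standard format and is therefore not visible to downstream model development.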