Updated on 2025-08-29 GMT+08:00

Creating a ModelArts Dataset

Before using ModelArts to prepare data, create a dataset. Then, you can perform operations on the dataset, such as importing data, analyzing data, and labeling data.

Datasets are supported only in the following regions: CN North-Beijing4, CN Southwest-Guiyang1, CN-Hong Kong, AP-Singapore, AP-Bangkok, AP-Jakarta, AF-Johannesburg, LA-Santiago, LA-Sao Paulo1, and LA-Mexico City2.

Dataset Types

ModelArts supports the following types of datasets:

  • Images: in .jpg, .png, .jpeg, or .bmp format for image classification, image segmentation, and object detection
  • Audio: in .wav format for sound classification, speech labeling, and speech paragraph labeling
  • Text: in .txt or .csv format for text classification, named entity recognition, and text triplet labeling
  • Video: in .mp4 format for video labeling
  • Free format: allows data in any format. Labeling is not available for free format data. The free format applies if labeling is not required or needs to be customized. If your dataset needs to contain data in multiple formats or your data format does not meet the requirements of other types of datasets, you can select a dataset in free format.
  • Table

    Table: applies to structured data processing such as tables. The file format can be CSV. Tables cannot be labeled but you can preview up to 100 data records in a table.

Dataset Functions

Different types of datasets support different functions, such as auto labeling and team labeling. For details, see Table 1.

Table 1 Functions supported by different types of datasets

Dataset Type

Labeling Type

Creating a Dataset

Importing Data

Exporting Data

Publishing a Dataset

Modifying a dataset

Managing Dataset Versions

Auto labeling

Team labeling

Auto Grouping

Data Feature Engineering

Images

Image classification

Supported

Supported

Supported

Supported

Supported

Supported

Supported

Supported

Supported

Supported

Object detection

Supported

Supported

Supported

Supported

Supported

Supported

Supported

Supported

Supported

Supported

Image segmentation

Supported

Supported

Supported

Supported

Supported

Supported

-

-

Supported

-

Audio

Sound classification

Supported

Supported

-

Supported

Supported

Supported

-

-

-

-

Speech labeling

Supported

Supported

-

Supported

Supported

Supported

-

-

-

-

Speech paragraph labeling

Supported

Supported

-

Supported

Supported

Supported

-

Supported

-

-

Text

Text classification

Supported

Supported

-

Supported

Supported

Supported

-

Supported

-

-

Named entity recognition

Supported

Supported

-

Supported

Supported

Supported

-

Supported

-

-

Text triplet

Supported

Supported

-

Supported

Supported

Supported

-

Supported

-

-

Video

Video

Supported

Supported

-

Supported

Supported

Supported

-

-

-

-

Free format

Free format

Supported

-

_

Supported

Supported

Supported

-

-

-

-

Table

Table

Supported

Supported

-

Supported

Supported

Supported

-

-

-

-

Specifications Restrictions

  • The maximum numbers of samples and labels in a single text, video, or audio database other than a table dataset are 1,000,000 and 10,000, respectively.
  • The maximum size of a sample in a single text, video, or audio database other than an image dataset is 5 GB.
  • The maximum size of an image for object detection, image segmentation, or image classification is 25 MB.
  • The manifest file cannot be larger than 5 GB.
  • The text file in a line cannot be larger than 100 KB.
  • The dataset labeling result file cannot be larger than 100 MB.

Prerequisites

  • You have been authorized to access OBS. To do so, go to the ModelArts management console. In the navigation pane on the left, choose Permission Management, and add access authorization using an agency.
  • OBS buckets and folders for storing data are available. In addition, the OBS buckets and ModelArts are in the same region. OBS parallel file systems are not supported. Select object storage.
  • ModelArts does not support encrypted OBS buckets. When creating an OBS bucket, do not enable bucket encryption.

Creating a Dataset (Image, Audio, Text, Video, and Free Format)

  1. Log in to the ModelArts management console. In the navigation pane on the left, choose Asset Management >Datasets.
  2. Click Create. On the Create Dataset page, create a dataset based on the data type and data labeling requirements. Enter the basic information about the dataset.
    Figure 1 Parameters

    • Name: Enter a custom dataset name.
    • Description: Enter the details about the dataset
    • Data Type: Select a data type based on your needs.
    • Data Source
      1. Importing data from OBS

        If you have prepared data on OBS, set Data Source to OBS and configure Import Path, Labeling Status, and Output Dataset Path. If Labeling Status is set to Labeled, you need to also configure Labeling Format. The labeling formats of the input data vary depending on the dataset type. For details about the labeling formats supported by ModelArts, see Dataset Functions.

      2. Importing data from a local path

        ModelArts also allows you to upload data from a local path. To do so, set Data Source to Local file, upload data, and configure Labeling Status and Output Dataset Path. Click Upload data to select the local file for uploading. Select a labeling format when the labeling status is Labeled. The labeling formats of the input data vary depending on the dataset type. For details about the labeling formats supported by ModelArts, see Dataset Functions.

        Figure 2 Selecting Local file

    • For details about parameters, see Table 2.
      Table 2 Dataset parameters

      Parameter

      Description

      Import Path

      OBS path from which your data is to be imported. This path is used as the data storage path of the dataset.

      NOTE:

      OBS parallel file systems are not supported. Select an OBS bucket.

      When you create a dataset, data in the OBS path will be imported to the dataset. If you modify data in OBS, the data in the dataset will be inconsistent with that in OBS. As a result, certain data may be unavailable. If you need to modify data in a dataset, see Import Mode or Importing Data from an OBS Path to ModelArts.

      If the numbers of samples and labels of the dataset exceed quotas, importing the samples and labels will fail.

      Labeling Status

      Labeling status of the selected data, which can be Unlabeled or Labeled.

      If you select Labeled, specify a labeling format and ensure the data file complies with format specifications. Otherwise, the import may fail.

      Only image (object detection, image classification, and image segmentation), audio (sound classification), and text (text classification) labeling tasks support the import of labeled data.

      Output Dataset Path

      OBS path where your labeled data is stored.

      NOTE:
      • Ensure that your OBS path name contains letters, digits, and underscores (_) and does not contain special characters, such as ~'@#$%^&*{}[]:;+=<>/ and spaces.
      • The dataset output path cannot be the same as the data input path or subdirectory of the data input path.
      • It is a good practice to select an empty directory as the dataset output path.
      • OBS parallel file systems are not supported. Select an OBS bucket.

      Advanced Feature Settings - Import by Tag

      This function is disabled by default. You can enable it to import resources by tag.

      Import by Tag enables the system to automatically obtain the labels of the current dataset. Click Add Label to add a label. This field is optional. After importing the data, you can add or delete labels during data labeling.

  3. After setting the parameters, click Submit.

Creating a Dataset (Table)

  1. Log in to the ModelArts management console. In the navigation pane on the left, choose Asset Management > Datasets.
  2. Click Create. On the Create Dataset page, create a dataset based on the data type and data labeling requirements. Enter the basic information about the dataset.
    Figure 3 Parameters of a table dataset

    • Name: Enter a custom dataset name.
    • Description: Enter the details about the dataset
    • Data Type: Select a data type based on your needs.
    • For more details about parameters, see Table 3.
      Table 3 Dataset parameters

      Parameter

      Description

      Data Source (OBS)

      • File Path: Browse all OBS buckets of the account and select the directory where the data file to be imported is located.
      • Contain Table Header: This setting is enabled by default, indicating that the imported file contains table headers.
        • If the original table contains table headers and this setting is enabled, first rows (table header) of the imported file are used as column names. You do not need to modify the schema information.
        • If the original table does not contain table headers, you need to disable this setting and change column names in Schema to attr_1, attr_2, ..., and attr_n. attr_n is the last column, indicating the prediction column.

      For details about OBS functions, see Object Storage Service Console Operation Guide.

      Data Source (MRS)

      • Cluster Name: All MRS clusters of the current account are automatically displayed. However, streaming clusters do not support data import. Select the required cluster from the drop-down list.
      • File Path: Enter the HDFS file path based on the selected cluster.
      • Contain Table Header: If this setting is enabled, the imported file contains table headers.

      For details about MRS functions, see MapReduce Service User Guide.

      Local file

      Storage Path: Select an OBS path.

      Schema

      Names and types of table columns, which must be the same as those of the imported data. Set the column name based on the imported data and select the column type. For details about the supported types, see Table 4.

      Click Add Schema to add a new record. When creating a dataset, you must specify a schema. Once created, the schema cannot be modified.

      When data is imported from OBS, the schema of the CSV file in the file path is automatically obtained. If the schemas of multiple CSV files are inconsistent, an error will be reported.

      NOTE:

      After you select data from OBS, column names in Schema are automatically displayed, which is the first-row data of the table by default. To ensure the correct prediction code, you need to change column names in Schema to attr_1, attr_2, ..., and attr_n. attr_n is the last column, indicating the prediction column.

      Output Dataset Path

      OBS path for storing table data. The data imported from the data source is stored in this path. The path cannot be the same as the file path in the OBS data source or subdirectories of the file path.

      After a table dataset is created, the following four directories are automatically generated in the storage path:

      • annotation: version publishing directory. Each time a version is published, a subdirectory with the same name as the version is generated in this directory.
      • data: data storage directory. Imported data is stored in this directory.
      • logs: directory for storing logs.
      • temp: temporary working directory.
      Table 4 Schema data types

      Type

      Description

      Storage Space

      Range

      String

      String type

      -

      -

      Short

      Signed integer

      2 bytes

      -32768-32767

      Int

      Signed integer

      4 bytes

      -2147483648 to 2147483647

      Long

      Signed integer

      8 bytes

      -9223372036854775808 to 9223372036854775807

      Double

      Double-precision floating point

      8 bytes

      -

      Float

      Single-precision floating point

      4 bytes

      -

      Byte

      Signed integer

      1 byte

      -128-127

      Date

      Date type in the format of "yyyy-MM-dd", for example, 2014-05-29

      -

      -

      Timestamp

      Timestamp that represents date and time in the format of "yyyy-MM-dd HH:mm:ss"

      -

      -

      Boolean

      Boolean type

      1 byte

      TRUE/FALSE

      When using a CSV file, pay attention to the following:

      • When the data type is set to String, the data in the double quotation marks is regarded as one record by default. Ensure the double quotation marks in the same row are closed. Otherwise, the data will be too large to display.
      • If the number of columns in a row of the CSV file is different from that defined in the schema, the row will be ignored.
  3. After setting the parameters, click Submit.

Modifying the Basic Information of a Dataset

  1. Log in to the ModelArts management console. In the navigation pane on the left, choose Asset Management >Datasets.
  2. In the dataset list, locate the target dataset, and choose More > Modify in the Operation column. Modify the basic information and click OK.
    Table 5 Parameters

    Parameter

    Description

    Name

    Name of a dataset, which must contain 1 to 64 characters long and start with a letter. Only letters, digits, underscores (_), and hyphens (-) are allowed. The name must start with a letter.

    Description

    Brief description of the dataset.