Creating a ModelArts Dataset
Before using ModelArts to prepare data, create a dataset. Then, you can perform operations on the dataset, such as importing data, analyzing data, and labeling data.

Datasets are supported only in the following regions: CN North-Beijing4, CN Southwest-Guiyang1, CN-Hong Kong, AP-Singapore, AP-Bangkok, AP-Jakarta, AF-Johannesburg, LA-Santiago, LA-Sao Paulo1, and LA-Mexico City2.
Dataset Types
ModelArts supports the following types of datasets:
- Images: in .jpg, .png, .jpeg, or .bmp format for image classification, image segmentation, and object detection
- Audio: in .wav format for sound classification, speech labeling, and speech paragraph labeling
- Text: in .txt or .csv format for text classification, named entity recognition, and text triplet labeling
- Video: in .mp4 format for video labeling
- Free format: allows data in any format. Labeling is not available for free format data. The free format applies if labeling is not required or needs to be customized. If your dataset needs to contain data in multiple formats or your data format does not meet the requirements of other types of datasets, you can select a dataset in free format.
Dataset Functions
Different types of datasets support different functions, such as auto labeling and team labeling. For details, see Table 1.
Dataset Type |
Labeling Type |
Creating a Dataset |
Importing Data |
Exporting Data |
Publishing a Dataset |
Modifying a dataset |
Managing Dataset Versions |
Auto labeling |
Team labeling |
Auto Grouping |
Data Feature Engineering |
---|---|---|---|---|---|---|---|---|---|---|---|
Images |
Image classification |
Supported |
Supported |
Supported |
Supported |
Supported |
Supported |
Supported |
Supported |
Supported |
Supported |
Object detection |
Supported |
Supported |
Supported |
Supported |
Supported |
Supported |
Supported |
Supported |
Supported |
Supported |
|
Image segmentation |
Supported |
Supported |
Supported |
Supported |
Supported |
Supported |
- |
- |
Supported |
- |
|
Audio |
Sound classification |
Supported |
Supported |
- |
Supported |
Supported |
Supported |
- |
- |
- |
- |
Speech labeling |
Supported |
Supported |
- |
Supported |
Supported |
Supported |
- |
- |
- |
- |
|
Speech paragraph labeling |
Supported |
Supported |
- |
Supported |
Supported |
Supported |
- |
Supported |
- |
- |
|
Text |
Text classification |
Supported |
Supported |
- |
Supported |
Supported |
Supported |
- |
Supported |
- |
- |
Named entity recognition |
Supported |
Supported |
- |
Supported |
Supported |
Supported |
- |
Supported |
- |
- |
|
Text triplet |
Supported |
Supported |
- |
Supported |
Supported |
Supported |
- |
Supported |
- |
- |
|
Video |
Video |
Supported |
Supported |
- |
Supported |
Supported |
Supported |
- |
- |
- |
- |
Free format |
Free format |
Supported |
- |
_ |
Supported |
Supported |
Supported |
- |
- |
- |
- |
Table |
Table |
Supported |
Supported |
- |
Supported |
Supported |
Supported |
- |
- |
- |
- |
Specifications Restrictions
- The maximum numbers of samples and labels in a single text, video, or audio database other than a table dataset are 1,000,000 and 10,000, respectively.
- The maximum size of a sample in a single text, video, or audio database other than an image dataset is 5 GB.
- The maximum size of an image for object detection, image segmentation, or image classification is 25 MB.
- The manifest file cannot be larger than 5 GB.
- The text file in a line cannot be larger than 100 KB.
- The dataset labeling result file cannot be larger than 100 MB.
Prerequisites
- You have been authorized to access OBS. To do so, go to the ModelArts management console. In the navigation pane on the left, choose Permission Management, and add access authorization using an agency.
- OBS buckets and folders for storing data are available. In addition, the OBS buckets and ModelArts are in the same region. OBS parallel file systems are not supported. Select object storage.
- ModelArts does not support encrypted OBS buckets. When creating an OBS bucket, do not enable bucket encryption.
Creating a Dataset (Image, Audio, Text, Video, and Free Format)
- Log in to the ModelArts management console. In the navigation pane on the left, choose Asset Management >Datasets.
- Click Create. On the Create Dataset page, create a dataset based on the data type and data labeling requirements. Enter the basic information about the dataset.
Figure 1 Parameters
- Name: Enter a custom dataset name.
- Description: Enter the details about the dataset
- Data Type: Select a data type based on your needs.
- Data Source
- Importing data from OBS
If you have prepared data on OBS, set Data Source to OBS and configure Import Path, Labeling Status, and Output Dataset Path. If Labeling Status is set to Labeled, you need to also configure Labeling Format. The labeling formats of the input data vary depending on the dataset type. For details about the labeling formats supported by ModelArts, see Dataset Functions.
- Importing data from a local path
ModelArts also allows you to upload data from a local path. To do so, set Data Source to Local file, upload data, and configure Labeling Status and Output Dataset Path. Click Upload data to select the local file for uploading. Select a labeling format when the labeling status is Labeled. The labeling formats of the input data vary depending on the dataset type. For details about the labeling formats supported by ModelArts, see Dataset Functions.
Figure 2 Selecting Local file
- Importing data from OBS
- For details about parameters, see Table 2.
Table 2 Dataset parameters Parameter
Description
Import Path
OBS path from which your data is to be imported. This path is used as the data storage path of the dataset.
NOTE:OBS parallel file systems are not supported. Select an OBS bucket.
When you create a dataset, data in the OBS path will be imported to the dataset. If you modify data in OBS, the data in the dataset will be inconsistent with that in OBS. As a result, certain data may be unavailable. If you need to modify data in a dataset, see Import Mode or Importing Data from an OBS Path to ModelArts.
If the numbers of samples and labels of the dataset exceed quotas, importing the samples and labels will fail.
Labeling Status
Labeling status of the selected data, which can be Unlabeled or Labeled.
If you select Labeled, specify a labeling format and ensure the data file complies with format specifications. Otherwise, the import may fail.
Only image (object detection, image classification, and image segmentation), audio (sound classification), and text (text classification) labeling tasks support the import of labeled data.
Output Dataset Path
OBS path where your labeled data is stored.
NOTE:- Ensure that your OBS path name contains letters, digits, and underscores (_) and does not contain special characters, such as ~'@#$%^&*{}[]:;+=<>/ and spaces.
- The dataset output path cannot be the same as the data input path or subdirectory of the data input path.
- It is a good practice to select an empty directory as the dataset output path.
- OBS parallel file systems are not supported. Select an OBS bucket.
Advanced Feature Settings - Import by Tag
This function is disabled by default. You can enable it to import resources by tag.
Import by Tag enables the system to automatically obtain the labels of the current dataset. Click Add Label to add a label. This field is optional. After importing the data, you can add or delete labels during data labeling.
- After setting the parameters, click Submit.
Creating a Dataset (Table)
- Log in to the ModelArts management console. In the navigation pane on the left, choose Asset Management > Datasets.
- Click Create. On the Create Dataset page, create a dataset based on the data type and data labeling requirements. Enter the basic information about the dataset.
Figure 3 Parameters of a table dataset
- Name: Enter a custom dataset name.
- Description: Enter the details about the dataset
- Data Type: Select a data type based on your needs.
- For more details about parameters, see Table 3.
Table 3 Dataset parameters Parameter
Description
Data Source (OBS)
- File Path: Browse all OBS buckets of the account and select the directory where the data file to be imported is located.
- Contain Table Header: This setting is enabled by default, indicating that the imported file contains table headers.
- If the original table contains table headers and this setting is enabled, first rows (table header) of the imported file are used as column names. You do not need to modify the schema information.
- If the original table does not contain table headers, you need to disable this setting and change column names in Schema to attr_1, attr_2, ..., and attr_n. attr_n is the last column, indicating the prediction column.
For details about OBS functions, see Object Storage Service Console Operation Guide.
Data Source (MRS)
- Cluster Name: All MRS clusters of the current account are automatically displayed. However, streaming clusters do not support data import. Select the required cluster from the drop-down list.
- File Path: Enter the HDFS file path based on the selected cluster.
- Contain Table Header: If this setting is enabled, the imported file contains table headers.
For details about MRS functions, see MapReduce Service User Guide.
Local file
Storage Path: Select an OBS path.
Schema
Names and types of table columns, which must be the same as those of the imported data. Set the column name based on the imported data and select the column type. For details about the supported types, see Table 4.
Click Add Schema to add a new record. When creating a dataset, you must specify a schema. Once created, the schema cannot be modified.
When data is imported from OBS, the schema of the CSV file in the file path is automatically obtained. If the schemas of multiple CSV files are inconsistent, an error will be reported.
NOTE:After you select data from OBS, column names in Schema are automatically displayed, which is the first-row data of the table by default. To ensure the correct prediction code, you need to change column names in Schema to attr_1, attr_2, ..., and attr_n. attr_n is the last column, indicating the prediction column.
Output Dataset Path
OBS path for storing table data. The data imported from the data source is stored in this path. The path cannot be the same as the file path in the OBS data source or subdirectories of the file path.
After a table dataset is created, the following four directories are automatically generated in the storage path:
- annotation: version publishing directory. Each time a version is published, a subdirectory with the same name as the version is generated in this directory.
- data: data storage directory. Imported data is stored in this directory.
- logs: directory for storing logs.
- temp: temporary working directory.
Table 4 Schema data types Type
Description
Storage Space
Range
String
String type
-
-
Short
Signed integer
2 bytes
-32768-32767
Int
Signed integer
4 bytes
-2147483648 to 2147483647
Long
Signed integer
8 bytes
-9223372036854775808 to 9223372036854775807
Double
Double-precision floating point
8 bytes
-
Float
Single-precision floating point
4 bytes
-
Byte
Signed integer
1 byte
-128-127
Date
Date type in the format of "yyyy-MM-dd", for example, 2014-05-29
-
-
Timestamp
Timestamp that represents date and time in the format of "yyyy-MM-dd HH:mm:ss"
-
-
Boolean
Boolean type
1 byte
TRUE/FALSE
When using a CSV file, pay attention to the following:
- When the data type is set to String, the data in the double quotation marks is regarded as one record by default. Ensure the double quotation marks in the same row are closed. Otherwise, the data will be too large to display.
- If the number of columns in a row of the CSV file is different from that defined in the schema, the row will be ignored.
- After setting the parameters, click Submit.
Modifying the Basic Information of a Dataset
- Log in to the ModelArts management console. In the navigation pane on the left, choose Asset Management >Datasets.
- In the dataset list, locate the target dataset, and choose More > Modify in the Operation column. Modify the basic information and click OK.
Table 5 Parameters Parameter
Description
Name
Name of a dataset, which must contain 1 to 64 characters long and start with a letter. Only letters, digits, underscores (_), and hyphens (-) are allowed. The name must start with a letter.
Description
Brief description of the dataset.
Feedback
Was this page helpful?
Provide feedbackThank you very much for your feedback. We will continue working to improve the documentation.See the reply and handling status in My Cloud VOC.
For any further questions, feel free to contact us through the chatbot.
Chatbot