Creating a Dataset
To manage data using ModelArts, you need to create a dataset first. Then you can perform operations on the dataset, such as labeling data, importing data, and publishing the dataset.
Prerequisites
- You have permission to access OBS. The data management function cannot be used without this authorization. To grant it, go to the Settings page and complete access authorization using an agency.
- You have created OBS buckets and folders for storing data. In addition, the OBS buckets and ModelArts are in the same region.
- You have uploaded the data you want to use to OBS. For details, see How Do I Upload Data to OBS?.
Procedure
- Log in to the ModelArts management console. In the left navigation pane, choose Data Management > Datasets. The Datasets page is displayed.
- Click Create Dataset. On the Create Dataset page, create datasets of different types based on the data type and data labeling requirements.
- Enter the basic information: the name and description of the dataset.
Figure 1 Basic information about a dataset
- Select a labeling scene and type as required. For details about the types supported by ModelArts, see Dataset Types.
Figure 2 Selecting a labeling scene and type
- Set the parameters based on the dataset type. For details, see the parameter description of the following dataset types:
- Click Create in the lower right corner of the page.
After the dataset is created, the dataset management page is displayed. You can perform the following operations on the dataset: label data, publish dataset versions, manage dataset versions, modify the dataset, import data, and delete the dataset. For details about the operations supported by different types of datasets, see Functions Supported by Different Types of Datasets.
Images (Image Classification and Object Detection)
| Parameter | Description |
|---|---|
| Input Dataset Path | Select the OBS path to the input dataset. |
| Output Dataset Path | Select the OBS path to the output dataset. NOTE: The output dataset path cannot be the same as the input dataset path or a subdirectory of the input dataset path. Select an empty directory as the Output Dataset Path. |
| Label Set | |
| Team Labeling | Enable or disable team labeling. To enable team labeling, enter the name and type of the team labeling task, and select the labeling team and team members. For details about the parameter settings, see Creating Team Labeling Tasks. Before enabling team labeling, ensure that you have added a team and members on the Labeling Teams page. If no labeling team is available, click the link on the page to go to the Labeling Teams page, and add your team and members. For details, see Team Labeling Overview. After a dataset is created with team labeling enabled, you can view the Team Labeling mark in Labeling Type. |
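The output path rule in the table above (not the same as the input path, and not a subdirectory of it) can be checked before you fill in the form. The following is a minimal sketch in plain Python, not a ModelArts or OBS API: the function names and the example paths are hypothetical, and it only compares path strings. It does not verify that the output directory is empty.

```python
def normalize(path: str) -> str:
    """Drop an optional obs:// scheme and surrounding slashes, then add one trailing slash."""
    if path.startswith("obs://"):
        path = path[len("obs://"):]
    return path.strip("/") + "/"


def is_valid_output_path(input_path: str, output_path: str) -> bool:
    """Return True if output_path is neither input_path nor one of its subdirectories."""
    src = normalize(input_path)
    dst = normalize(output_path)
    # The trailing slash keeps "bucket/data" from matching "bucket/data-v2".
    return not dst.startswith(src)


# Hypothetical paths, for illustration only.
print(is_valid_output_path("/my-bucket/data/input", "/my-bucket/data/output"))
print(is_valid_output_path("/my-bucket/data/input", "/my-bucket/data/input/out"))
```

The same check applies to every dataset type on this page that has both an Input Dataset Path and an Output Dataset Path.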
Audio (Sound Classification, Speech Labeling, and Speech Paragraph Labeling)
| Parameter | Description |
|---|---|
| Input Dataset Path | Select the OBS path to the input dataset. |
| Output Dataset Path | Select the OBS path to the output dataset. NOTE: The output dataset path cannot be the same as the input dataset path or a subdirectory of the input dataset path. Select an empty directory as the Output Dataset Path. |
| Label Set (Sound Classification) | You need to set labels only for datasets of the sound classification type. |
| Label Management (Speech Paragraph Labeling) | Only datasets for speech paragraph labeling support multiple labels. |
| Speech Labeling (Speech Paragraph Labeling) | This function is available only for datasets for speech paragraph labeling. It is disabled by default. If it is enabled, you can label speech content. |
Text (Text Classification, Named Entity Recognition, and Text Triplet)
| Parameter | Description |
|---|---|
| Input Dataset Path | Select the OBS path to the input dataset. |
| Output Dataset Path | Select the OBS path to the output dataset. NOTE: The output dataset path cannot be the same as the input dataset path or a subdirectory of the input dataset path. Select an empty directory as the Output Dataset Path. |
| Label Set (for text classification and named entity recognition) | |
| Label Set (for text triplet) | For datasets of the text triplet type, you need to set entity labels and relationship labels. |
| Team Labeling | Enable or disable team labeling. To enable team labeling, enter the name and type of the team labeling task, and select the labeling team and team members. For details about the parameter settings, see Creating Team Labeling Tasks. Before enabling team labeling, ensure that you have added a team and members on the Labeling Teams page. If no labeling team is available, click the link on the page to go to the Labeling Teams page, and add your team and members. For details, see Team Labeling Overview. After a dataset is created with team labeling enabled, you can view the Team Labeling mark in Labeling Type. |
Table
| Parameter | Description |
|---|---|
| Storage Path | Select the OBS path for storing table data. The data imported from the data source is stored in this path. The path cannot be the same as or a subdirectory of the file path in the OBS data source. After a table dataset is created, four directories are automatically generated in the storage path. |
| Import | If you have stored table data on other cloud services, you can enable this function to import data stored on OBS, DWS, DLI, or MRS. |
| Data Source (OBS) | For details about OBS functions, see the Object Storage Service Console Operation Guide. |
| Data Source (DWS) | For details about DWS functions, see the Data Warehouse Service User Guide. NOTE: To import data from DWS, you need to use DLI functions. If you do not have the permission to access DLI, create a DLI agency as prompted. |
| Data Source (DLI) | For details about DLI functions, see the Data Lake Insight User Guide. |
| Data Source (MRS) | For details about MRS functions, see the MapReduce Service User Guide. |
| Schema | Names and types of table columns, which must be the same as those of the imported data. Set each column name based on the imported data and select the column type. For details about the supported types, see Table 4. Click Add Schema to add a new record. You must specify a schema when creating a dataset, and the schema cannot be modified after the dataset is created. When data is imported from OBS, the schema of the CSV file in the file path is automatically obtained. If the schemas of multiple CSV files are inconsistent, an error is reported. |
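Because inconsistent schemas across multiple CSV files cause an error at import time, it can help to compare the files locally before uploading them. The sketch below is an illustration under stated assumptions, not how ModelArts obtains the schema: it assumes each CSV file has a header row and treats that row as the schema, and the function names and sample contents are hypothetical.

```python
import csv
import io


def header_of(csv_text: str) -> list[str]:
    """Return the first row of a CSV document as a list of column names."""
    reader = csv.reader(io.StringIO(csv_text))
    return next(reader, [])


def check_consistent_headers(files: dict[str, str]) -> list[str]:
    """Return the shared header, or raise ValueError if any file differs from the first."""
    expected = None
    for name, text in files.items():
        header = header_of(text)
        if expected is None:
            expected = header
        elif header != expected:
            raise ValueError(f"{name}: header {header} differs from {expected}")
    return expected or []


# Hypothetical file contents, keyed by file name.
samples = {
    "part-1.csv": "id,name,created\n1,alice,2014-05-29\n",
    "part-2.csv": "id,name,created\n2,bob,2014-05-30\n",
}
print(check_consistent_headers(samples))
```

This compares column names and their order only. Column types still need to be selected in the Schema parameter.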

| Type | Description | Storage Space | Range |
|---|---|---|---|
| String | String | - | - |
| Short | Signed integer | 2 bytes | -32768 to 32767 |
| Int | Signed integer | 4 bytes | -2147483648 to 2147483647 |
| Long | Signed integer | 8 bytes | -9223372036854775808 to 9223372036854775807 |
| Double | Double-precision floating point | 8 bytes | - |
| Float | Single-precision floating point | 4 bytes | - |
| Byte | Signed integer | 1 byte | -128 to 127 |
| Date | Date type in the format of yyyy-MM-dd, for example, 2014-05-29 | - | - |
| Timestamp | Timestamp that represents date and time. Format: yyyy-MM-dd HH:mm:ss | - | - |
| Boolean | Boolean | 1 byte | TRUE/FALSE |
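The integer ranges in the table follow from the storage size: an n-byte signed integer covers -2^(8n-1) to 2^(8n-1) - 1. The sketch below, written in plain Python with hypothetical helper names, shows one way to check a value against a column type before import and to parse the Date and Timestamp formats. It assumes the patterns yyyy-MM-dd and yyyy-MM-dd HH:mm:ss correspond to the strptime patterns %Y-%m-%d and %Y-%m-%d %H:%M:%S.

```python
from datetime import date, datetime

# Inclusive ranges of the signed integer types, derived from their storage size.
INTEGER_RANGES = {
    "Byte": (-2**7, 2**7 - 1),
    "Short": (-2**15, 2**15 - 1),
    "Int": (-2**31, 2**31 - 1),
    "Long": (-2**63, 2**63 - 1),
}


def fits(type_name: str, value: int) -> bool:
    """Return True if value lies within the range of the given integer type."""
    low, high = INTEGER_RANGES[type_name]
    return low <= value <= high


def parse_date(text: str) -> date:
    """Parse a Date value such as 2014-05-29; raises ValueError on a mismatch."""
    return datetime.strptime(text, "%Y-%m-%d").date()


def parse_timestamp(text: str) -> datetime:
    """Parse a Timestamp value such as 2014-05-29 13:45:00; raises ValueError on a mismatch."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


print(fits("Short", 40000))   # 40000 is outside the Short range, so Int is the smallest fit
print(parse_date("2014-05-29"))
```

A check like this is useful when choosing the column type in the schema, because the schema cannot be modified after the dataset is created.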
Video
| Parameter | Description |
|---|---|
| Input Dataset Path | Select the OBS path to the input dataset. |
| Output Dataset Path | Select the OBS path to the output dataset. NOTE: The output dataset path cannot be the same as the input dataset path or a subdirectory of the input dataset path. Select an empty directory as the Output Dataset Path. |
| Label Set | |
Other (Free Format)
| Parameter | Description |
|---|---|
| Input Dataset Path | Select the OBS path to the input dataset. |
| Output Dataset Path | Select the OBS path to the output dataset. NOTE: The output dataset path cannot be the same as the input dataset path or a subdirectory of the input dataset path. Select an empty directory as the Output Dataset Path. |