Creating a ModelArts Dataset

Before using ModelArts to prepare data, create a dataset. Then, you can perform operations on the dataset, such as importing data, analyzing data, and labeling data.

Datasets are supported only in the following regions: CN North-Beijing4, CN Southwest-Guiyang1, CN-Hong Kong, AP-Singapore, AP-Bangkok, AP-Jakarta, AF-Johannesburg, LA-Santiago, LA-Sao Paulo1, and LA-Mexico City2.

Dataset Types

ModelArts supports the following types of datasets:

Images: in .jpg, .png, .jpeg, or .bmp format for image classification, image segmentation, and object detection
Audio: in .wav format for sound classification, speech labeling, and speech paragraph labeling
Text: in .txt or .csv format for text classification, named entity recognition, and text triplet labeling
Video: in .mp4 format for video labeling
Free format: allows data in any format. Labeling is not available for free format data. The free format applies if labeling is not required or needs to be customized. If your dataset needs to contain data in multiple formats or your data format does not meet the requirements of other types of datasets, you can select a dataset in free format.

Table: applies to structured data processing such as tables. The file format can be CSV. Tables cannot be labeled but you can preview up to 100 data records in a table.
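
As a rough local aid, the extension rules listed above can be written as a lookup table. This is a minimal sketch, not part of ModelArts: the function name `guess_dataset_type` is hypothetical, and because .csv files can belong to either a text or a table dataset, the sketch arbitrarily reports them as Text.

```python
from pathlib import PurePosixPath

# Extension-to-type mapping taken from the dataset type list above.
# Assumption: .csv is reported as "Text", although a table dataset also accepts CSV.
EXTENSION_TO_TYPE = {
    ".jpg": "Images", ".jpeg": "Images", ".png": "Images", ".bmp": "Images",
    ".wav": "Audio",
    ".txt": "Text", ".csv": "Text",
    ".mp4": "Video",
}


def guess_dataset_type(object_key: str) -> str:
    """Return the dataset type suggested by a file's extension (hypothetical helper)."""
    suffix = PurePosixPath(object_key).suffix.lower()
    # Anything that matches no listed format falls back to the free format type.
    return EXTENSION_TO_TYPE.get(suffix, "Free format")
```

For example, `guess_dataset_type("photos/cat.JPG")` yields `"Images"`, while a file with an unlisted extension falls back to `"Free format"`.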

Dataset Functions

Different types of datasets support different functions, such as auto labeling and team labeling. For details, see Table 1.

**Table 1** Functions supported by different types of datasets
Dataset Type	Labeling Type	Creating a Dataset	Importing Data	Exporting Data	Publishing a Dataset	Modifying a Dataset	Managing Dataset Versions	Auto Labeling	Team Labeling	Auto Grouping	Data Feature Engineering
Images	Image classification	Supported	Supported	Supported	Supported	Supported	Supported	Supported	Supported	Supported	Supported
	Object detection	Supported	Supported	Supported	Supported	Supported	Supported	Supported	Supported	Supported	Supported
	Image segmentation	Supported	Supported	Supported	Supported	Supported	Supported	-	-	Supported	-
Audio	Sound classification	Supported	Supported	-	Supported	Supported	Supported	-	-	-	-
	Speech labeling	Supported	Supported	-	Supported	Supported	Supported	-	-	-	-
	Speech paragraph labeling	Supported	Supported	-	Supported	Supported	Supported	-	Supported	-	-
Text	Text classification	Supported	Supported	-	Supported	Supported	Supported	-	Supported	-	-
	Named entity recognition	Supported	Supported	-	Supported	Supported	Supported	-	Supported	-	-
	Text triplet	Supported	Supported	-	Supported	Supported	Supported	-	Supported	-	-
Video	Video	Supported	Supported	-	Supported	Supported	Supported	-	-	-	-
Free format	Free format	Supported	-	-	Supported	Supported	Supported	-	-	-	-
Table	Table	Supported	Supported	-	Supported	Supported	Supported	-	-	-	-

Specifications Restrictions

A single text, video, or audio dataset can contain at most 1,000,000 samples and 10,000 labels. This limit does not apply to table datasets.
A single sample in a text, video, or audio dataset cannot be larger than 5 GB. Image datasets have a separate limit.
The maximum size of an image for object detection, image segmentation, or image classification is 25 MB.
The manifest file cannot be larger than 5 GB.
A single line in a text file cannot be larger than 100 KB.
The dataset labeling result file cannot be larger than 100 MB.
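
Some of these limits can be checked locally before you upload data. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the units are taken as binary (1 KB = 1024 bytes), which this page does not specify, and the 100 KB text limit is read as a per-line limit measured on the UTF-8 encoding.

```python
from typing import Iterable, List

# Assumption: binary units. The documentation does not say whether KB/MB/GB are binary or decimal.
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3

IMAGE_MAX_BYTES = 25 * MB        # images for classification, detection, segmentation
SAMPLE_MAX_BYTES = 5 * GB        # samples in non-image datasets
TEXT_LINE_MAX_BYTES = 100 * KB   # one line of a text file


def sample_size_ok(dataset_type: str, size_bytes: int) -> bool:
    """Check one sample against the size limit for its dataset type."""
    limit = IMAGE_MAX_BYTES if dataset_type == "Images" else SAMPLE_MAX_BYTES
    return size_bytes <= limit


def oversized_lines(lines: Iterable[str]) -> List[int]:
    """Return the 1-based numbers of lines whose UTF-8 encoding exceeds the line limit."""
    return [
        number
        for number, line in enumerate(lines, start=1)
        if len(line.encode("utf-8")) > TEXT_LINE_MAX_BYTES
    ]
```

The sample and label count limits are not covered here because they depend on the dataset as a whole rather than on a single file.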

Prerequisites

You have been authorized to access OBS. To do so, go to the ModelArts management console. In the navigation pane on the left, choose Permission Management, and add access authorization using an agency.
OBS buckets and folders for storing data are available. In addition, the OBS buckets and ModelArts are in the same region. OBS parallel file systems are not supported. Select object storage.
ModelArts does not support encrypted OBS buckets. When creating an OBS bucket, do not enable bucket encryption.

Creating a Dataset (Image, Audio, Text, Video, and Free Format)

Click Create. On the Create Dataset page, create a dataset based on the data type and data labeling requirements. Enter the basic information about the dataset.

Figure 1 Parameters

Name: Enter a custom dataset name.
Description: Enter the details about the dataset.
Data Type: Select a data type based on your needs.
Data Source
1. Importing data from OBS
  If you have prepared data on OBS, set Data Source to OBS and configure Import Path, Labeling Status, and Output Dataset Path. If Labeling Status is set to Labeled, you also need to configure Labeling Format. The labeling formats of the input data vary depending on the dataset type. For details about the labeling formats supported by ModelArts, see Dataset Functions.
2. Importing data from a local path
  ModelArts also allows you to upload data from a local path. To do so, set Data Source to Local file, upload data, and configure Labeling Status and Output Dataset Path. Click Upload data to select the local file for uploading. Select a labeling format when the labeling status is Labeled. The labeling formats of the input data vary depending on the dataset type. For details about the labeling formats supported by ModelArts, see Dataset Functions.
  
  Figure 2 Selecting Local file

For details about parameters, see Table 2.

**Table 2** Dataset parameters
Parameter	Description
Import Path	OBS path from which your data is to be imported. This path is used as the data storage path of the dataset. NOTE: OBS parallel file systems are not supported. Select an OBS bucket. When you create a dataset, data in the OBS path will be imported to the dataset. If you modify data in OBS, the data in the dataset will be inconsistent with that in OBS. As a result, certain data may be unavailable. If you need to modify data in a dataset, see Import Mode or Importing Data from an OBS Path to ModelArts. If the numbers of samples and labels of the dataset exceed quotas, importing the samples and labels will fail.
Labeling Status	Labeling status of the selected data, which can be Unlabeled or Labeled. If you select Labeled, specify a labeling format and ensure the data file complies with format specifications. Otherwise, the import may fail. Only image (object detection, image classification, and image segmentation), audio (sound classification), and text (text classification) labeling tasks support the import of labeled data.
Output Dataset Path	OBS path where your labeled data is stored. NOTE: Ensure that your OBS path name contains letters, digits, and underscores (_) and does not contain special characters, such as ~'@#$%^&*{}[]:;+=<>/ and spaces. The dataset output path cannot be the same as the data input path or a subdirectory of the data input path. It is a good practice to select an empty directory as the dataset output path. OBS parallel file systems are not supported. Select an OBS bucket.
Advanced Feature Settings - Import by Tag	This function is disabled by default. You can enable it to import resources by tag. Import by Tag enables the system to automatically obtain the labels of the current dataset. Click Add Label to add a label. This field is optional. After importing the data, you can add or delete labels during data labeling.
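
The constraints on Output Dataset Path can be checked before you submit the form. The sketch below is illustrative only: `output_path_problems` is a hypothetical helper, paths are assumed to look like `/bucket/folder/`, and the slash is treated as the path separator rather than as a forbidden character.

```python
from typing import List

# Special characters named in Table 2, plus the space character.
# Assumption: "/" is left out because it separates path segments.
FORBIDDEN_CHARS = set("~'@#$%^&*{}[]:;+=<> ")


def _normalize(path: str) -> str:
    """Strip surrounding slashes and add one trailing slash so prefixes compare by segment."""
    return path.strip("/") + "/"


def output_path_problems(import_path: str, output_path: str) -> List[str]:
    """Return a list of rule violations; an empty list means no problem was found."""
    problems = []
    src, dst = _normalize(import_path), _normalize(output_path)
    if dst == src:
        problems.append("output path is the same as the import path")
    elif dst.startswith(src):
        problems.append("output path is a subdirectory of the import path")
    bad = sorted(ch for ch in set(output_path) if ch in FORBIDDEN_CHARS)
    if bad:
        problems.append("output path contains forbidden characters: " + "".join(bad))
    return problems
```

The trailing slash added during normalization is what keeps `/bucket/data_out/` from being mistaken for a subdirectory of `/bucket/data/`.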

After setting the parameters, click Submit.

Creating a Dataset (Table)

Click Create. On the Create Dataset page, create a dataset based on the data type and data labeling requirements. Enter the basic information about the dataset.

Figure 3 Parameters of a table dataset

Name: Enter a custom dataset name.
Description: Enter the details about the dataset.
Data Type: Select a data type based on your needs.

For more details about parameters, see Table 3.

**Table 3** Dataset parameters
Parameter	Description
Data Source (OBS)	File Path: Browse all OBS buckets of the account and select the directory where the data file to be imported is located. Contain Table Header: This setting is enabled by default, indicating that the imported file contains table headers. If the original table contains table headers and this setting is enabled, the first row (table header) of the imported file is used as the column names. You do not need to modify the schema information. If the original table does not contain table headers, you need to disable this setting and change column names in Schema to attr_1, attr_2, ..., and attr_n. attr_n is the last column, indicating the prediction column. For details about OBS functions, see Object Storage Service Console Operation Guide.
Data Source (MRS)	Cluster Name: All MRS clusters of the current account are automatically displayed. However, streaming clusters do not support data import. Select the required cluster from the drop-down list. File Path: Enter the HDFS file path based on the selected cluster. Contain Table Header: If this setting is enabled, the imported file contains table headers. For details about MRS functions, see MapReduce Service User Guide.
Local file	Storage Path: Select an OBS path.
Schema	Names and types of table columns, which must be the same as those of the imported data. Set the column name based on the imported data and select the column type. For details about the supported types, see Table 4. Click Add Schema to add a new record. When creating a dataset, you must specify a schema. Once created, the schema cannot be modified. When data is imported from OBS, the schema of the CSV file in the file path is automatically obtained. If the schemas of multiple CSV files are inconsistent, an error will be reported. NOTE: After you select data from OBS, column names in Schema are automatically displayed. By default, they are taken from the first row of the table. To ensure that the prediction code works correctly, you need to change column names in Schema to attr_1, attr_2, ..., and attr_n. attr_n is the last column, indicating the prediction column.
Output Dataset Path	OBS path for storing table data. The data imported from the data source is stored in this path. The path cannot be the same as the file path in the OBS data source or subdirectories of the file path. After a table dataset is created, the following four directories are automatically generated in the storage path: annotation: version publishing directory. Each time a version is published, a subdirectory with the same name as the version is generated in this directory. data: data storage directory. Imported data is stored in this directory. logs: directory for storing logs. temp: temporary working directory.
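
The attr_1, attr_2, ..., attr_n naming convention described for Schema can be generated rather than typed by hand. This is a small sketch with a hypothetical helper name; it only builds the list of names and does not call any ModelArts interface.

```python
from typing import List


def schema_column_names(column_count: int) -> List[str]:
    """Build attr_1 ... attr_n; the last name, attr_n, denotes the prediction column."""
    if column_count < 1:
        raise ValueError("a table needs at least one column")
    return [f"attr_{index}" for index in range(1, column_count + 1)]
```

For a four-column table, the last returned name, `attr_4`, is the one that marks the prediction column.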

**Table 4** Schema data types
Type	Description	Storage Space	Range
String	String type	-	-
Short	Signed integer	2 bytes	-32768 to 32767
Int	Signed integer	4 bytes	-2147483648 to 2147483647
Long	Signed integer	8 bytes	-9223372036854775808 to 9223372036854775807
Double	Double-precision floating point	8 bytes	-
Float	Single-precision floating point	4 bytes	-
Byte	Signed integer	1 byte	-128 to 127
Date	Date type in the format of "yyyy-MM-dd", for example, 2014-05-29	-	-
Timestamp	Timestamp that represents date and time in the format of "yyyy-MM-dd HH:mm:ss"	-	-
Boolean	Boolean type	1 byte	TRUE/FALSE
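
The integer ranges in Table 4 follow from the storage size: a signed integer stored in n bytes covers -2^(8n-1) to 2^(8n-1) - 1. The sketch below derives the ranges and parses the two date formats locally. The helper names are hypothetical, and the `strptime` patterns are the Python equivalents of "yyyy-MM-dd" and "yyyy-MM-dd HH:mm:ss".

```python
from datetime import datetime
from typing import Tuple

# Storage sizes in bytes, as listed in Table 4.
INTEGER_BYTES = {"Byte": 1, "Short": 2, "Int": 4, "Long": 8}

DATE_FORMAT = "%Y-%m-%d"                # yyyy-MM-dd
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # yyyy-MM-dd HH:mm:ss


def integer_range(type_name: str) -> Tuple[int, int]:
    """Return (minimum, maximum) for a signed integer type of the listed size."""
    bits = 8 * INTEGER_BYTES[type_name]
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def fits(type_name: str, value: int) -> bool:
    """Check whether a value lies inside the range of the given integer type."""
    low, high = integer_range(type_name)
    return low <= value <= high


def parse_date(text: str) -> datetime:
    """Parse a Date value; raises ValueError if the text does not match the format."""
    return datetime.strptime(text, DATE_FORMAT)


def parse_timestamp(text: str) -> datetime:
    """Parse a Timestamp value; raises ValueError if the text does not match the format."""
    return datetime.strptime(text, TIMESTAMP_FORMAT)
```

A check such as `fits("Short", 40000)` returns `False`, which suggests that the column needs the Int type instead.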

When using a CSV file, pay attention to the following:

When the data type is set to String, the data in the double quotation marks is regarded as one record by default. Ensure the double quotation marks in the same row are closed. Otherwise, the data will be too large to display.
If the number of columns in a row of the CSV file is different from that defined in the schema, the row will be ignored.
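
To see in advance which rows would be dropped by the column-count rule, you can parse the file locally. The following is a sketch that uses Python's standard `csv` module, which may not match the ModelArts parser in every detail; `split_rows` is a hypothetical helper.

```python
import csv
import io
from typing import List, Tuple

Row = Tuple[int, List[str]]


def split_rows(csv_text: str, expected_columns: int) -> Tuple[List[Row], List[Row]]:
    """Separate records that match the schema width from those that would be ignored.

    Each entry is (record_number, fields). Record numbers are 1-based and count
    parsed records, so a quoted field that spans several lines is one record.
    """
    kept, ignored = [], []
    reader = csv.reader(io.StringIO(csv_text))
    for record_number, fields in enumerate(reader, start=1):
        target = kept if len(fields) == expected_columns else ignored
        target.append((record_number, fields))
    return kept, ignored
```

Because the reader honors double quotation marks, a value such as `"x,y"` counts as one column, which mirrors the note about String data above.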

After setting the parameters, click Submit.

Modifying the Basic Information of a Dataset

In the dataset list, locate the target dataset, and choose More > Modify in the Operation column. Modify the basic information and click OK.

**Table 5** Parameters
Parameter	Description
Name	Name of a dataset, which must be 1 to 64 characters long and start with a letter. Only letters, digits, underscores (_), and hyphens (-) are allowed.
Description	Brief description of the dataset.
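
The naming rule in Table 5 maps directly to a regular expression. This is a minimal sketch, assuming that "letters" means ASCII letters, which this page does not state explicitly; the helper name is hypothetical.

```python
import re

# One leading letter followed by up to 63 letters, digits, underscores, or hyphens.
# Assumption: only ASCII letters count as letters.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_-]{0,63}")


def is_valid_dataset_name(name: str) -> bool:
    """Return True if the name is 1 to 64 characters long and follows the character rules."""
    return NAME_PATTERN.fullmatch(name) is not None
```

Using `fullmatch` rather than `match` ensures that trailing characters outside the allowed set are rejected as well.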