Creating a Dataset

Create a dataset whose data can be imported from OBS.

create_dataset(session, dataset_name=None, data_type=None, data_sources=None, work_path=None, dataset_type=None, **kwargs)

Use either of the following methods to create a dataset:

Create a dataset based on the labeling type. One dataset supports only one labeling task type.

create_dataset(session, dataset_name=None, dataset_type=None, data_sources=None, work_path=None, **kwargs)

Create a dataset based on the data type. You can create different types of labeling tasks on the same dataset. For example, create image classification and object detection labeling tasks on an image dataset.
```
create_dataset(session, dataset_name=None, data_type=None, data_sources=None, work_path=None, **kwargs)
```

You are advised to create a dataset based on the data type. Support for creating a dataset based on the labeling type will be discontinued.
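
The choice between the two methods can be made explicit before the call. The following is a minimal sketch, not part of the SDK: build_create_kwargs is a hypothetical helper that assembles the keyword arguments for Dataset.create_dataset and rejects a call that specifies neither data_type nor dataset_type. Rejecting a call that specifies both is an assumption made here for clarity; this page only states that either must be specified.

```python
def build_create_kwargs(dataset_name, data_sources, work_path,
                        data_type=None, dataset_type=None, **extra):
    """Hypothetical helper: assemble keyword arguments for Dataset.create_dataset."""
    if data_type is None and dataset_type is None:
        raise ValueError("Specify either data_type or dataset_type.")
    if data_type is not None and dataset_type is not None:
        # Assumption: passing both is treated as a mistake in this sketch.
        raise ValueError("Specify only one of data_type and dataset_type.")

    kwargs = {
        "dataset_name": dataset_name,
        "data_sources": data_sources,
        "work_path": work_path,
    }
    if data_type is not None:
        kwargs["data_type"] = data_type        # Recommended method
    else:
        kwargs["dataset_type"] = dataset_type  # Labeling-type method
    kwargs.update(extra)                       # For example, labels or schema
    return kwargs
```

The result can then be passed as Dataset.create_dataset(session, **kwargs).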

Sample Code

Example 1: Create an image dataset based on the data type.

from modelarts.session import Session
from modelarts.dataset import Dataset

session = Session()

dataset_name = "dataset-image"  # Dataset name
data_type = "IMAGE"             # Dataset type, which is an image dataset
data_sources = dict()           # Dataset data source
data_sources["type"] = 0        # Data source type. Value 0 indicates OBS.
data_sources["path"] = "/obs-gaia-test/data/image/image-classification/" # Path for storing data in OBS
work_path = dict()              # Work directory of the dataset
work_path['type'] = 0           # Working directory type of the dataset. Value 0 indicates OBS.
work_path['path'] = "/obs-gaia-test/data/output/work_path/"  # Path for the working directory of the dataset in OBS
create_dataset_resp = Dataset.create_dataset(session, dataset_name=dataset_name, data_type=data_type,
                                             data_sources=data_sources, work_path=work_path)

Example 2: Create an image dataset based on the data type, with labels imported.

from modelarts.session import Session
from modelarts.dataset import Dataset

session = Session()

dataset_name = "dataset-image-with-annotations"
data_type = "IMAGE"
data_sources = dict()
data_sources["type"] = 0
data_sources["path"] = "/obs-gaia-test/data/image/image-classification/"
annotation_config = dict()      # Labeling format of the source data
annotation_config['scene'] = "image_classification" # Image classification labeling
annotation_config['format_name'] = "ModelArts image classification 1.0" # Labeling format of ModelArts image classification 1.0
data_sources['annotation_config'] = annotation_config
work_path = dict()
work_path['type'] = 0
work_path['path'] = "/obs-gaia-test/data/output/work_path/"
create_dataset_resp = Dataset.create_dataset(session, dataset_name=dataset_name, data_type=data_type,
                                             data_sources=data_sources, work_path=work_path)

Example 3: Create a table dataset based on the data type.

from modelarts.session import Session
from modelarts.dataset import Dataset

session = Session()

dataset_name = "dataset-table"
data_type = "TABLE"
data_sources = dict()
data_sources["type"] = 0
data_sources["path"] = "/obs-gaia-test/data/table/table0/"
data_sources['with_column_header'] = True
work_path = dict()
work_path['type'] = 0
work_path['path'] = "/obs-gaia-test/data/output/work_path/"
# Schema information of the table data needs to be specified for the table dataset.
schema0 = dict()
schema0['schema_id'] = 0
schema0['name'] = "name"
schema0['type'] = "STRING"
schema1 = dict()
schema1['schema_id'] = 1
schema1['name'] = "age"
schema1['type'] = "STRING"
schema2 = dict()
schema2['schema_id'] = 2
schema2['name'] = "label"
schema2['type'] = "STRING"
schemas = []
schemas.append(schema0)
schemas.append(schema1)
schemas.append(schema2)
create_dataset_resp = Dataset.create_dataset(session, dataset_name=dataset_name, data_type=data_type,
                                             data_sources=data_sources, work_path=work_path, schema=schemas)

Example 4: Create an image classification dataset based on the labeling type.

from modelarts.session import Session
from modelarts.dataset import Dataset

session = Session()

dataset_name = "dataset-image-classification"
dataset_type = 0   # Dataset labeling type. Value 0 indicates image classification.
data_sources = dict()
data_sources["path"] = "/obs-gaia-test/data/image/image-classification/"
data_sources["type"] = 0   # Data source type (integer). Value 0 indicates OBS.
work_path = dict()
work_path['type'] = 0
work_path['path'] = "/obs-gaia-test/data/output/work_path/"
create_dataset_resp = Dataset.create_dataset(session, dataset_name=dataset_name, dataset_type=dataset_type, data_sources=data_sources, work_path=work_path)

Example 5: Create a text triplet dataset based on the labeling type.

from modelarts.session import Session
from modelarts.dataset import Dataset

session = Session()

dataset_name = "dataset-text-triplet"
dataset_type = 102   # Dataset labeling type. Value 102 indicates text triplet.
data_sources = dict()
data_sources['type'] = 0
data_sources['path'] = "/obs-gaia-test/data/text/text-classification/"
work_path = dict()
work_path['type'] = 0
work_path['path'] = "/obs-gaia-test/data/output/work_path/"

# Create a dataset of the text triplet labeling type with labels imported.
label_entity1 = dict()    # Label object
label_entity1['name'] = "Disease"    # Label name
label_entity1['type'] = 101     # Label type. Value 101 indicates an entity.
label_entity2 = dict()
label_entity2['name'] = "Disease alias"
label_entity2['type'] = 101
label_relation1 = dict()
label_relation1['name'] = "Also called"
label_relation1['type'] = 102    # Label type. Value 102 indicates a relationship.
# For a relationship label, the start entity label and end entity label must be specified in the label properties.
label_property = dict()    # Named label_property to avoid shadowing the Python built-in property
label_property['@modelarts:from_type'] = "Disease"    # Start entity label
label_property['@modelarts:to_type'] = "Disease alias"    # End entity label
label_relation1['property'] = label_property
labels = []
labels.append(label_entity1)
labels.append(label_entity2)
labels.append(label_relation1)
create_dataset_resp = Dataset.create_dataset(session, dataset_name=dataset_name, dataset_type=dataset_type, data_sources=data_sources, work_path=work_path, labels=labels)

Example 6: Create a table dataset based on the labeling type.

from modelarts.session import Session
from modelarts.dataset import Dataset

session = Session()

dataset_name = "dataset-table"
dataset_type = 400    # Dataset labeling type. Value 400 indicates a table dataset.
data_sources = dict()
data_sources['type'] = 0
data_sources['path'] = "/obs-gaia-test/data/table/table0/"
data_sources['with_column_header'] = True    # Whether the table data contains a table header
work_path = dict()
work_path['type'] = 0
work_path['path'] = "/obs-gaia-test/data/output/work_path/"

# The table header (schema) of the table data must be specified for a table dataset.
schema0 = dict()    # Table header
schema0['schema_id'] = 0    # Header of the first column
schema0['name'] = "name"    # Column name. The first column is named "name".
schema0['type'] = "STRING"    # Data type of the column, indicating a character string
schema1 = dict()
schema1['schema_id'] = 1
schema1['name'] = "age"
schema1['type'] = "STRING"
schema2 = dict()
schema2['schema_id'] = 2
schema2['name'] = "label"
schema2['type'] = "STRING"
schemas = []
schemas.append(schema0)
schemas.append(schema1)
schemas.append(schema2)
create_dataset_resp = Dataset.create_dataset(session, dataset_name=dataset_name, dataset_type=dataset_type, data_sources=data_sources, work_path=work_path, schema=schemas)

Parameters

**Table 1** Request parameters
Name	Mandatory	Type	Description
session	Yes	Object	Session object. For details about the initialization method, see Session Authentication.
dataset_name	Yes	String	Dataset name
data_type	No	String	Data type of a dataset. Either data_type or dataset_type must be specified; data_type is recommended. The options are as follows: IMAGE (image); TEXT (text); AUDIO (audio); TABLE (table); VIDEO (video); PLAIN (custom format).
dataset_type	No	Integer	Labeling type of a dataset. Either data_type or dataset_type must be specified. The options are as follows: 0 (image classification); 1 (object detection); 3 (image segmentation); 100 (text classification); 101 (named entity recognition); 102 (text triplet); 200 (sound classification); 201 (speech content); 202 (speech paragraph labeling); 400 (table dataset); 600 (video labeling); 900 (custom format).
data_sources	Yes	Table 2	Input dataset path, which is used to synchronize source data (such as images, text files, and audio files) in the directory and its subdirectories to the dataset. For a table dataset, this parameter indicates the import directory. The work directory of a table dataset cannot be an OBS path in a KMS-encrypted bucket.
work_path	Yes	Table 6	Output dataset path, which is used to store output files such as label files.
labels	No	List of Table 7	Dataset labels. This parameter must be specified when you create a text triplet dataset.
schema	No	List of Table 9	Schema list, which is used to specify the name and type of the table header of a table dataset
description	No	String	Dataset description consisting of 0 to 256 characters without special characters (^!<>=&"'). The parameter is left blank by default.
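
The constraint on description can be checked locally before a request is sent. The following is a minimal sketch: is_valid_description is a hypothetical helper, not an SDK function, and it encodes only the length limit and the forbidden characters listed in Table 1.

```python
# Characters that Table 1 lists as not allowed in a dataset description.
FORBIDDEN_DESCRIPTION_CHARS = set("^!<>=&\"'")


def is_valid_description(description):
    """Return True if the text has 0 to 256 characters and no forbidden character."""
    if len(description) > 256:
        return False
    return not (set(description) & FORBIDDEN_DESCRIPTION_CHARS)
```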

**Table 2** **DataSource** parameters
Name	Mandatory	Type	Description
type	Yes	Integer	Data source type. The options are as follows: 0 (OBS bucket, default value); 5 (dataset downloaded from AI Gallery).
path	Yes	String	Data source path. Newline characters (\n), carriage return characters (\r), and tab characters (\t) are not allowed.
content_info	No	Table 3	Dataset asset downloaded from AI Gallery
annotation_config	No	Table 4	Data labeling format. Supported for the following scenarios: image classification, object detection, text classification, and sound classification.
with_column_header	No	Boolean	Whether the first row of a table is the table header. This parameter is mandatory for table datasets. True: The first row of a table is used as the table header. False: The first row of a table is not used as the table header, but only as sample data.

**Table 3** **ContentInfo** parameters
Name	Mandatory	Type	Description
content_id	Yes	String	Dataset asset ID in AI Gallery
version_id	Yes	String	Dataset asset version ID in AI Gallery

**Table 4** **AnnotationConfig** parameters
Name	Mandatory	Type	Description
scene	Yes	String	Supported labeling scenarios. The options are as follows: image_classification; object_detection; text_classification; audio_classification.
format_name	Yes	String	Labeling format in different scenarios. The options are as follows: image_classification (ModelArts imageNet 1.0, ModelArts image classification 1.0); object_detection (ModelArts PASCAL VOC 1.0, YOLO); text_classification (ModelArts text classification 1.0, ModelArts text classification combine 1.0); audio_classification (ModelArts audio classification dir 1.0).
parameters	No	Table 5	Advanced labeling format parameters, such as the sample separator
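
The pairing of scene and format_name in Table 4 can be checked before the configuration is attached to data_sources. The following is a minimal sketch: ANNOTATION_FORMATS and make_annotation_config are hypothetical names, not SDK members, and the grouping of format names under each scene is read from Table 4.

```python
# Scene-to-format mapping transcribed from Table 4.
ANNOTATION_FORMATS = {
    "image_classification": [
        "ModelArts imageNet 1.0",
        "ModelArts image classification 1.0",
    ],
    "object_detection": ["ModelArts PASCAL VOC 1.0", "YOLO"],
    "text_classification": [
        "ModelArts text classification 1.0",
        "ModelArts text classification combine 1.0",
    ],
    "audio_classification": ["ModelArts audio classification dir 1.0"],
}


def make_annotation_config(scene, format_name):
    """Return an annotation_config dict after checking the scene/format pair."""
    if scene not in ANNOTATION_FORMATS:
        raise ValueError("Unsupported scene: %s" % scene)
    if format_name not in ANNOTATION_FORMATS[scene]:
        raise ValueError("Format %r is not listed for scene %s" % (format_name, scene))
    return {"scene": scene, "format_name": format_name}
```

The returned dictionary is assigned in the same way as in Example 2: data_sources['annotation_config'] = make_annotation_config("image_classification", "ModelArts image classification 1.0").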

**Table 5** **AnnotationConfigParam** parameters
Name	Mandatory	Type	Description
included_labels	No	List of Table 7	Import only samples with specified labels.
sample_label_separator	No	String	Separator between text and labels. The separator contains only one character, which must be a letter, digit, or one of the following characters (@#¥%^&*_=\|?/':.;,). The separator must be escaped.
label_separator	No	String	Separator between labels. The separator contains only one character, which must be a letter, digit, or one of the following characters (@#¥%^&*_=\|?/':.;,). The separator must be escaped.
difficult_only	No	Boolean	Whether to import only hard examples.

**Table 6** **WorkPath** parameters
Parameter	Mandatory	Type	Description
type	Yes	Integer	Work path type. The options are as follows: 0 (OBS bucket, default value).
path	Yes	String	Output dataset path, which is used to store output files such as label files. The format is "/Bucket name/File path", for example, /obs-bucket/flower/rose/ (directory used as the path). A bucket cannot be used as a path. The output path must be different from the input path and its subdirectories. The parameter consists of 3 to 700 characters. Newline characters (\n), carriage return characters (\r), and tab characters (\t) are not allowed.
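
The rules for the work path in Table 6 can be checked locally before a dataset is created. The following is a minimal sketch: validate_work_path is a hypothetical helper, not an SDK function. Comparing paths after collapsing repeated slashes is an assumption of this sketch; the service may compare paths differently.

```python
def _normalize(path):
    """Collapse empty segments and return the path with a trailing slash."""
    parts = [p for p in path.split("/") if p]
    return "/" + "/".join(parts) + "/", parts


def validate_work_path(path, input_path):
    """Raise ValueError if path breaks a Table 6 rule; otherwise return path."""
    if not 3 <= len(path) <= 700:
        raise ValueError("The path must consist of 3 to 700 characters.")
    if any(ch in path for ch in "\n\r\t"):
        raise ValueError("Newline, carriage return, and tab characters are not allowed.")
    if not path.startswith("/"):
        raise ValueError("The path format is /Bucket name/File path.")

    normalized_out, parts = _normalize(path)
    if len(parts) < 2:
        raise ValueError("A bucket cannot be used as a path.")

    normalized_in, _ = _normalize(input_path)
    # The output path must differ from the input path and its subdirectories.
    if normalized_out.startswith(normalized_in):
        raise ValueError("The output path must be outside the input path.")
    return path
```

For the paths in Example 1, validate_work_path("/obs-gaia-test/data/output/work_path/", "/obs-gaia-test/data/image/image-classification/") returns the work path unchanged.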

**Table 7** **Label** parameters
Parameter	Mandatory	Type	Description
name	Yes	String	Label name
type	Yes	Integer	Label type. The options are as follows: 0 (image classification); 1 (object detection); 3 (image segmentation); 100 (text classification); 101 (named entity); 102 (text triplet relationship); 200 (sound classification); 201 (speech content); 202 (speech paragraph labeling); 600 (video labeling).
property	No	Table 8	Basic attribute key-value pair of a label, such as color

**Table 8** **LabelProperty** parameters
Parameter	Mandatory	Type	Description
@modelarts:color	No	String	(Built-in attribute) Label color, specified as a hexadecimal color code, for example, #FFFFF0. By default, this parameter is left blank.
@modelarts:from_type	No	String	(Built-in attribute) Type of the head entity in a triplet relationship label. This attribute must be specified when a relationship label is created. This parameter is only used in text triplet datasets.
@modelarts:to_type	No	String	(Built-in attribute) Type of the tail entity in a triplet relationship label. This attribute must be specified when a relationship label is created. This parameter is only used in text triplet datasets.
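
The labels list for a text triplet dataset combines entity labels (type 101) and relationship labels (type 102) whose properties name the head and tail entities. The following is a minimal sketch: triplet_labels is a hypothetical helper, not an SDK function. Checking that each relationship refers to a declared entity label is an assumption of this sketch, not a rule stated on this page.

```python
ENTITY_LABEL = 101        # Named entity label type (Table 7)
RELATIONSHIP_LABEL = 102  # Text triplet relationship label type (Table 7)


def triplet_labels(entities, relationships):
    """Build a labels list.

    entities: list of entity label names.
    relationships: list of (name, from_entity, to_entity) tuples.
    """
    known = set(entities)
    labels = [{"name": name, "type": ENTITY_LABEL} for name in entities]
    for name, from_entity, to_entity in relationships:
        # Assumption: both ends must be entity labels declared above.
        if from_entity not in known or to_entity not in known:
            raise ValueError("Relationship %r refers to an undeclared entity." % name)
        labels.append({
            "name": name,
            "type": RELATIONSHIP_LABEL,
            "property": {
                "@modelarts:from_type": from_entity,  # Head entity
                "@modelarts:to_type": to_entity,      # Tail entity
            },
        })
    return labels
```

Calling triplet_labels(["Disease", "Disease alias"], [("Also called", "Disease", "Disease alias")]) produces the same three labels that Example 5 builds by hand.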

**Table 9** **Schema** parameters
Parameter	Mandatory	Type	Description
schema_id	No	Integer	Schema ID
name	No	String	Schema name
type	No	String	Schema value type. The options are as follows: STRING, SHORT, INT, LONG, DOUBLE, FLOAT, BYTE, DATE, TIMESTAMP, BOOLEAN.
description	No	String	Schema description
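
The schema list for a table dataset can be generated from column names and types instead of being written out entry by entry. The following is a minimal sketch: build_schema is a hypothetical helper, not an SDK function. Assigning schema_id from the zero-based column position follows Examples 3 and 6 and is otherwise an assumption.

```python
# Value types listed in Table 9.
SCHEMA_TYPES = {
    "STRING", "SHORT", "INT", "LONG", "DOUBLE",
    "FLOAT", "BYTE", "DATE", "TIMESTAMP", "BOOLEAN",
}


def build_schema(columns):
    """Build a schema list from (column name, value type) pairs in column order."""
    schema = []
    for index, (name, value_type) in enumerate(columns):
        if value_type not in SCHEMA_TYPES:
            raise ValueError("Unsupported schema type: %s" % value_type)
        schema.append({"schema_id": index, "name": name, "type": value_type})
    return schema
```

The result is passed as the schema argument, for example schema=build_schema([("name", "STRING"), ("age", "STRING"), ("label", "STRING")]).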