Creating a Dataset
Create a dataset whose data can be imported from OBS.
create_dataset(session, dataset_name=None, data_type=None, data_sources=None, work_path=None, dataset_type=None, **kwargs)
Use either of the following methods to create a dataset:
- Create a dataset based on the labeling type. One dataset supports only one labeling task type.
create_dataset(session, dataset_name=None, dataset_type=None, data_sources=None, work_path=None, **kwargs)
- Create a dataset based on the data type. You can create different types of labeling tasks on the same dataset. For example, create image classification and object detection labeling tasks on an image dataset.
create_dataset(session, dataset_name=None, data_type=None, data_sources=None, work_path=None, **kwargs)
You are advised to create a dataset based on the data type. Creating a dataset based on the labeling type will be discontinued.
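The choice between the two methods can be sketched as a small helper that assembles the keyword arguments before the SDK call. This is a minimal sketch, not part of the SDK: `build_create_kwargs` is a hypothetical name, and rejecting calls that pass both `data_type` and `dataset_type` is an assumption drawn from the "either ... or" wording above.

```python
def build_create_kwargs(dataset_name, data_sources, work_path,
                        data_type=None, dataset_type=None, **extra):
    """Assemble keyword arguments for Dataset.create_dataset (hypothetical helper)."""
    # Assumption: exactly one of data_type / dataset_type should be given.
    if (data_type is None) == (dataset_type is None):
        raise ValueError("Specify exactly one of data_type or dataset_type")

    kwargs = {
        "dataset_name": dataset_name,
        "data_sources": data_sources,
        "work_path": work_path,
    }
    if data_type is not None:
        kwargs["data_type"] = data_type        # recommended method
    else:
        kwargs["dataset_type"] = dataset_type  # labeling-type method
    kwargs.update(extra)                       # e.g. labels=..., schema=...
    return kwargs


kwargs = build_create_kwargs(
    "dataset-image",
    {"type": 0, "path": "/obs-gaia-test/data/image/image-classification/"},
    {"type": 0, "path": "/obs-gaia-test/data/output/work_path/"},
    data_type="IMAGE",
)
```

The result could then be passed on as `Dataset.create_dataset(session, **kwargs)`.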
Sample Code
- Example 1: Create an image dataset based on the data type.

    from modelarts.session import Session
    from modelarts.dataset import Dataset

    session = Session()

    dataset_name = "dataset-image"  # Dataset name
    data_type = "IMAGE"  # Dataset type, which is an image dataset
    data_sources = dict()  # Dataset data source
    data_sources["type"] = 0  # Data source type. Value 0 indicates OBS.
    data_sources["path"] = "/obs-gaia-test/data/image/image-classification/"  # Path for storing data in OBS
    work_path = dict()  # Work directory of the dataset
    work_path['type'] = 0  # Working directory type of the dataset. Value 0 indicates OBS.
    work_path['path'] = "/obs-gaia-test/data/output/work_path/"  # Path for the working directory of the dataset in OBS

    create_dataset_resp = Dataset.create_dataset(session, dataset_name=dataset_name,
                                                 data_type=data_type, data_sources=data_sources,
                                                 work_path=work_path)

- Example 2: Create an image dataset based on the data type (with labels imported).

    from modelarts.session import Session
    from modelarts.dataset import Dataset

    session = Session()

    dataset_name = "dataset-image-with-annotations"
    data_type = "IMAGE"
    data_sources = dict()
    data_sources["type"] = 0
    data_sources["path"] = "/obs-gaia-test/data/image/image-classification/"
    annotation_config = dict()  # Labeling format of the source data
    annotation_config['scene'] = "image_classification"  # Image classification labeling
    annotation_config['format_name'] = "ModelArts image classification 1.0"  # Labeling format of ModelArts image classification 1.0
    data_sources['annotation_config'] = annotation_config
    work_path = dict()
    work_path['type'] = 0
    work_path['path'] = "/obs-gaia-test/data/output/work_path/"

    create_dataset_resp = Dataset.create_dataset(session, dataset_name=dataset_name,
                                                 data_type=data_type, data_sources=data_sources,
                                                 work_path=work_path)

- Example 3: Create a table dataset based on the data type.

    from modelarts.session import Session
    from modelarts.dataset import Dataset

    session = Session()

    dataset_name = "dataset-table"
    data_type = "TABLE"
    data_sources = dict()
    data_sources["type"] = 0
    data_sources["path"] = "/obs-gaia-test/data/table/table0/"
    data_sources['with_column_header'] = True
    work_path = dict()
    work_path['type'] = 0
    work_path['path'] = "/obs-gaia-test/data/output/work_path/"

    # Schema information of the table data needs to be specified for the table dataset.
    schema0 = dict()
    schema0['schema_id'] = 0
    schema0['name'] = "name"
    schema0['type'] = "STRING"
    schema1 = dict()
    schema1['schema_id'] = 1
    schema1['name'] = "age"
    schema1['type'] = "STRING"
    schema2 = dict()
    schema2['schema_id'] = 2
    schema2['name'] = "label"
    schema2['type'] = "STRING"
    schemas = []
    schemas.append(schema0)
    schemas.append(schema1)
    schemas.append(schema2)

    create_dataset_resp = Dataset.create_dataset(session, dataset_name=dataset_name,
                                                 data_type=data_type, data_sources=data_sources,
                                                 work_path=work_path, schema=schemas)

- Example 4: Create an image classification dataset based on the labeling type.

    from modelarts.session import Session
    from modelarts.dataset import Dataset

    session = Session()

    dataset_name = "dataset-image-classification"
    dataset_type = 0  # Dataset labeling type. Value 0 indicates image classification.
    data_sources = dict()
    data_sources["path"] = "/obs-gaia-test/data/image/image-classification/"
    data_sources["type"] = 0  # Data source type is an integer. Value 0 indicates OBS.
    work_path = dict()
    work_path['type'] = 0
    work_path['path'] = "/obs-gaia-test/data/output/work_path/"

    create_dataset_resp = Dataset.create_dataset(session, dataset_name=dataset_name,
                                                 dataset_type=dataset_type, data_sources=data_sources,
                                                 work_path=work_path)

- Example 5: Create a text triplet dataset based on the labeling type.

    from modelarts.session import Session
    from modelarts.dataset import Dataset

    session = Session()

    dataset_name = "dataset-text-triplet"
    dataset_type = 102  # Dataset labeling type. Value 102 indicates text triplet.
    data_sources = dict()
    data_sources['type'] = 0
    data_sources['path'] = "/obs-gaia-test/data/text/text-classification/"
    work_path = dict()
    work_path['type'] = 0
    work_path['path'] = "/obs-gaia-test/data/output/work_path/"

    # Create a dataset of the text triplet labeling type with labels imported.
    label_entity1 = dict()  # Label object
    label_entity1['name'] = "Disease"  # Label name
    label_entity1['type'] = 101  # Label type. Value 101 indicates an entity.
    label_entity2 = dict()
    label_entity2['name'] = "Disease alias"
    label_entity2['type'] = 101
    label_relation1 = dict()
    label_relation1['name'] = "Also called"
    label_relation1['type'] = 102  # Label type. Value 102 indicates a relationship.
    # For a relationship label, the start entity label and end entity label must be specified in the label properties.
    # The variable is named relation_property to avoid shadowing the Python built-in property.
    relation_property = dict()
    relation_property['@modelarts:from_type'] = "Disease"  # Start entity label
    relation_property['@modelarts:to_type'] = "Disease alias"  # End entity label
    label_relation1['property'] = relation_property
    labels = []
    labels.append(label_entity1)
    labels.append(label_entity2)
    labels.append(label_relation1)

    create_dataset_resp = Dataset.create_dataset(session, dataset_name=dataset_name,
                                                 dataset_type=dataset_type, data_sources=data_sources,
                                                 work_path=work_path, labels=labels)

- Example 6: Create a table dataset based on the labeling type.

    from modelarts.session import Session
    from modelarts.dataset import Dataset

    session = Session()

    dataset_name = "dataset-table"
    dataset_type = 400  # Dataset labeling type. Value 400 indicates a table dataset.
    data_sources = dict()
    data_sources['type'] = 0
    data_sources['path'] = "/obs-gaia-test/data/table/table0/"
    data_sources['with_column_header'] = True  # Whether the table data contains a table header
    work_path = dict()
    work_path['type'] = 0
    work_path['path'] = "/obs-gaia-test/data/output/work_path/"

    # The table header of the table data needs to be imported to the table dataset.
    schema0 = dict()  # Table header
    schema0['schema_id'] = 0  # Header of the first column
    schema0['name'] = "name"  # Table header name, which is name in the column
    schema0['type'] = "STRING"  # Data type of the table header, indicating a character string
    schema1 = dict()
    schema1['schema_id'] = 1
    schema1['name'] = "age"
    schema1['type'] = "STRING"
    schema2 = dict()
    schema2['schema_id'] = 2
    schema2['name'] = "label"
    schema2['type'] = "STRING"
    schemas = []
    schemas.append(schema0)
    schemas.append(schema1)
    schemas.append(schema2)

    create_dataset_resp = Dataset.create_dataset(session, dataset_name=dataset_name,
                                                 dataset_type=dataset_type, data_sources=data_sources,
                                                 work_path=work_path, schema=schemas)

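The table examples build one schema dictionary per column by hand. As a compact alternative, the same list can be generated from the column names and types. This is a sketch only: `build_schema` is a hypothetical helper, not an SDK function, and it assumes `schema_id` is simply the zero-based column position, as in the examples above.

```python
def build_schema(columns):
    """Return a schema list for a table dataset.

    columns: iterable of (name, type) pairs, in column order.
    """
    return [
        {"schema_id": index, "name": name, "type": column_type}
        for index, (name, column_type) in enumerate(columns)
    ]


# Produces the same list as the schema0/schema1/schema2 code in the table examples.
schemas = build_schema([("name", "STRING"), ("age", "STRING"), ("label", "STRING")])
```

The resulting list can be passed as `schema=schemas` in the `create_dataset` call.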
Parameters
Table 1 create_dataset parameters

Name | Mandatory | Type | Description |
---|---|---|---|
session | Yes | Object | Session object. For details about the initialization method, see Session Authentication. |
dataset_name | Yes | String | Dataset name |
data_type | No | String | Data type of a dataset. Either data_type or dataset_type must be specified. data_type is recommended. The options are as follows: |
dataset_type | No | Integer | Labeling type of a dataset. Either data_type or dataset_type must be specified. The options are as follows: |
data_sources | Yes | | Input dataset path, which is used to synchronize source data (such as images, text files, and audio files) in the directory and its subdirectories to the dataset. For a table dataset, this parameter indicates the import directory. The work directory of a table dataset cannot be an OBS path in a KMS-encrypted bucket. |
work_path | Yes | | Output dataset path, which is used to store output files such as label files. |
labels | No | List of Table 7 | Dataset labels. This parameter must be specified when you create a text triplet dataset. |
schema | No | List of Table 9 | Schema list, which is used to specify the name and type of the table header of a table dataset |
description | No | String | Dataset description consisting of 0 to 256 characters without special characters (^!<>=&"'). The parameter is left blank by default. |
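
The limits on the description parameter (0 to 256 characters, none of the characters ^!<>=&"') can be checked locally before the request is sent. The following is a minimal sketch; `validate_description` is a hypothetical helper, and how the service itself reports a violation is not covered here.

```python
FORBIDDEN_CHARS = set("^!<>=&\"'")  # special characters listed for description
MAX_LENGTH = 256


def validate_description(description=""):
    """Return description unchanged if it satisfies the documented limits."""
    if len(description) > MAX_LENGTH:
        raise ValueError(
            f"description has {len(description)} characters; the limit is {MAX_LENGTH}"
        )
    found = sorted(set(description) & FORBIDDEN_CHARS)
    if found:
        raise ValueError(f"description contains forbidden characters: {''.join(found)}")
    return description


description = validate_description("Flower images for classification")
```

The validated string can then be passed as `description=description` in the `create_dataset` call.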

Table 2 data_sources fields

Name | Mandatory | Type | Description |
---|---|---|---|
type | Yes | Integer | Data source type. The options are as follows: |
path | Yes | String | Data source path |
content_info | No | | Dataset asset downloaded from the AI Gallery |
annotation_config | No | | Data labeling format, which can be: |
with_column_header | No | Boolean | Whether the first row of a table is the table header. This parameter is mandatory for table datasets. |
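
The data_sources fields can be assembled with a small builder that enforces the rule that with_column_header is mandatory for table datasets. This is a sketch under stated assumptions: `build_data_sources` and its `table` flag are hypothetical, and the only source type used is 0, which the sample code identifies as OBS.

```python
OBS = 0  # Data source type. Value 0 indicates OBS (taken from the sample code).


def build_data_sources(path, table=False, with_column_header=None,
                       annotation_config=None):
    """Build the data_sources dictionary for create_dataset (hypothetical helper)."""
    data_sources = {"type": OBS, "path": path}
    if table:
        # with_column_header is mandatory for table datasets.
        if with_column_header is None:
            raise ValueError("with_column_header is mandatory for table datasets")
        data_sources["with_column_header"] = bool(with_column_header)
    if annotation_config is not None:
        data_sources["annotation_config"] = annotation_config
    return data_sources


image_sources = build_data_sources("/obs-gaia-test/data/image/image-classification/")
table_sources = build_data_sources("/obs-gaia-test/data/table/table0/",
                                   table=True, with_column_header=True)
```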

Table 3 content_info fields

Name | Mandatory | Type | Description |
---|---|---|---|
content_id | Yes | String | Dataset asset ID in AI Gallery |
version_id | Yes | String | Dataset asset version ID in AI Gallery |

Table 4 annotation_config fields

Name | Mandatory | Type | Description |
---|---|---|---|
scene | Yes | String | Supported labeling scenarios. The options are as follows: |
format_name | Yes | String | Labeling format in different scenarios. The options are as follows: |
parameters | No | | Advanced labeling format parameters, such as the sample separator |

Table 5 annotation_config parameters fields

Name | Mandatory | Type | Description |
---|---|---|---|
included_labels | No | List of Table 7 | Import only samples with specified labels. |
sample_label_separator | No | String | Separator between text and labels. The separator contains only one character, which must be a letter, digit, or one of the following characters (@#¥%^&*_=\|?/':.;,). The separator must be escaped. |
label_separator | No | String | Separator between labels. The separator contains only one character, which must be a letter, digit, or one of the following characters (@#¥%^&*_=\|?/':.;,). The separator must be escaped. |
difficult_only | No | Boolean | Whether to import only hard examples. |
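
The single-character rule for sample_label_separator and label_separator can be checked before the values are placed in the parameters dictionary. A minimal sketch follows; `is_valid_separator` is a hypothetical helper, "letter" and "digit" are assumed to mean ASCII letters and digits, and the escaping requirement is not handled here.

```python
# Symbols permitted as separators, copied from the parameter description.
ALLOWED_SYMBOLS = set("@#¥%^&*_=|?/':.;,")


def is_valid_separator(separator):
    """Return True if separator is one allowed character."""
    if len(separator) != 1:
        return False
    # Assumption: letters and digits are limited to ASCII.
    is_ascii_alnum = separator.isascii() and separator.isalnum()
    return is_ascii_alnum or separator in ALLOWED_SYMBOLS


parameters = {"sample_label_separator": "=", "label_separator": ","}
all_valid = all(is_valid_separator(value) for value in parameters.values())
```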

Table 6 work_path fields

Parameter | Mandatory | Type | Description |
---|---|---|---|
type | Yes | Integer | Working directory type. The options are as follows: |
path | Yes | String | Output dataset path, which is used to store output files such as label files. |

Table 7 label fields

Parameter | Mandatory | Type | Description |
---|---|---|---|
name | Yes | String | Label name |
type | Yes | Integer | Label type. The options are as follows: |
property | No | | Basic attribute key-value pair of a label, such as color |

Table 8 label property fields

Parameter | Mandatory | Type | Description |
---|---|---|---|
@modelarts:color | No | String | (Built-in attribute) Label color, specified as a hexadecimal color code, for example, #FFFFF0. By default, this parameter is left blank. |
@modelarts:from_type | No | String | (Built-in attribute) Type of the head entity in a triplet relationship label. This attribute must be specified when a relationship label is created. This parameter is only used in text triplet datasets. |
@modelarts:to_type | No | String | (Built-in attribute) Type of the tail entity in a triplet relationship label. This attribute must be specified when a relationship label is created. This parameter is only used in text triplet datasets. |
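
Because @modelarts:from_type and @modelarts:to_type must both be present on a relationship label, a small constructor can enforce this when labels are built for a text triplet dataset. This is a sketch only: `make_label` is a hypothetical helper, the type values 101 (entity) and 102 (relationship) come from the sample code, and the six-digit hexadecimal color check is an assumption based on the #FFFFF0 example.

```python
import re

ENTITY = 101    # Label type for an entity (from the sample code)
RELATION = 102  # Label type for a relationship (from the sample code)


def make_label(name, label_type, from_type=None, to_type=None, color=None):
    """Build one label dictionary for the labels list (hypothetical helper)."""
    label = {"name": name, "type": label_type}
    properties = {}
    if label_type == RELATION:
        # A relationship label needs both its head and tail entity types.
        if not from_type or not to_type:
            raise ValueError("relationship labels need from_type and to_type")
        properties["@modelarts:from_type"] = from_type
        properties["@modelarts:to_type"] = to_type
    if color is not None:
        # Assumption: colors are written as '#' followed by six hex digits.
        if not re.fullmatch(r"#[0-9A-Fa-f]{6}", color):
            raise ValueError(f"not a hexadecimal color code: {color}")
        properties["@modelarts:color"] = color
    if properties:
        label["property"] = properties
    return label


labels = [
    make_label("Disease", ENTITY),
    make_label("Disease alias", ENTITY),
    make_label("Also called", RELATION, from_type="Disease", to_type="Disease alias"),
]
```

The list matches the labels built in the text triplet example and can be passed as `labels=labels`.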