Creating a Dataset Phase
Description
The dataset creation phase integrates the capabilities of the ModelArts dataset module so that a workflow can create datasets of the new version. Use it to bring existing data under centralized management as a dataset. It is usually followed by a dataset import phase or a labeling phase.
Parameter Overview
You can use CreateDatasetStep to create a dataset creation phase. The following tables describe the parameters of CreateDatasetStep and of the objects it uses.
CreateDatasetStep parameters

Parameter | Description | Mandatory | Data Type |
---|---|---|---|
name | Name of a dataset creation phase. The name contains a maximum of 64 characters, including only letters, digits, underscores (_), and hyphens (-). It must start with a letter and must be unique in a workflow. | Yes | str |
inputs | Inputs of the dataset creation phase. | Yes | CreateDatasetInput or a list of CreateDatasetInput |
outputs | Outputs of the dataset creation phase. | Yes | CreateDatasetOutput or a list of CreateDatasetOutput |
properties | Configurations for dataset creation. | Yes | DatasetProperties |
title | Title for frontend display. | No | str |
description | Description of the dataset creation phase. | No | str |
policy | Phase execution policy. | No | StepPolicy |
depend_steps | Dependent phases. | No | Step or step list |
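
The naming rule for name can be checked locally before a workflow is defined. The following is a minimal sketch that uses only the Python standard library. is_valid_step_name is a hypothetical helper, not part of the ModelArts SDK, and it assumes that "letters" means ASCII letters; the SDK may enforce the rule differently.

```python
import re

# Pattern derived from the documented rule: one leading letter followed by
# up to 63 letters, digits, underscores, or hyphens (64 characters in total).
# Assumption: "letters" means ASCII letters.
_STEP_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_-]{0,63}")


def is_valid_step_name(name: str) -> bool:
    """Return True if name satisfies the documented phase naming rule."""
    return _STEP_NAME_PATTERN.fullmatch(name) is not None
```

For example, is_valid_step_name("create_dataset") passes the check, while a name that starts with a digit or is longer than 64 characters does not.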

CreateDatasetInput parameters

Parameter | Description | Mandatory | Data Type |
---|---|---|---|
name | Input name of the dataset creation phase. The name can contain a maximum of 64 characters, including only letters, digits, underscores (_), and hyphens (-), and must start with a letter. The input name of a step must be unique. | Yes | str |
data | Input data object of the dataset creation phase. | Yes | OBS object. Currently, only OBSPath, OBSConsumption, OBSPlaceholder, and DataConsumptionSelector are supported. |

CreateDatasetOutput parameters

Parameter | Description | Mandatory | Data Type |
---|---|---|---|
name | Output name of the dataset creation phase. The name can contain a maximum of 64 characters, including only letters, digits, underscores (_), and hyphens (-), and must start with a letter. The output name of a step must be unique. | Yes | str |
config | Output configurations of the dataset creation phase. | Yes | Currently, only OBSOutputConfig is supported. |
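
Because the input names and the output names of a step must each be unique, it can help to check a list of names before building the step. The sketch below is a hypothetical helper built on the standard library only; it does not call the ModelArts SDK.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return the names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]
```

Pass the names you plan to give to the inputs (or outputs) of one step. An empty result means that no name is repeated.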

DatasetProperties parameters

Parameter | Description | Mandatory | Data Type |
---|---|---|---|
dataset_name | Dataset name. The value contains 1 to 100 characters. Only letters, digits, underscores (_), and hyphens (-) are allowed. | Yes | str, Placeholder |
dataset_format | Dataset format. The default value is 0, indicating the file type. | No | 0: file; 1: table |
data_type | Data type. The default value is FREE_FORMAT. | No | DataTypeEnum |
description | Description of the dataset. | No | str |
import_data | Whether to import data. The default value is False. Currently, only table data is supported. | No | bool |
work_path_type | Type of the dataset output path. Currently, only OBS is supported. The default value is 0. | No | int |
import_config | Configurations for label import. The default value is None. When creating a dataset based on labeled data, you can specify this parameter to import labeling information. | No | ImportConfig |
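
The constraints on dataset_name, dataset_format, and import_data can be sketched as a local pre-check. check_dataset_properties is a hypothetical helper, not an SDK function. It assumes that "letters" means ASCII letters, that dataset_name is a plain string rather than a Placeholder, and that "only table data is supported" means import_data can be True only when dataset_format is 1.

```python
import re

# Assumption: "letters" means ASCII letters.
_DATASET_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,100}")


def check_dataset_properties(dataset_name, dataset_format=0, import_data=False):
    """Return a list of detected problems; an empty list means none were found."""
    problems = []
    if _DATASET_NAME_PATTERN.fullmatch(dataset_name) is None:
        problems.append(
            "dataset_name must be 1 to 100 letters, digits, underscores, or hyphens"
        )
    if dataset_format not in (0, 1):
        problems.append("dataset_format must be 0 (file) or 1 (table)")
    # Assumed reading of the table: data import applies to table datasets only.
    if import_data and dataset_format != 1:
        problems.append("import_data is only supported when dataset_format is 1")
    return problems
```

The default arguments mirror the defaults listed in the table (dataset_format 0 and import_data False).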

ImportConfig parameters

Parameter | Description | Mandatory | Data Type |
---|---|---|---|
import_annotations | Whether to automatically import the labeling information in the input directory, supporting detection, image classification, and text classification. The options are as follows: | No | str, Placeholder |
import_type | Import mode. For the supported values, see ImportTypeEnum in the enumeration table. | No | 0: file type ImportTypeEnum |
annotation_format_config | Configurations of the imported labeling format. | No | AnnotationFormatConfig list |

AnnotationFormatConfig parameters

Parameter | Description | Mandatory | Data Type |
---|---|---|---|
format_name | Name of a labeling format. | No | AnnotationFormatEnum |
scene | Labeling scenario, which is optional. | No | LabelTaskTypeEnum |

Enumeration | Value |
---|---|
ImportTypeEnum | DIR, MANIFEST |
DataTypeEnum | IMAGE, TEXT, AUDIO, TABULAR, VIDEO, FREE_FORMAT |
AnnotationFormatEnum | MA_IMAGE_CLASSIFICATION_V1, MA_IMAGENET_V1, MA_PASCAL_VOC_V1, YOLO, MA_IMAGE_SEGMENTATION_V1, MA_TEXT_CLASSIFICATION_COMBINE_V1, MA_TEXT_CLASSIFICATION_V1, MA_AUDIO_CLASSIFICATION_DIR_V1 |
Examples
There are two scenarios:
- Creating a dataset using unlabeled data
- Creating a dataset using labeled data with labels imported
Creating a dataset using unlabeled data
Data preparation: Store unlabeled data in an OBS folder.

    from modelarts import workflow as wf

    # Use CreateDatasetStep to create a dataset of the new version using OBS data.

    # Define parameters of the dataset output path.
    dataset_output_path = wf.Placeholder(name="dataset_output_path", placeholder_type=wf.PlaceholderType.STR, placeholder_format="obs")

    # Define the dataset name.
    dataset_name = wf.Placeholder(name="dataset_name", placeholder_type=wf.PlaceholderType.STR)

    create_dataset = wf.steps.CreateDatasetStep(
        # Name of a dataset creation phase. The name contains a maximum of 64 characters, including only letters,
        # digits, underscores (_), and hyphens (-). It must start with a letter and must be unique in a workflow.
        name="create_dataset",
        # Title, which defaults to the value of name
        title="Dataset Creation",
        # CreateDatasetStep inputs, configured when the workflow is running; the data field can also be
        # represented by the wf.data.OBSPath(obs_path="fake_obs_path") object.
        inputs=wf.steps.CreateDatasetInput(name="input_name", data=wf.data.OBSPlaceholder(name="obs_placeholder_name", object_type="directory")),
        # CreateDatasetStep outputs
        outputs=wf.steps.CreateDatasetOutput(name="output_name", config=wf.data.OBSOutputConfig(obs_path=dataset_output_path)),
        properties=wf.steps.DatasetProperties(
            # If the dataset name does not exist, a dataset will be created using this name.
            # If the dataset name exists, the corresponding dataset will be used.
            dataset_name=dataset_name,
            # Data type of the dataset, for example, image
            data_type=wf.data.DataTypeEnum.IMAGE,
        )
    )
    # Ensure that the dataset name is not used by others under the account.
    # Otherwise, the dataset created by others will be used in the subsequent phases.

    workflow = wf.Workflow(
        name="create-dataset-demo",
        desc="this is a demo workflow",
        steps=[create_dataset]
    )

Creating a dataset using labeled data with labels imported
Data preparation: Store labeled data in an OBS folder.
For details about specifications for importing labeled data from an OBS directory, see Specifications for Importing Data from an OBS Directory.

    from modelarts import workflow as wf

    # Use CreateDatasetStep to create a dataset of the new version using OBS data.

    # Define parameters of the dataset output path.
    dataset_output_path = wf.Placeholder(name="dataset_output_path", placeholder_type=wf.PlaceholderType.STR, placeholder_format="obs")

    # Define the dataset name.
    dataset_name = wf.Placeholder(name="dataset_name", placeholder_type=wf.PlaceholderType.STR)

    create_dataset = wf.steps.CreateDatasetStep(
        # Name of a dataset creation phase. The name contains a maximum of 64 characters, including only letters,
        # digits, underscores (_), and hyphens (-). It must start with a letter and must be unique in a workflow.
        name="create_dataset",
        # Title, which defaults to the value of name
        title="Dataset Creation",
        # CreateDatasetStep inputs, configured when the workflow is running; the data field can also be
        # represented by the wf.data.OBSPath(obs_path="fake_obs_path") object.
        inputs=wf.steps.CreateDatasetInput(name="input_name", data=wf.data.OBSPlaceholder(name="obs_placeholder_name", object_type="directory")),
        # CreateDatasetStep outputs
        outputs=wf.steps.CreateDatasetOutput(name="output_name", config=wf.data.OBSOutputConfig(obs_path=dataset_output_path)),
        properties=wf.steps.DatasetProperties(
            # If the dataset name does not exist, a dataset will be created using this name.
            # If the dataset name exists, the corresponding dataset will be used.
            dataset_name=dataset_name,
            # Data type of the dataset, for example, image
            data_type=wf.data.DataTypeEnum.IMAGE,
            import_config=wf.steps.ImportConfig(
                annotation_format_config=[
                    wf.steps.AnnotationFormatConfig(
                        # Labeling format of labeled data
                        format_name=wf.steps.AnnotationFormatEnum.MA_IMAGE_CLASSIFICATION_V1,
                        # Labeling scene
                        scene=wf.data.LabelTaskTypeEnum.IMAGE_CLASSIFICATION
                    )
                ]
            )
        )
    )
    # Ensure that the dataset name is not used by others under the account.
    # Otherwise, the dataset created by others will be used in the subsequent phases.

    workflow = wf.Workflow(
        name="create-dataset-demo",
        desc="this is a demo workflow",
        steps=[create_dataset]
    )