Updated on 2024-10-29 GMT+08:00

Creating a Dataset Labeling Phase

Description

This phase integrates the capabilities of the ModelArts dataset module so that you can label datasets within a workflow. Use a labeling phase to create a labeling job or to label data in an existing labeling job.

Parameter Overview

You can use LabelingStep to create a labeling phase. The following tables describe the parameters of LabelingStep and of the objects it uses.

Table 1 LabelingStep

| Parameter | Description | Mandatory | Data Type |
|---|---|---|---|
| name | Name of a labeling phase. The name contains a maximum of 64 characters, including only letters, digits, underscores (_), and hyphens (-). It must start with a letter and must be unique in a workflow. | Yes | str |
| inputs | Inputs of the labeling phase. | Yes | LabelingInput or LabelingInput list |
| outputs | Outputs of the labeling phase. | Yes | LabelingOutput or LabelingOutput list |
| properties | Configurations for dataset labeling. | Yes | LabelTaskProperties |
| title | Title for frontend display. | No | str |
| description | Description of the labeling phase. | No | str |
| policy | Phase execution policy. | No | StepPolicy |
| depend_steps | Dependent phases. | No | Step or step list |

Table 2 LabelingInput

| Parameter | Description | Mandatory | Data Type |
|---|---|---|---|
| name | Input name of the labeling phase. The name can contain a maximum of 64 characters, including only letters, digits, underscores (_), and hyphens (-), and must start with a letter. The input name of a step must be unique. | Yes | str |
| data | Input data object of the labeling phase. | Yes | Dataset or labeling job object. Currently, only Dataset, DatasetConsumption, DatasetPlaceholder, LabelTask, LabelTaskPlaceholder, LabelTaskConsumption, and DataConsumptionSelector are supported. |

Table 3 LabelingOutput

| Parameter | Description | Mandatory | Data Type |
|---|---|---|---|
| name | Output name of the labeling phase. The name can contain a maximum of 64 characters, including only letters, digits, underscores (_), and hyphens (-), and must start with a letter. The output name of a step must be unique. | Yes | str |
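The naming rule shared by the phase name, input names, and output names can be checked locally before a workflow is submitted. The following is a minimal sketch, not part of the ModelArts SDK: the helper names is_valid_name and find_duplicates are hypothetical, and "letters" is assumed to mean ASCII letters.

```python
import re

# Rule stated for phase, input, and output names: at most 64 characters,
# only letters, digits, underscores (_), and hyphens (-), starting with a letter.
# Assumption: "letters" means ASCII letters.
_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_-]{0,63}")


def is_valid_name(name):
    """Return True if name satisfies the documented naming rule."""
    return isinstance(name, str) and _NAME_PATTERN.fullmatch(name) is not None


def find_duplicates(names):
    """Return the names that occur more than once, in first-seen order."""
    seen, duplicates = set(), []
    for name in names:
        if name in seen and name not in duplicates:
            duplicates.append(name)
        seen.add(name)
    return duplicates
```

For example, is_valid_name("labeling") passes while is_valid_name("1labeling") fails because the name starts with a digit. Passing all phase names of a workflow to find_duplicates gives a quick check of the uniqueness requirement.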

Table 4 LabelTaskProperties

| Parameter | Description | Mandatory | Data Type |
|---|---|---|---|
| task_type | Type of a labeling job. Jobs of the specified type are returned. | Yes | LabelTaskTypeEnum |
| task_name | Labeling job name. The value contains 1 to 100 characters, including only letters, digits, hyphens (-), and underscores (_). This parameter is mandatory when the input is a dataset object. | No | str, Placeholder |
| labels | Labels to be created. | No | Label |
| properties | Attributes of a labeling job. You can update this field to record custom information. | No | dict |
| auto_sync_dataset | Whether to automatically synchronize the result of a labeling job to the dataset. true (default): The labeling result of the labeling job is automatically synchronized to the dataset. false: The labeling result of the labeling job is not automatically synchronized to the dataset. | No | bool |
| content_labeling | Whether to enable content labeling for speech paragraph labeling. This function is enabled by default. | No | bool |
| description | Labeling job description. The description contains 0 to 256 characters and does not support the following special characters: ^!<>=&"' | No | str |
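The constraints on task_name and description can also be checked before the values are passed to LabelTaskProperties. The following is a minimal sketch under stated assumptions: the helper names are hypothetical and not part of the ModelArts SDK, "letters" is assumed to mean ASCII letters, and the checks apply to plain strings only (a Placeholder is resolved when the workflow runs and cannot be checked this way).

```python
import re

# task_name: 1 to 100 characters, only letters, digits, hyphens (-), and underscores (_).
# Unlike phase names, no "must start with a letter" rule is documented for task_name.
_TASK_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,100}")

# description: 0 to 256 characters, none of the special characters ^!<>=&"'
_FORBIDDEN_DESCRIPTION_CHARS = set("^!<>=&\"'")


def is_valid_task_name(task_name):
    """Return True if task_name satisfies the documented length and character rule."""
    return isinstance(task_name, str) and _TASK_NAME_PATTERN.fullmatch(task_name) is not None


def is_valid_description(description):
    """Return True if description is short enough and has no forbidden characters."""
    return (
        isinstance(description, str)
        and len(description) <= 256
        and not (_FORBIDDEN_DESCRIPTION_CHARS & set(description))
    )
```

For example, is_valid_description("accuracy > 90") fails because the text contains the unsupported character >.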

Table 5 Label

| Parameter | Description | Mandatory | Data Type |
|---|---|---|---|
| name | Label name. | No | str |
| property | Basic attribute key-value pairs of a label, such as the color and shortcut keys. | No | str, dict, Placeholder |
| type | Label type. | No | LabelTypeEnum |

| Enumeration | Value |
|---|---|
| LabelTaskTypeEnum | IMAGE_CLASSIFICATION, OBJECT_DETECTION, IMAGE_SEGMENTATION, TEXT_CLASSIFICATION, NAMED_ENTITY_RECOGNITION, TEXT_TRIPLE, AUDIO_CLASSIFICATION, SPEECH_CONTENT, SPEECH_SEGMENTATION, DATASET_TABULAR, VIDEO_ANNOTATION, FREE_FORMAT |

Sample Code of a Dataset Labeling Phase

There are three scenarios:

  • Scenario 1: Creating a labeling job for a specified dataset and labeling the dataset

    Use cases:

    • You have created only one unlabeled dataset and need to label it when the workflow is running.
    • After a dataset is imported, the dataset needs to be labeled.
    Data preparation: Create a dataset on the ModelArts console.
    from modelarts import workflow as wf
    # Use LabelingStep to create a labeling job for the input dataset and label it.
    
    # Define an input dataset.
    dataset = wf.data.DatasetPlaceholder(name="input_dataset")
    
    # Define a placeholder for the labeling job name.
    task_name = wf.Placeholder(name="placeholder_name", placeholder_type=wf.PlaceholderType.STR)
    
    labeling = wf.steps.LabelingStep(
        name="labeling", # Name of the labeling phase. The name contains a maximum of 64 characters, including only letters, digits, underscores (_), and hyphens (-). It must start with a letter and must be unique in a workflow.
        title="Dataset Labeling", # Title, which defaults to the value of name
        properties=wf.steps.LabelTaskProperties(
            task_type=wf.data.LabelTaskTypeEnum.IMAGE_CLASSIFICATION,   # Labeling job type, for example, image classification
            task_name=task_name   # If the labeling job name does not exist, a job will be created using this name. If the labeling job name exists, the corresponding job will be used.
        ),
        inputs=wf.steps.LabelingInput(name="input_name", data=dataset), # LabelingStep inputs. The dataset object is configured when the workflow is running. You can also use wf.data.Dataset(dataset_name="fake_dataset_name") for the data field.
        outputs=wf.steps.LabelingOutput(name="output_name"), # LabelingStep outputs
    )
    
    workflow = wf.Workflow(
        name="labeling-step-demo",
        desc="this is a demo workflow",
        steps=[labeling]
    )
  • Scenario 2: Labeling a specified job

    Use cases:

    • You have created a labeling job and need to label it when the workflow is running.
    • After a dataset is imported, the dataset needs to be labeled.
    Data preparation: Create a labeling job using a specified dataset on the ModelArts console.
    from modelarts import workflow as wf
    # Input a labeling job and label it.
    
    # Define a dataset labeling job.
    label_task = wf.data.LabelTaskPlaceholder(name="label_task_placeholder_name")
    
    labeling = wf.steps.LabelingStep(
        name="labeling", # Name of the labeling phase. The name contains a maximum of 64 characters, including only letters, digits, underscores (_), and hyphens (-). It must start with a letter and must be unique in a workflow.
        title="Dataset Labeling", # Title, which defaults to the value of name
        inputs=wf.steps.LabelingInput(name="input_name", data=label_task), # LabelingStep inputs. The labeling job object is configured when the workflow is running. You can also use wf.data.LabelTask(dataset_name="dataset_name", task_name="label_task_name") for the data field.
        outputs=wf.steps.LabelingOutput(name="output_name"), # LabelingStep outputs
    )
    
    workflow = wf.Workflow(
        name="labeling-step-demo",
        desc="this is a demo workflow",
        steps=[labeling]
    )
  • Scenario 3: Creating a labeling job based on the output of the dataset creation phase

    Scenario: The outputs of the dataset creation phase are used as the inputs of the labeling phase.

    from modelarts import workflow as wf
    
    # Define a placeholder for the dataset output path.
    dataset_output_path = wf.Placeholder(name="dataset_output_path", placeholder_type=wf.PlaceholderType.STR, placeholder_format="obs")
    
    # Define the dataset name.
    dataset_name = wf.Placeholder(name="dataset_name", placeholder_type=wf.PlaceholderType.STR)
    
    create_dataset = wf.steps.CreateDatasetStep(
        name="create_dataset", # Name of a dataset creation phase. The name contains a maximum of 64 characters, including only letters, digits, underscores (_), and hyphens (-). It must start with a letter and must be unique in a workflow.
        title="Dataset Creation", # Title, which defaults to the value of name
        inputs=wf.steps.CreateDatasetInput(name="input_name", data=wf.data.OBSPlaceholder(name="obs_placeholder_name", object_type="directory")), # CreateDatasetStep inputs, configured when the workflow is running. You can also use wf.data.OBSPath(obs_path="fake_obs_path") for the data field.
        outputs=wf.steps.CreateDatasetOutput(name="create_dataset_output", config=wf.data.OBSOutputConfig(obs_path=dataset_output_path)), # CreateDatasetStep outputs
        properties=wf.steps.DatasetProperties(
            dataset_name=dataset_name, # If the dataset name does not exist, a dataset will be created using this name. If the dataset name exists, the corresponding dataset will be used.
            data_type=wf.data.DataTypeEnum.IMAGE, # Data type of the dataset, for example, image
        )
    )
    
    # Define a placeholder for the labeling job name.
    task_name = wf.Placeholder(name="placeholder_name", placeholder_type=wf.PlaceholderType.STR)
    
    labeling = wf.steps.LabelingStep(
        name="labeling", # Name of the labeling phase. The name contains a maximum of 64 characters, including only letters, digits, underscores (_), and hyphens (-). It must start with a letter and must be unique in a workflow.
        title="Dataset Labeling", # Title, which defaults to the value of name
        properties=wf.steps.LabelTaskProperties(
            task_type=wf.data.LabelTaskTypeEnum.IMAGE_CLASSIFICATION,   # Labeling job type, for example, image classification
            task_name=task_name   # If the labeling job name does not exist, a job will be created using this name. If the labeling job name exists, the corresponding job will be used.
        ),
        inputs=wf.steps.LabelingInput(name="input_name", data=create_dataset.outputs["create_dataset_output"].as_input()), # LabelingStep inputs. The data source is the outputs of the dataset creation phase.
        outputs=wf.steps.LabelingOutput(name="output_name"), # LabelingStep outputs
        depend_steps=create_dataset # Preceding dataset creation phase
    )
    # create_dataset is an instance of wf.steps.CreateDatasetStep. create_dataset_output is the name field value of wf.steps.CreateDatasetOutput.
    
    workflow = wf.Workflow(
        name="labeling-step-demo",
        desc="this is a demo workflow",
        steps=[create_dataset, labeling]
    )