Updated on 2024-05-30 GMT+08:00

Creating a Dataset

Function

This API is used to create a dataset.

Debugging

You can debug this API through automatic authentication in API Explorer or use the SDK sample code generated by API Explorer.

URI

POST /v2/{project_id}/datasets

Table 1 Path Parameters

Parameter

Mandatory

Type

Description

project_id

Yes

String

Project ID. For details about how to obtain a project ID, see Obtaining a Project ID and Name.
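
As a minimal sketch, the request URL can be assembled from the URI template above. The endpoint host used here is a placeholder assumption and is not given in this document; substitute the endpoint of your region.

```python
from urllib.parse import quote


def dataset_create_url(endpoint: str, project_id: str) -> str:
    """Return the URL for POST /v2/{project_id}/datasets."""
    if not project_id:
        raise ValueError("project_id is mandatory")
    # Percent-encode the path segment so unexpected characters cannot alter the path.
    return f"{endpoint.rstrip('/')}/v2/{quote(project_id, safe='')}/datasets"


# Hypothetical endpoint and project ID, for illustration only.
print(dataset_create_url("https://modelarts.example.com", "0123456789abcdef"))
```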

Request Parameters

Table 2 Request body parameters

Parameter

Mandatory

Type

Description

data_format

No

String

Data format. Options:

  • Default: default format

  • CarbonData: CarbonData (supported only by table datasets)

data_sources

Yes

Array of DataSource objects

Input dataset path, which is used to synchronize source data (such as images, text files, and audio files) in the directory and its subdirectories to the dataset. For a table dataset, this parameter indicates the import directory. The work directory of a table dataset cannot be an OBS path in a KMS-encrypted bucket. Only one data source can be imported at a time.

dataset_name

Yes

String

Dataset name. The value contains 1 to 100 characters. Only letters, digits, underscores (_), and hyphens (-) are allowed, for example, dataset-9f3b.

dataset_type

No

Integer

Dataset type. Options:

  • 0: image classification

  • 1: object detection

  • 3: image segmentation

  • 100: text classification

  • 101: named entity recognition

  • 102: text triplet

  • 200: sound classification

  • 201: speech content

  • 202: speech paragraph labeling

  • 400: table dataset

  • 600: video labeling

  • 900: custom format

description

No

String

Dataset description. The value is empty by default. The description contains 0 to 256 characters and does not support the following special characters: ^!<>=&"'

import_annotations

No

Boolean

Whether to automatically import the annotation information in the input directory. This is supported for object detection, image classification, and text classification. Options:

  • true: Import the annotation information in the input directory. (Default value)

  • false: Do not import the annotation information in the input directory.

import_data

No

Boolean

Whether to import data. This parameter is used only for table datasets. Options:

  • true: Import data when creating a dataset.

  • false: Do not import data when creating a dataset. (Default value)

label_format

No

LabelFormat object

Label format information. This parameter is used only for text datasets.

labels

No

Array of Label objects

Dataset label list.

managed

No

Boolean

Whether to host a dataset. Options:

  • true: Host a dataset.

  • false: Do not host a dataset. (Default value)

schema

No

Array of Field objects

Schema list.

work_path

Yes

String

Output dataset path, which is used to store output files such as label files.

  • The format is /Bucket name/File path, for example, /obs-bucket/flower/rose/. (The directory is used as the path.)

  • A bucket cannot be directly used as a path.

  • The output dataset path must differ from the input dataset path and cannot be a subdirectory of it.

  • The value contains 3 to 700 characters.

work_path_type

Yes

Integer

Type of the dataset output path. The default value is 0, indicating an OBS bucket.

workforce_information

No

WorkforceInformation object

Team labeling information.

workspace_id

No

String

Workspace ID. If no workspace is created, the default value is 0. If a workspace is created and used, use the actual value.
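
The constraints in Table 2 can be checked on the client before the request is sent. The following is a rough sketch, not the service's own validation: the function name check_create_body is hypothetical, "letters" is assumed to mean ASCII letters, and only the rules stated above for dataset_name, description, and work_path are covered.

```python
import re

# Assumption: "letters" in the dataset_name rule means ASCII letters.
NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,100}")
BAD_DESCRIPTION_CHARS = set("^!<>=&\"'")
MANDATORY = ("data_sources", "dataset_name", "work_path", "work_path_type")


def check_create_body(body: dict) -> list:
    """Return a list of problems found in a request body; empty if none."""
    problems = [f"missing mandatory parameter: {k}" for k in MANDATORY if k not in body]

    name = body.get("dataset_name")
    if name is not None and not NAME_RE.fullmatch(name):
        problems.append("dataset_name: use 1 to 100 letters, digits, '_' or '-'")

    description = body.get("description", "")
    if len(description) > 256 or BAD_DESCRIPTION_CHARS & set(description):
        problems.append("description: at most 256 characters, none of ^!<>=&\"'")

    work_path = body.get("work_path")
    if work_path is not None:
        if not 3 <= len(work_path) <= 700:
            problems.append("work_path: length must be 3 to 700 characters")
        # "/bucket/dir/" has at least two non-empty segments; a bare bucket has one.
        segments = [p for p in work_path.split("/") if p]
        if not work_path.startswith("/") or len(segments) < 2:
            problems.append("work_path: expected /bucket-name/file-path, not a bare bucket")
        out_dir = work_path.rstrip("/") + "/"
        for source in body.get("data_sources", []):
            data_path = source.get("data_path")
            if data_path and out_dir.startswith(data_path.rstrip("/") + "/"):
                problems.append("work_path: must not be the input path or inside it")
    return problems
```

An empty list only means that none of these local checks failed; the service may still reject the request for other reasons.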

Table 3 DataSource

Parameter

Mandatory

Type

Description

data_path

No

String

Data source path.

data_type

No

Integer

Data type. Options:

  • 0: OBS bucket (default value)

  • 1: GaussDB(DWS)

  • 2: DLI

  • 3: RDS

  • 4: MRS

  • 5: AI Gallery

  • 6: Inference service

schema_maps

No

Array of SchemaMap objects

Schema mapping information corresponding to the table data.

source_info

No

SourceInfo object

Information required for importing a table data source.

with_column_header

No

Boolean

Whether the first row in the file contains column names. This field applies to table datasets. Options:

  • true: The first row in the file is the column name.

  • false: The first row in the file is not the column name.
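
To keep the numeric data_type codes readable in client code, they can be wrapped in an enumeration. This is an illustrative sketch: the names DataSourceType and obs_table_source are hypothetical, and only the codes come from Table 3.

```python
from enum import IntEnum


class DataSourceType(IntEnum):
    """data_type codes listed in Table 3."""
    OBS = 0
    GAUSSDB_DWS = 1
    DLI = 2
    RDS = 3
    MRS = 4
    AI_GALLERY = 5
    INFERENCE_SERVICE = 6


def obs_table_source(data_path: str, with_column_header: bool) -> dict:
    """Build one DataSource entry for table data stored in OBS."""
    return {
        "data_type": int(DataSourceType.OBS),
        "data_path": data_path,
        "with_column_header": with_column_header,
    }


print(obs_table_source("/test-obs/table/input/", True))
```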

Table 4 SchemaMap

Parameter

Mandatory

Type

Description

dest_name

No

String

Name of the destination column.

src_name

No

String

Name of the source column.

Table 5 SourceInfo

Parameter

Mandatory

Type

Description

cluster_id

No

String

MRS cluster ID. You can log in to the MRS console to view the information.

cluster_mode

No

String

Running mode of an MRS cluster. Options:

  • 0: normal cluster

  • 1: security cluster

cluster_name

No

String

MRS cluster name. You can log in to the MRS console to view the information.

database_name

No

String

Name of the database to which the table dataset is imported.

input

No

String

HDFS path of the table dataset, for example, /datasets/demo.

ip

No

String

IP address of your GaussDB(DWS) cluster.

port

No

String

Port number of your GaussDB(DWS) cluster.

queue_name

No

String

DLI queue name of a table dataset.

subnet_id

No

String

Subnet ID of an MRS cluster.

table_name

No

String

Name of the table to which a table dataset is imported.

user_name

No

String

Username, which is mandatory for GaussDB(DWS) data.

user_password

No

String

User password, which is mandatory for GaussDB(DWS) data.

vpc_id

No

String

ID of the VPC where an MRS cluster resides.
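
As a hedged illustration of Table 5, the sketch below assembles a source_info object for GaussDB(DWS) data. Which fields the service needs together for a given source type is an assumption here; the table only states that user_name and user_password are mandatory for GaussDB(DWS) data and that port is a String. The function name and all values are hypothetical.

```python
def dws_source_info(ip, port, database_name, table_name, user_name, user_password):
    """Build a SourceInfo dict for a GaussDB(DWS) table data source."""
    if not user_name or not user_password:
        raise ValueError("user_name and user_password are mandatory for GaussDB(DWS) data")
    return {
        "ip": ip,
        "port": str(port),  # Table 5 declares port as a String
        "database_name": database_name,
        "table_name": table_name,
        "user_name": user_name,
        "user_password": user_password,
    }


# Placeholder values; read real credentials from a secure store, not from source code.
info = dws_source_info("192.0.2.10", 8000, "demo_db", "demo_table", "demo_user", "demo_password")
data_source = {"data_type": 1, "source_info": info}  # 1: GaussDB(DWS)
```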

Table 6 LabelFormat

Parameter

Mandatory

Type

Description

label_type

No

String

Label type of text classification. Options:

  • 0: The label is separated from the text, and they are distinguished by the fixed suffix _result. For example, the text file is abc.txt, and the label file is abc_result.txt.

  • 1: Default value. Labels and texts are stored in the same file and separated by separators. You can use text_sample_separator to specify the separator between the text and label and text_label_separator to specify the separator between labels.

text_label_separator

No

String

Separator between labels. By default, a comma (,) is used as the separator. The separator needs to be escaped. The separator can contain only one character, such as a letter, a digit, or any of the following special characters: !@#$%^&*_=|?/':.;,

text_sample_separator

No

String

Separator between the text and label. By default, the Tab key is used as the separator. The separator needs to be escaped. The separator can contain only one character, such as a letter, a digit, or any of the following special characters: !@#$%^&*_=|?/':.;,
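
The two label_type layouts can be sketched as follows. For label_type 0, the label file name is derived from the text file name with the _result suffix. For label_type 1, the sketch assumes that each line holds the text first, then the text_sample_separator, then the labels joined by text_label_separator; this ordering is an assumption, as are the function names.

```python
import os


def label_file_for(text_file: str) -> str:
    """label_type 0: abc.txt is paired with abc_result.txt."""
    stem, ext = os.path.splitext(text_file)
    return f"{stem}_result{ext}"


def split_sample(line: str, text_sample_separator: str = "\t",
                 text_label_separator: str = ","):
    """label_type 1: split one line into (text, [labels])."""
    text, found, label_part = line.rstrip("\n").partition(text_sample_separator)
    # A line without the separator is treated as unlabeled text.
    labels = [l for l in label_part.split(text_label_separator) if l] if found else []
    return text, labels


print(label_file_for("abc.txt"))
print(split_sample("great movie\tpositive,drama\n"))
```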

Table 7 Label

Parameter

Mandatory

Type

Description

attributes

No

Array of LabelAttribute objects

Multi-dimensional attribute of a label. For example, if the label is music, attributes such as style and artist may be included.

name

No

String

Label name.

property

No

LabelProperty object

Basic attribute key-value pair of a label, such as color and shortcut keys.

type

No

Integer

Label type. Options:

  • 0: image classification

  • 1: object detection

  • 3: image segmentation

  • 100: text classification

  • 101: named entity recognition

  • 102: text triplet relationship

  • 103: text triplet entity

  • 200: sound classification

  • 201: speech content

  • 202: speech paragraph labeling

  • 600: video labeling

Table 8 LabelAttribute

Parameter

Mandatory

Type

Description

default_value

No

String

Default value of a label attribute.

id

No

String

Label attribute ID. You can obtain it by querying the label list.

name

No

String

Label attribute name. The value contains a maximum of 64 characters and does not support the following special characters: <>=&"'

type

No

String

Label attribute type. Options:

  • text: text

  • select: single-choice drop-down list

values

No

Array of LabelAttributeValue objects

List of label attribute values.

Table 9 LabelAttributeValue

Parameter

Mandatory

Type

Description

id

No

String

Label attribute value ID.

value

No

String

Label attribute value.

Table 10 LabelProperty

Parameter

Mandatory

Type

Description

@modelarts:color

No

String

Default attribute: Label color, which is a hexadecimal code of the color. By default, this parameter is left blank. Example: #FFFFF0.

@modelarts:default_shape

No

String

Default attribute: Default shape of an object detection label (dedicated attribute). By default, this parameter is left blank. Options:

  • bndbox: rectangle

  • polygon: polygon

  • circle: circle

  • line: straight line

  • dashed: dashed line

  • point: point

  • polyline: polyline

@modelarts:from_type

No

String

Default attribute: Type of the head entity in the triplet relationship label. This attribute must be specified when a relationship label is created. This parameter is used only for the text triplet dataset.

@modelarts:rename_to

No

String

Default attribute: The new name of the label.

@modelarts:shortcut

No

String

Default attribute: Label shortcut key. By default, this parameter is left blank. For example: D.

@modelarts:to_type

No

String

Default attribute: Type of the tail entity in the triplet relationship label. This attribute must be specified when a relationship label is created. This parameter is used only for the text triplet dataset.
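
A Label object with its LabelProperty can be built as in the following sketch. The helper name make_label is hypothetical, and the six-digit hexadecimal pattern is an assumption inferred from the #FFFFF0 example; the service may accept other forms.

```python
import re

# Assumption: colors are six-digit hex codes such as #FFFFF0.
HEX_COLOR = re.compile(r"#[0-9A-Fa-f]{6}")
SHAPES = {"bndbox", "polygon", "circle", "line", "dashed", "point", "polyline"}


def make_label(name, label_type, color=None, default_shape=None, shortcut=None):
    """Build one entry of the labels array, adding only the properties given."""
    prop = {}
    if color is not None:
        if not HEX_COLOR.fullmatch(color):
            raise ValueError(f"not a hexadecimal color code: {color!r}")
        prop["@modelarts:color"] = color
    if default_shape is not None:
        if default_shape not in SHAPES:
            raise ValueError(f"unknown shape: {default_shape!r}")
        prop["@modelarts:default_shape"] = default_shape
    if shortcut is not None:
        prop["@modelarts:shortcut"] = shortcut
    label = {"name": name, "type": label_type}
    if prop:
        label["property"] = prop
    return label


# Type 1 is object detection, so a default shape applies.
print(make_label("Rabbits", 1, color="#3399ff", default_shape="bndbox"))
```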

Table 11 Field

Parameter

Mandatory

Type

Description

description

No

String

Schema description.

name

No

String

Schema name.

schema_id

No

Integer

Schema ID.

type

No

String

Schema value type.

Table 12 WorkforceInformation

Parameter

Mandatory

Type

Description

data_sync_type

No

Integer

Synchronization type. Options:

  • 0: not to be synchronized

  • 1: data to be synchronized

  • 2: label to be synchronized

  • 3: data and label to be synchronized

repetition

No

Integer

Number of persons who label each sample. The minimum value is 1.

synchronize_auto_labeling_data

No

Boolean

Whether to synchronously update auto labeling data. Options:

  • true: Update auto labeling data synchronously.

  • false: Do not update auto labeling data synchronously.

synchronize_data

No

Boolean

Whether to synchronize updated data, such as uploading files, synchronizing data sources, and assigning imported unlabeled files to team members. Options:

  • true: Synchronize updated data to team members.

  • false: Do not synchronize updated data to team members.

task_id

No

String

ID of a team labeling task.

task_name

Yes

String

Name of a team labeling task. The name contains 1 to 64 characters, including only letters, digits, underscores (_), and hyphens (-).

workforces_config

No

WorkforcesConfig object

Personnel assignment of a team labeling task. You can delegate the assignment to the administrator or assign team members yourself.

Table 13 WorkforcesConfig

Parameter

Mandatory

Type

Description

agency

No

String

Administrator.

workforces

No

Array of WorkforceConfig objects

List of teams that execute labeling tasks.

Table 14 WorkforceConfig

Parameter

Mandatory

Type

Description

workers

No

Array of Worker objects

List of labeling team members.

workforce_id

No

String

ID of a labeling team.

workforce_name

No

String

Name of a labeling team. The value contains 0 to 1024 characters and does not support the following special characters: !<>=&"'

Table 15 Worker

Parameter

Mandatory

Type

Description

create_time

No

Long

Creation time.

description

No

String

Labeling team member description. The value contains 0 to 256 characters and does not support the following special characters: ^!<>=&"'

email

No

String

Email address of a labeling team member.

role

No

Integer

Role. Options:

  • 0: labeling personnel

  • 1: reviewer

  • 2: team administrator

  • 3: dataset owner

status

No

Integer

Current login status of a labeling team member. Options:

  • 0: The invitation email has not been sent.

  • 1: The invitation email has been sent but the user has not logged in.

  • 2: The user has logged in.

  • 3: The labeling team member has been deleted.

update_time

No

Long

Update time.

worker_id

No

String

ID of a labeling team member.

workforce_id

No

String

ID of a labeling team.
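
Tables 12 to 15 nest as workforce_information, then workforces_config, then workforces, then workers. The sketch below shows that nesting only; the helper name and all IDs and email addresses are hypothetical, "letters" is assumed to mean ASCII letters, and which optional fields the service expects in practice is not stated in this document.

```python
import re

# task_name: 1 to 64 letters, digits, underscores, and hyphens (ASCII letters assumed).
TASK_NAME_RE = re.compile(r"[A-Za-z0-9_-]{1,64}")


def team_labeling_info(task_name, workforce_id, worker_emails, repetition=1):
    """Build a WorkforceInformation dict with one team and its members."""
    if not TASK_NAME_RE.fullmatch(task_name):
        raise ValueError("task_name: use 1 to 64 letters, digits, '_' or '-'")
    if repetition < 1:
        raise ValueError("repetition: the minimum value is 1")
    return {
        "task_name": task_name,
        "repetition": repetition,
        "workforces_config": {
            "workforces": [{
                "workforce_id": workforce_id,
                # role 0: labeling personnel
                "workers": [{"email": email, "role": 0} for email in worker_emails],
            }],
        },
    }


info = team_labeling_info("task-01", "example-workforce-id", ["a@example.com"])
```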

Response Parameters

Status code: 201

Table 16 Response body parameters

Parameter

Type

Description

dataset_id

String

Dataset ID.

error_code

String

Error code.

error_msg

String

Error message.

import_task_id

String

ID of an import task.

Example Requests

  • Creating an Image Classification Dataset

    {
      "workspace_id" : "0",
      "dataset_name" : "dataset-457f",
      "dataset_type" : 0,
      "data_sources" : [ {
        "data_type" : 0,
        "data_path" : "/test-obs/classify/input/animals/"
      } ],
      "description" : "",
      "work_path" : "/test-obs/classify/output/",
      "work_path_type" : 0,
      "labels" : [ {
        "name" : "Rabbits",
        "type" : 0,
        "property" : {
          "@modelarts:color" : "#3399ff"
        }
      }, {
        "name" : "Bees",
        "type" : 0,
        "property" : {
          "@modelarts:color" : "#3399ff"
        }
      } ]
    }

  • Creating an Object Detection Dataset

    {
      "workspace_id" : "0",
      "dataset_name" : "dataset-95a6",
      "dataset_type" : 1,
      "data_sources" : [ {
        "data_type" : 0,
        "data_path" : "/test-obs/detect/input/animals/"
      } ],
      "description" : "",
      "work_path" : "/test-obs/detect/output/",
      "work_path_type" : 0,
      "labels" : [ {
        "name" : "Rabbits",
        "type" : 1,
        "property" : {
          "@modelarts:color" : "#3399ff"
        }
      }, {
        "name" : "Bees",
        "type" : 1,
        "property" : {
          "@modelarts:color" : "#3399ff"
        }
      } ]
    }

  • Creating a Table Dataset

    {
      "workspace_id" : "0",
      "dataset_name" : "dataset-de83",
      "dataset_type" : 400,
      "data_sources" : [ {
        "data_type" : 0,
        "data_path" : "/test-obs/table/input/",
        "with_column_header" : true
      } ],
      "description" : "",
      "work_path" : "/test-obs/table/output/",
      "work_path_type" : 0,
      "schema" : [ {
        "schema_id" : 1,
        "name" : "150",
        "type" : "STRING"
      }, {
        "schema_id" : 2,
        "name" : "4",
        "type" : "STRING"
      }, {
        "schema_id" : 3,
        "name" : "setosa",
        "type" : "STRING"
      }, {
        "schema_id" : 4,
        "name" : "versicolor",
        "type" : "STRING"
      }, {
        "schema_id" : 5,
        "name" : "virginica",
        "type" : "STRING"
      } ],
      "import_data" : true
    }
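
A request body like the ones above could be wrapped in an HTTP request as sketched below, using only the Python standard library. The URL host and the token value are placeholders, and passing the token in an X-Auth-Token header is an assumption about the authentication scheme that this document does not describe. The request is built but not sent.

```python
import json
import urllib.request


def build_create_request(url: str, token: str, body: dict) -> urllib.request.Request:
    """Return a POST request carrying the JSON-encoded body."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed authentication header
        },
    )


body = {
    "workspace_id": "0",
    "dataset_name": "dataset-457f",
    "dataset_type": 0,
    "data_sources": [{"data_type": 0, "data_path": "/test-obs/classify/input/animals/"}],
    "work_path": "/test-obs/classify/output/",
    "work_path_type": 0,
}
# Hypothetical endpoint and project ID.
request = build_create_request(
    "https://modelarts.example.com/v2/0123456789abcdef/datasets", "example-token", body
)
# To send it: urllib.request.urlopen(request)
```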

Example Responses

Status code: 201

Created

{
  "dataset_id" : "WxCREuCkBSAlQr9xrde"
}
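
The response can be handled along these lines. This is a sketch under two assumptions: that error responses carry error_code and error_msg in a JSON body, as Table 16 suggests, and that any status other than 201 is a failure. The exception class and function name are hypothetical.

```python
import json


class DatasetCreateError(Exception):
    """Raised when the dataset was not created."""

    def __init__(self, status, error_code, error_msg):
        super().__init__(f"HTTP {status}: {error_code}: {error_msg}")
        self.status = status
        self.error_code = error_code
        self.error_msg = error_msg


def parse_create_response(status: int, body_text: str):
    """Return (dataset_id, import_task_id) on 201, else raise DatasetCreateError."""
    try:
        payload = json.loads(body_text) if body_text else {}
    except ValueError:
        payload = {}  # non-JSON body, for example from a proxy
    if not isinstance(payload, dict):
        payload = {}
    if status == 201 and "dataset_id" in payload:
        return payload["dataset_id"], payload.get("import_task_id")
    raise DatasetCreateError(status, payload.get("error_code"), payload.get("error_msg"))


print(parse_create_response(201, '{ "dataset_id" : "WxCREuCkBSAlQr9xrde" }'))
```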

Status Codes

Status Code

Description

201

Created

401

Unauthorized

403

Forbidden

404

Not Found

Error Codes

See Error Codes.