Creating a Dataset

Updated on 2024-05-06 GMT+08:00

Before using ModelArts to manage data, create a dataset. You can then perform operations on the dataset, such as labeling data, importing data, and publishing the dataset. This section describes how to create non-table datasets (image, audio, text, video, and free format) and table datasets.

Prerequisites

  • You have been authorized to access OBS. To obtain authorization, go to the Settings page in the navigation pane of the ModelArts management console and add access authorization using an agency.
  • OBS buckets and folders for storing data are available, and the OBS buckets are in the same region as ModelArts. OBS parallel file systems are not supported; select standard object storage.
  • The OBS buckets are not encrypted. ModelArts does not support encrypted OBS buckets, so do not enable bucket encryption when creating a bucket.
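
If you want to check these prerequisites programmatically, the following sketch uses the OBS Python SDK (esdk-obs-python) to confirm a bucket's region and whether server-side encryption is configured. It is only a minimal sketch: the endpoint, credentials, and bucket name are placeholders, and the exact response fields should be verified against the OBS SDK reference for your version.

    # Minimal pre-check sketch (assumes the esdk-obs-python package is installed).
    from obs import ObsClient

    client = ObsClient(
        access_key_id="YOUR_AK",            # placeholder credentials
        secret_access_key="YOUR_SK",
        server="https://obs.ap-southeast-1.myhuaweicloud.com",  # example endpoint
    )
    bucket = "my-modelarts-data"            # hypothetical bucket name

    # The bucket must be in the same region as the ModelArts console you use.
    location = client.getBucketLocation(bucket)
    if location.status < 300:
        print("Bucket region:", location.body.location)

    # ModelArts cannot read encrypted buckets; a usable bucket should have no
    # encryption configuration (this call then returns a non-2xx status).
    encryption = client.getBucketEncryption(bucket)
    print("Encryption configured:", encryption.status < 300)

    client.close()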

Image, Audio, Text, Video, and Free Format

  1. Log in to the ModelArts management console. In the navigation pane, choose Data Management > Datasets.
  2. Click Create. On the Create Dataset page, create a dataset based on the data type and data labeling requirements.
    Figure 1 Parameter settings
    • Name: name of the dataset, which is customizable
    • Description: details about the dataset
    • Data Type: Select a data type based on your needs.
    • Data Source
      1. Importing data from OBS

        If data is available in OBS, select OBS for Data Source, and set Import Mode, Import Path, Labeling Status, and Labeling Format (mandatory when Labeling Status is set to Labeled). The labeling formats of the input data vary depending on the dataset type. For details about the labeling formats supported by ModelArts, see Importing Data.

      2. Importing data from a local path

        If the data is not stored in OBS and cannot be downloaded from AI Gallery, ModelArts allows you to upload it from a local path. Before uploading data, configure Storage Path and Labeling Status, then click Upload data and select the local files to upload. If Labeling Status is set to Labeled, also select a labeling format. The labeling formats of the input data vary depending on the dataset type. For details about the labeling formats supported by ModelArts, see Importing Data.

        Figure 2 Selecting Local file
    • For more details about parameters, see Table 1.
      Table 1 Dataset parameters

      Import Path

      OBS path from which your data is to be imported. This path is used as the data storage path of the dataset.

      NOTE:
      • OBS parallel file systems are not supported. Select an OBS bucket.
      • When you create a dataset, data in the OBS path is imported to the dataset. If you later modify the data in OBS, the data in the dataset becomes inconsistent with that in OBS, and certain data may be unavailable. To modify data in a dataset, follow the operations provided in Import Mode or Importing Data from an OBS Path.
      • If the numbers of samples and labels of the dataset exceed the quotas, importing the samples and labels will fail.

      Labeling Status

      Labeling status of the selected data, which can be Unlabeled or Labeled.

      If you select Labeled, specify a labeling format and ensure that the data file complies with the format specifications. Otherwise, the import may fail.

      Only image (object detection, image classification, and image segmentation), audio (sound classification), and text (text classification) labeling tasks support importing labeled data.

      Output Dataset Path

      OBS path where your labeled data is stored.

      NOTE:
      • Ensure that your OBS path name contains only letters, digits, and underscores (_) and does not contain special characters such as ~'@#$%^&*{}[]:;+=<>/ or spaces.
      • The dataset output path cannot be the same as the data input path or a subdirectory of the data input path.
      • It is a good practice to select an empty directory as the dataset output path.
      • OBS parallel file systems are not supported. Select an OBS bucket.
  3. After setting the parameters, click Submit.
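
The steps above use the console. A dataset of this kind can also be created by calling the ModelArts dataset API. The sketch below is only an illustration: the endpoint, project ID, IAM token, and OBS paths are placeholders, and the request path and field names (including the dataset_type and data_type codes) are assumptions to verify against the ModelArts API Reference before use.

    # Hedged sketch of creating a dataset through the ModelArts REST API.
    import requests

    endpoint = "https://modelarts.ap-southeast-1.myhuaweicloud.com"  # example region
    project_id = "YOUR_PROJECT_ID"
    token = "YOUR_IAM_TOKEN"        # obtained from IAM; placeholder here

    body = {
        "dataset_name": "dataset-images-demo",      # Name in the console
        "dataset_type": 0,                          # assumed code for an image dataset
        "data_sources": [
            {"data_type": 0,                        # assumed code for an OBS source
             "data_path": "/my-bucket/dataset-input/"}   # Import Path
        ],
        "work_path": "/my-bucket/dataset-output/",  # Output Dataset Path
        "work_path_type": 0,                        # assumed code for an OBS path
        "description": "Created via API",
    }

    resp = requests.post(
        f"{endpoint}/v2/{project_id}/datasets",
        json=body,
        headers={"X-Auth-Token": token},
    )
    print(resp.status_code, resp.json())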

Table

  1. Log in to the ModelArts management console. In the navigation pane, choose Data Management > Datasets.
  2. Click Create. On the Create Dataset page, create a table dataset based on the data type and data labeling requirements.
    Figure 3 Parameters of a table dataset
    • Name: name of the dataset, which is customizable
    • Description: details about the dataset
    • Data Type: Select a data type based on your needs.
    • For more details about parameters, see Table 2.
      Table 2 Dataset parameters

      Data Source (OBS)

      • File Path: Browse all OBS buckets of the account and select the directory where the data file to be imported is located.
      • Contain Table Header: This setting is enabled by default, indicating that the imported file contains a table header.
        • If the original table contains a table header and this setting is enabled, the first row (table header) of the imported file is used as the column names. You do not need to modify the schema information.
        • If the original table does not contain a table header, disable this setting and change the column names in Schema to attr_1, attr_2, ..., attr_n, where attr_n is the last column and indicates the prediction column.

      For details about OBS functions, see Object Storage Service Console Operation Guide.

      Data Source (DWS)

      • Cluster Name: All DWS clusters of the current account are automatically displayed. Select the required DWS cluster from the drop-down list.
      • Database Name: Enter the name of the database where the data is located based on the selected DWS cluster.
      • Table Name: Enter the name of the table where the data is located based on the selected database.
      • User Name: Enter the username of the DWS cluster administrator.
      • Password: Enter the password of the DWS cluster administrator.

      For details about DWS functions, see Data Warehouse Service User Guide.

      NOTE:

      To import data from DWS, use DLI functions. If you do not have the permission to access DLI, create a DLI agency as prompted.

      Data Source (DLI)

      • Queue Name: All DLI queues of the current account are automatically displayed. Select the required queue from the drop-down list.
      • Database Name: All databases are displayed based on the selected queue. Select the required database from the drop-down list.
      • Table Name: All tables in the selected database are displayed. Select the required table from the drop-down list.

      For details about DLI functions, see Data Lake Insight User Guide.

      Data Source (MRS)

      • Cluster Name: All MRS clusters of the current account are automatically displayed. However, streaming clusters do not support data import. Select the required cluster from the drop-down list.
      • File Path: Enter the HDFS file path based on the selected cluster.
      • Contain Table Header: If this setting is enabled, the imported file contains table headers.

      For details about MRS functions, see MapReduce Service User Guide.

      Local file

      Storage Path: Select an OBS path.

      Schema

      Names and types of table columns, which must be the same as those of the imported data. Set the column name based on the imported data and select the column type. For details about the supported types, see Table 3.

      Click Add Schema to add a new record. When creating a dataset, you must specify a schema. Once created, the schema cannot be modified.

      When data is imported from OBS, the schema of the CSV file in the file path is automatically obtained. If the schemas of multiple CSV files are inconsistent, an error will be reported.

      NOTE:

      After you select data from OBS, the column names in Schema are automatically displayed, using the first row of the table by default. To ensure that the prediction code works correctly, change the column names in Schema to attr_1, attr_2, ..., attr_n, where attr_n is the last column and indicates the prediction column.

      Output Dataset Path

      OBS path for storing table data. The data imported from the data source is stored in this path. The path cannot be the same as the file path in the OBS data source or a subdirectory of that path.

      After a table dataset is created, the following four directories are automatically generated in the storage path:

      • annotation: version publishing directory. Each time a version is published, a subdirectory with the same name as the version is generated in this directory.
      • data: data storage directory. Imported data is stored in this directory.
      • logs: directory for storing logs.
      • temp: temporary working directory.
      Table 3 Schema data types

      Type        Description                                                  Storage Space   Range
      String      String type                                                  N/A             N/A
      Short       Signed integer                                               2 bytes         -32768 to 32767
      Int         Signed integer                                               4 bytes         -2147483648 to 2147483647
      Long        Signed integer                                               8 bytes         -9223372036854775808 to 9223372036854775807
      Double      Double-precision floating point                              8 bytes         N/A
      Float       Single-precision floating point                              4 bytes         N/A
      Byte        Signed integer                                               1 byte          -128 to 127
      Date        Date in the format "yyyy-MM-dd", for example, 2014-05-29    N/A             N/A
      Timestamp   Date and time in the format "yyyy-MM-dd HH:mm:ss"            N/A             N/A
      Boolean     Boolean type                                                 1 byte          TRUE/FALSE

      NOTE:

      When using a CSV file, pay attention to the following:

      • When the data type is set to String, data enclosed in double quotation marks is regarded as one record by default. Ensure that the double quotation marks in the same row are closed. Otherwise, the data may be parsed as one oversized record and become too large to display.
      • If the number of columns in a row of the CSV file differs from that defined in the schema, the row is ignored. (See the preparation sketch after this procedure.)
  3. After setting the parameters, click Submit.
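
Because rows whose column count does not match the schema are ignored and the prediction code expects attr_1 ... attr_n column names, it can help to check and normalize the CSV file before importing it. The sketch below uses only the Python standard library; the file names and column count are placeholders you would replace with your own.

    # Preparation sketch for a table (CSV) dataset.
    import csv

    SRC = "raw_table.csv"         # hypothetical input file
    DST = "table_for_import.csv"  # file to upload to OBS and import
    EXPECTED_COLUMNS = 5          # must match the schema defined in the console
    HAS_HEADER = True             # set to False if SRC has no header row

    with open(SRC, newline="") as f_in, open(DST, "w", newline="") as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)

        if HAS_HEADER:
            next(reader, None)    # drop the original header; attr_ names replace it

        # Header row: attr_1 ... attr_n, where attr_n is the prediction column.
        writer.writerow([f"attr_{i}" for i in range(1, EXPECTED_COLUMNS + 1)])

        for line_no, row in enumerate(reader, start=1):
            if len(row) != EXPECTED_COLUMNS:
                # Such a row would be ignored during import; flag it instead.
                print(f"Row {line_no} has {len(row)} columns, expected {EXPECTED_COLUMNS}.")
                continue
            writer.writerow(row)

If you import the resulting file, keep Contain Table Header enabled so that the attr_ names are read as column names.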
