Compute
Elastic Cloud Server
Huawei Cloud Flexus
Bare Metal Server
Auto Scaling
Image Management Service
Dedicated Host
FunctionGraph
Cloud Phone Host
Huawei Cloud EulerOS
Networking
Virtual Private Cloud
Elastic IP
Elastic Load Balance
NAT Gateway
Direct Connect
Virtual Private Network
VPC Endpoint
Cloud Connect
Enterprise Router
Enterprise Switch
Global Accelerator
Management & Governance
Cloud Eye
Identity and Access Management
Cloud Trace Service
Resource Formation Service
Tag Management Service
Log Tank Service
Config
OneAccess
Resource Access Manager
Simple Message Notification
Application Performance Management
Application Operations Management
Organizations
Optimization Advisor
IAM Identity Center
Cloud Operations Center
Resource Governance Center
Migration
Server Migration Service
Object Storage Migration Service
Cloud Data Migration
Migration Center
Cloud Ecosystem
KooGallery
Partner Center
User Support
My Account
Billing Center
Cost Center
Resource Center
Enterprise Management
Service Tickets
HUAWEI CLOUD (International) FAQs
ICP Filing
Support Plans
My Credentials
Customer Operation Capabilities
Partner Support Plans
Professional Services
Analytics
MapReduce Service
Data Lake Insight
CloudTable Service
Cloud Search Service
Data Lake Visualization
Data Ingestion Service
GaussDB(DWS)
DataArts Studio
Data Lake Factory
DataArts Lake Formation
IoT
IoT Device Access
Others
Product Pricing Details
System Permissions
Console Quick Start
Common FAQs
Instructions for Associating with a HUAWEI CLOUD Partner
Message Center
Security & Compliance
Security Technologies and Applications
Web Application Firewall
Host Security Service
Cloud Firewall
SecMaster
Anti-DDoS Service
Data Encryption Workshop
Database Security Service
Cloud Bastion Host
Data Security Center
Cloud Certificate Manager
Edge Security
Managed Threat Detection
Blockchain
Blockchain Service
Web3 Node Engine Service
Media Services
Media Processing Center
Video On Demand
Live
SparkRTC
MetaStudio
Storage
Object Storage Service
Elastic Volume Service
Cloud Backup and Recovery
Storage Disaster Recovery Service
Scalable File Service Turbo
Scalable File Service
Volume Backup Service
Cloud Server Backup Service
Data Express Service
Dedicated Distributed Storage Service
Containers
Cloud Container Engine
SoftWare Repository for Container
Application Service Mesh
Ubiquitous Cloud Native Service
Cloud Container Instance
Databases
Relational Database Service
Document Database Service
Data Admin Service
Data Replication Service
GeminiDB
GaussDB
Distributed Database Middleware
Database and Application Migration UGO
TaurusDB
Middleware
Distributed Cache Service
API Gateway
Distributed Message Service for Kafka
Distributed Message Service for RabbitMQ
Distributed Message Service for RocketMQ
Cloud Service Engine
Multi-Site High Availability Service
EventGrid
Dedicated Cloud
Dedicated Computing Cluster
Business Applications
Workspace
ROMA Connect
Message & SMS
Domain Name Service
Edge Data Center Management
Meeting
AI
Face Recognition Service
Graph Engine Service
Content Moderation
Image Recognition
Optical Character Recognition
ModelArts
ImageSearch
Conversational Bot Service
Speech Interaction Service
Huawei HiLens
Video Intelligent Analysis Service
Developer Tools
SDK Developer Guide
API Request Signing Guide
Terraform
Koo Command Line Interface
Content Delivery & Edge Computing
Content Delivery Network
Intelligent EdgeFabric
CloudPond
Intelligent EdgeCloud
Solutions
SAP Cloud
High Performance Computing
Developer Services
ServiceStage
CodeArts
CodeArts PerfTest
CodeArts Req
CodeArts Pipeline
CodeArts Build
CodeArts Deploy
CodeArts Artifact
CodeArts TestPlan
CodeArts Check
CodeArts Repo
Cloud Application Engine
MacroVerse aPaaS
KooMessage
KooPhone
KooDrive

Parallel OBS Data Export

Updated on 2022-07-29 GMT+08:00

Overview

GaussDB(DWS) databases allow you to export data in parallel using OBS foreign tables, in which the export mode and the exported data format are specified. Data is exported in parallel through multiple DNs from GaussDB(DWS) to the OBS server, improving the overall export performance.
  • The CN only plans data export tasks and delivers the tasks to DNs. In this case, the CN is released to process external requests.
  • The computing capability and bandwidth of all the DNs are fully leveraged to export data.
  • You can concurrently export data using multiple OBS services, but the bucket and object paths specified for the export tasks must be different and cannot be null.
  • The OBS server connects to GaussDB(DWS) cluster nodes. The export rate is affected by the network bandwidth.
  • The TEXT and CSV data file formats are supported. The size of data in a single row must be less than 1 GB.
  • Data in ORC format is supported only by 8.1.0 or later.

Related Concepts

  • Source data file: a TEXT or CSV file that stores data.
  • OBS: a cloud storage service used to store unstructured data, such as documents, images, and videos. Data objects concurrently exported from GaussDB(DWS) are stored on the OBS server.
  • Bucket: a container storing objects on OBS.
    • Object storage is a flat storage mode. Layered file system structures are not needed because all objects in buckets are at the same logical layer.
    • In OBS, each bucket name must be unique and cannot be changed. A default access control list (ACL) is created with a bucket. Each item in the ACL contains permissions granted to certain users, such as READ, WRITE, and FULL_CONTROL. Only authorized users can perform bucket operations, such as creating, deleting, viewing, and setting ACLs for buckets.
    • A user can create a maximum of 100 buckets. The total data size and the number of objects and files in each bucket are not limited.
  • Object: a basic data storage unit in OBS. Data uploaded by users is stored in OBS buckets as objects. Object attributes include Key, Metadata, and Data.

    Generally, objects are managed as files. However, OBS has no file system–related concepts, such as files and folders. To let users easily manage data, OBS allows them to simulate folders. Users can add a slash (/) in the object name, for example, tpcds1000/stock.csv. In this name, tpcds1000 is regarded as the folder name and stock.csv the file name. The value of key (object name) is still tpcds1000/stock.csv, and the content of the object is the content of the stock.csv file.

  • Key: name of an object. It is a UTF-8 character sequence containing 1 to 1024 characters. A key value must be unique in a bucket. Users can name the objects they stored or obtained as Bucket name+Object name.
  • Metadata: object metadata, which contains information about the object. There are system metadata and user metadata. The metadata is uploaded to OBS as key-value pairs together with HTTP headers.
    • System metadata is generated by OBS and used for processing object data. System metadata includes Date, Content-length, last-modify, and Content-MD5.
    • User metadata contains object descriptions specified by users for uploading objects.
  • Data: object content, which is regarded by OBS as stateless binary data.
  • Foreign table: A foreign table is used to identify data in a source data file. It stores information, such as the location, format, destination location, encoding format, and data delimiter of a source data file.

Principles

The following describes the principles of exporting data from a cluster to OBS by using a distributed hash table or a replication table.

  • Distributed hash table: the table for which DISTRIBUTE BY HASH (Column_Name) is specified in the table creation statement.

    A distributed hash table stores data in hash mode. Figure 1 shows how to export data from table (T2) to OBS as an example.

    During table data storage, the col2 hash column in table T2 is hashed, and a hash value is generated. The tuple is distributed to corresponding DNs for storage according to the mapping between the DNs and the hash value.

    When data is exported to OBS, DNs that store the exported data of T2 directly export their data files to OBS. Original data on multiple nodes will be exported in parallel.

    Figure 1 Hash distribution principle
  • Replication table: the table for which DISTRIBUTE BY REPLICATION is specified in the table creation statement.

    A replication table stores a package of complete table data on each GaussDB(DWS) node. When exporting data to OBS, GaussDB(DWS) randomly selects a DN for export.

Naming Rules of Exported Files

Rules for naming the files exported from GaussDB(DWS) to OBS are as follows:

  • Data exported from DNs is stored on OBS in segment format. The file is named as Table name_Node name_segment.n. n is a natural number starting from 0, for example, 0, 1, 2, 3.

    For example, the data of table t1 on datanode3 will be exported as t1_datanode3_segment.0, t1_datanode3_segment.1, and so on.

    You are advised to export data from different clusters or databases to different OBS buckets or different paths of the same OBS bucket.

  • Each segment can store a maximum of 1 GB data, with no tuples sliced. If data stored in a segment exceeds 1 GB, the excess data will be stored in the second segment.

    For example:

    A segment has already stored 100 pieces of tuples (1023 MB) when datanode3 exports data from t1 to OBS. If a 5 MB tuple is inserted to the segment, the data size becomes 1028 MB. In this case, file t1_datanode3_segment.0 (1023 MB) is generated and stored on OBS, and the new tuple is stored on OBS as file t1_datanode3_segment.1.

  • When data is exported from a distributed hash table, the number of segments generated on each DN depends on the data volume stored on a DN, not on the number of DNs in the cluster. Data stored in hash mode may not be evenly distributed on each DN.

    For example, a cluster has DataNode1, DataNode2, DataNode3, DataNode4, DataNode5, and DataNode6, which store 1.5 GB, 0.7 GB, 0.6 GB, 0.8 GB, 0.4 GB, and 0.5 GB data, respectively. Seven OBS segment files will be generated during data export because DataNode1 will generate two segment files, which store 1 GB and 0.5 GB data, respectively.

Data Export Process

Figure 2 Concurrent data export
Table 1 Process description

Procedure

Description

Subtask

Plan data export.

Create an OBS bucket and a folder in the OBS bucket as the directory for storing exported data files.

For details, see Planning Data Export.

-

Create an OBS foreign table.

Create a foreign table to help OBS specify information about data files to be exported. The foreign table stores information, such as the destination location, format, encoding, and data delimiter of a source data file.

For details, see Creating an OBS Foreign Table.

-

Export data.

After the foreign table is created, run the INSERT statement to efficiently export data to data files.

For details, see Exporting Data.

-

We use cookies to improve our site and your experience. By continuing to browse our site you accept our cookie policy. Find out more

Feedback

Feedback

Feedback

0/500

Selected Content

Submit selected content with the feedback