Parallel Data Import
Principles
Importing data in parallel across multiple nodes makes full use of their computing and I/O capabilities to maximize import speed. The parallel data import function of GaussDB(DWS) provides high-speed, parallel import of external data in a specified format (CSV or TEXT).
- The CN only plans and delivers data import tasks, and the DNs execute them. This reduces CN resource usage and leaves the CN free to process external service requests.
- The computing capability and network bandwidth of all the DNs are fully utilized, improving data import performance.
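As a sketch of the workflow these principles describe, a GDS import is typically driven through a foreign table that points at the GDS service. The server name follows the usual GaussDB(DWS) convention, but the address, file pattern, and column definitions below are hypothetical placeholders:

```sql
-- Hypothetical foreign table pointing at a GDS process serving CSV files.
-- The endpoint, file pattern, and columns are illustrative, not from this page.
CREATE FOREIGN TABLE staging_orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      NUMERIC(12,2)
)
SERVER gsmpp_server
OPTIONS (
    location  'gsfs://192.168.0.90:5000/orders*',  -- GDS endpoint and file pattern
    format    'csv',
    delimiter ',',
    encoding  'utf8'
);

-- The CN only plans this statement; the DNs pull data blocks from GDS
-- in parallel and load them locally.
INSERT INTO orders SELECT * FROM staging_orders;
```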
| Process | Description |
| --- | --- |
| Creating a table that complies with the Hash distribution policy | When running the CREATE TABLE statement, a service application presets the Hash distribution policy (specifying an attribute of the table as the distribution column). |
| Setting the partitioning policy | When executing the CREATE TABLE statement, a service application presets a partitioning rule (specifying an attribute of the table as the partitioning column). The Hash-distributed data on each DN is partitioned based on this rule. |
| Splitting source data | During data import, GDS splits a specified data file into data blocks of a fixed size. |
| Downloading data blocks | DNs download these data blocks from GDS in parallel. |
| Parsing data | Each DN parses the downloaded data blocks into tuples in parallel. The target DN of each tuple is determined by the Hash value computed on the distribution column. |
| Writing data into partitions | After a tuple is sent to its target DN according to the Hash value, it is written into the corresponding partition data file based on the partitioning logic. While data is written into a partitioned table in GaussDB(DWS), you can exchange partitions to improve write performance. |
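The first two rows of the table (presetting the Hash distribution policy and the partitioning policy) correspond to DDL of the following shape, and the last row mentions partition exchange. The table name, columns, and partition bounds here are illustrative only:

```sql
-- The distribution column and partitioning column are preset when the
-- table is created (names and bounds are placeholders).
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE,
    amount      NUMERIC(12,2)
)
DISTRIBUTE BY HASH (order_id)        -- Hash distribution policy
PARTITION BY RANGE (order_date) (    -- partitioning policy
    PARTITION p2023 VALUES LESS THAN ('2024-01-01'),
    PARTITION p2024 VALUES LESS THAN ('2025-01-01')
);

-- Partition exchange: swap a fully loaded ordinary table into a partition
-- to speed up writes into a partitioned table (names are placeholders).
ALTER TABLE orders EXCHANGE PARTITION (p2023) WITH TABLE orders_2023_stage;
```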
General Data Service (GDS): Multiple GDS processes can be deployed on a data server to further improve import performance.