Best Practices for Data Import

Importing Data from OBS in Parallel

Splitting a data file into multiple files
Importing a large volume of data takes a long time and consumes substantial computing resources.

To improve the performance of importing data from OBS, split a data file into multiple files as evenly as possible before importing it to OBS. The preferred number of split files is an integer multiple of the DN quantity.
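The splitting step can be sketched as follows. This is a minimal illustration, not a tool shipped with GaussDB(DWS): the function name `split_round_robin`, the `files_per_dn` parameter, and the `.partNNNN` naming scheme are all hypothetical, and the sketch assumes a UTF-8 text file with exactly one record per line (no line breaks inside quoted fields). Lines are dealt out round-robin so that the output files differ by at most one line, and the number of output files is always a multiple of the DN quantity.

```python
from contextlib import ExitStack
from pathlib import Path


def split_round_robin(src, out_dir, dn_count, files_per_dn=1):
    """Split a line-oriented text file into dn_count * files_per_dn files.

    Assumption: each line is one complete record.
    """
    src, out_dir = Path(src), Path(out_dir)
    parts = dn_count * files_per_dn  # an integer multiple of the DN quantity
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: <stem>.part0000<suffix>, <stem>.part0001<suffix>, ...
    paths = [out_dir / f"{src.stem}.part{i:04d}{src.suffix}" for i in range(parts)]
    with ExitStack() as stack:
        # newline="" keeps the original line endings untouched.
        reader = stack.enter_context(src.open("r", encoding="utf-8", newline=""))
        writers = [
            stack.enter_context(p.open("w", encoding="utf-8", newline=""))
            for p in paths
        ]
        # Stream the input so the whole file never has to fit in memory.
        for i, line in enumerate(reader):
            writers[i % parts].write(line)
    return paths
```

For example, `split_round_robin("orders.csv", "out", dn_count=3, files_per_dn=2)` would produce six files for a three-DN cluster. Round-robin distribution evens out line counts, not byte counts, so files with highly variable row widths may still differ in size.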
Verifying data files before and after an import
When importing data from OBS, first upload your files to your OBS bucket, and then verify that the bucket contains all the correct files and only those files.

After the import is complete, run a SELECT statement to verify that the required files have been imported.
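The pre-import check ("all the correct files, and only those files") amounts to comparing two sets of object names. The sketch below assumes the bucket listing has already been obtained with whatever OBS tool or SDK you use and is passed in as a plain list of names; the function name `check_bucket_contents` and the file names in the usage note are hypothetical.

```python
def check_bucket_contents(expected, listed):
    """Compare the expected file names with the names listed in the bucket.

    Returns (missing, unexpected): files that should be in the bucket but
    are not, and files that are in the bucket but should not be.
    """
    expected_set, listed_set = set(expected), set(listed)
    missing = sorted(expected_set - listed_set)
    unexpected = sorted(listed_set - expected_set)
    return missing, unexpected
```

Calling it with `expected=["a.csv", "b.csv"]` and `listed=["a.csv", "c.csv"]` reports `b.csv` as missing and `c.csv` as unexpected. The bucket is ready for import only when both lists are empty.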
Ensuring that paths used for importing data to or exporting data from OBS contain no Chinese characters
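A path can be screened for Chinese characters before it is used. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and it checks only the two most common Unicode ranges for Chinese ideographs (CJK Unified Ideographs and Extension A), so rarer extension blocks and full-width punctuation are not detected.

```python
# Assumption: these two ranges cover the Chinese characters likely to
# appear in file paths; rarer CJK extension blocks are not checked.
_CJK_RANGES = (
    ("\u4e00", "\u9fff"),  # CJK Unified Ideographs
    ("\u3400", "\u4dbf"),  # CJK Unified Ideographs Extension A
)


def has_chinese_characters(path: str) -> bool:
    """Return True if any character of the path falls in the ranges above."""
    return any(lo <= ch <= hi for ch in path for lo, hi in _CJK_RANGES)
```

A stricter alternative is to reject any path for which `path.isascii()` is false, which also rules out other non-ASCII characters.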

Using GDS to Import Data

Data skew causes the query performance to deteriorate. Before importing all the data from a table containing over 10 million records, you are advised to import some of the data and check whether there is data skew and whether the distribution keys need to be changed. Troubleshoot the data skew if any. It is costly to address data skew and change the distribution keys after a large amount of data has been imported. For details, see Checking for Data Skew.
To speed up the import, you are advised to split files and use multiple Gauss Data Service (GDS) tools to import data in parallel. An import task can be split into multiple concurrent import tasks. If multiple import tasks use the same GDS, you can specify the -t parameter to enable GDS multi-thread concurrent import. To prevent physical I/O and network bottlenecks, you are advised to mount GDSs to different physical disks and NICs.
If the GDS I/O and NICs do not reach their physical bottlenecks, you can enable SMP on GaussDB(DWS) for acceleration. SMP will multiply the pressure on GDSs. Note that SMP adaptation is implemented based on the GaussDB(DWS) CPU pressure rather than the GDS pressure. For more information about SMP, see SMP Manual Optimization Suggestions.
To ensure proper communication between GDSs and GaussDB(DWS), you are advised to use 10GE networks. 1GE networks cannot sustain high-speed data transmission and therefore cannot ensure proper communication between GDSs and GaussDB(DWS). To maximize the import rate of a single file, ensure that a 10GE network is used and that the data disk group I/O rate is greater than the upper limit of the GDS single-core processing capability (about 400 MB/s).
Similar to the single-table import, ensure that the I/O rate is greater than the maximum network throughput in the concurrent import.
It is recommended that the ratio of GDS quantity to DN quantity be in the range of 1:3 to 1:6.
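The recommended ratio translates into a simple range calculation. The following sketch is only arithmetic on the 1:3 to 1:6 guideline; the function name is hypothetical, and for very small clusters, where no whole number of GDSs falls inside the range, it falls back to a single GDS as an assumption of this example.

```python
def recommended_gds_range(dn_count: int) -> tuple[int, int]:
    """Return (min_gds, max_gds) so that GDS:DN stays within 1:6 to 1:3."""
    if dn_count < 1:
        raise ValueError("dn_count must be at least 1")
    low = max(1, -(-dn_count // 6))  # ceil(dn_count / 6): ratio no thinner than 1:6
    high = max(low, dn_count // 3)   # floor(dn_count / 3): ratio no denser than 1:3
    return low, high
```

For a cluster with 12 DNs, this gives a range of 2 to 4 GDSs.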
To improve the efficiency of importing data in batches to column-store partitioned tables, the data is buffered before being written to disk. You can specify the number of buffers and the buffer size by setting partition_mem_batch and partition_max_cache_size, respectively. The smaller the values, the slower the batch import to column-store partitioned tables. The larger the values, the higher the memory consumption.

Using INSERT to Insert Multiple Rows

If the COPY statement cannot be used during data import, you can use multi-row inserts to insert data in batches. Multi-row inserts improve performance by batching up a series of inserts.

The following example inserts three rows into a three-column table using a single INSERT statement. This is still a small insert, shown simply to illustrate the syntax of a multi-row insert.

To insert multiple rows of data into the table customer_t1, run the following statement:

    
    INSERT INTO customer_t1 VALUES
        (6885, 'maps', 'Joes'),
        (4321, 'tpcds', 'Lily'),
        (9527, 'world', 'James');

For more details and examples, see INSERT.
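When rows come from an application rather than a hand-written statement, the same batching idea can be sketched in Python. This example is an illustration only: it uses Python's built-in sqlite3 module as a stand-in so that it runs anywhere, not a GaussDB(DWS) driver; the function name, the `batch_size` default, and the column names in the usage note are hypothetical; and the `?` placeholder style is specific to sqlite3 (other drivers use different placeholders). The table name is interpolated into the SQL text, so it must come from trusted code, never from user input.

```python
import sqlite3


def insert_in_batches(conn, table, rows, batch_size=100):
    """Insert rows using one multi-row INSERT statement per batch.

    Returns the number of INSERT statements executed.
    """
    rows = list(rows)
    if not rows:
        return 0
    width = len(rows[0])
    row_placeholder = "(" + ", ".join("?" * width) + ")"  # e.g. (?, ?, ?)
    statements = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        sql = f"INSERT INTO {table} VALUES " + ", ".join(
            [row_placeholder] * len(batch)
        )
        # Flatten the batch into one parameter list matching the placeholders.
        params = [value for row in batch for value in row]
        conn.execute(sql, params)
        statements += 1
    conn.commit()
    return statements
```

With an in-memory database created by `sqlite3.connect(":memory:")` and a three-column table such as `customer_t1 (id INTEGER, code TEXT, name TEXT)`, inserting the three rows above with `batch_size=2` issues two statements instead of three. Larger batches reduce the statement count further, up to the parameter limit of the database in use.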

Using the COPY Statement to Import Data

The COPY statement imports data from local and remote databases in parallel. COPY imports large amounts of data more efficiently than INSERT statements.

For how to use the COPY command, see Running the COPY FROM STDIN Statement to Import Data.

Using a gsql Meta-Command to Import Data

The \copy command can be used to import data after you log in to a database through any gsql client. Compared with the COPY command, the \copy command directly reads or writes local files instead of reading or writing files on the database server.

Data read or written using the \copy command is transferred through the connection between the server and the client, so it may be less efficient than the SQL COPY command. The COPY statement is recommended when the amount of data is large.

For how to use the \copy command, see Using a gsql Meta-Command to Import Data.

\copy is suitable only for importing small batches of uniformly formatted data, and its error tolerance is poor. GDS or COPY is preferred for data import.
