Updated on 2023-10-23 GMT+08:00

Best Practices of Data Import

Using GDS to Import Data

  • Data skew degrades query performance. Before importing a full table that contains more than 10 million records, import a portion of the data first, check for data skew, and decide whether the distribution keys need to be changed. Resolve any skew at this stage, because fixing skew and changing distribution keys after a large amount of data has been imported is costly. For details, see Checking for Data Skew.
  • To speed up the import, split the source files and use multiple Gauss Data Services (GDSs) to import data in parallel. An import task can be split into multiple concurrent import tasks. If multiple import tasks use the same GDS, specify the -t parameter to enable multi-thread concurrent import on that GDS. To prevent physical I/O and network bottlenecks, mount GDSs on different physical disks and NICs.
  • To ensure that jobs run normally, provision sufficient system resources in the physical environment where GDSs are located, based on the load and concurrency of the GDSs. These resources include, but are not limited to, the memory size, the number of handles, and the available space on the disk that holds the GDS data directory. If GDSs are deployed outside the GaussDB cluster, ensure that their physical environment configuration is consistent with that in the cluster.
  • If GDS I/O and the NICs have not reached their physical limits, you can enable SMP on GaussDB to accelerate the import. SMP multiplies the load on GDSs. Note that SMP adapts to the CPU load of GaussDB, not to the load on GDSs.
  • The network between GDS and GaussDB must be fast and stable. A 10GE network is recommended, because Gigabit networks cannot sustain high-speed data transmission and therefore cannot guarantee network communication with GaussDB. To maximize the import speed of a single file, ensure that a 10GE network is used and that the I/O rate of the data disk group is greater than the upper limit of the GDS single-core processing capability (about 400 Mbit/s).
  • In concurrent imports, as in single-table imports, ensure that the I/O rate is greater than the maximum network throughput.
  • You are advised to deploy one or two GDSs on a RAID of a data server.
  • It is recommended that the ratio of GDS quantity to DN quantity be in the range of 1:3 to 1:6.
  • To improve the efficiency of batch imports into column-store partitioned tables, data is buffered before it is written to disk. You can set the number of buffers with partition_mem_batch and the buffer size with partition_max_cache_size. Smaller values slow down batch imports into column-store partitioned tables. Larger values increase memory consumption.
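
Two of the recommendations above reduce to simple arithmetic, which the following Python sketch illustrates: judging skew from per-DN row counts after a partial import, and deriving a GDS count from the recommended GDS-to-DN ratio. The function names, the skew metric (largest minus smallest count, divided by the largest), and the sample counts are assumptions made for illustration. They are not part of GaussDB, and the procedure in Checking for Data Skew remains the authoritative method.

```python
import math


def skew_ratio(rows_per_dn):
    """Return (max - min) / max of the per-DN row counts.

    0.0 means the rows are spread evenly; values close to 1.0 mean
    at least one DN holds far fewer rows than the busiest DN.
    This metric is an illustrative assumption, not a GaussDB built-in.
    """
    if not rows_per_dn:
        raise ValueError("need at least one DN row count")
    largest = max(rows_per_dn)
    if largest == 0:
        return 0.0
    return (largest - min(rows_per_dn)) / largest


def gds_count_range(dn_count):
    """Return (low, high) GDS counts for a GDS:DN ratio of 1:6 to 1:3."""
    if dn_count < 1:
        raise ValueError("dn_count must be positive")
    low = max(1, math.ceil(dn_count / 6))   # ratio 1:6, never below one GDS
    high = max(low, dn_count // 3)          # ratio 1:3
    return low, high


# Hypothetical row counts per DN, collected after importing part of a table.
sample_counts = [250_000, 248_500, 251_200, 90_300]
print(f"skew ratio: {skew_ratio(sample_counts):.2f}")
print("GDS count range for 12 DNs:", gds_count_range(12))
```

A high ratio on the sample is a signal to revisit the distribution key before importing the remaining data.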

Using INSERT to Insert Multiple Rows

If the COPY statement cannot be used and you need SQL inserts, use multi-row inserts whenever possible. With column-store tables, inserting only one or a few rows at a time results in poor data compression.

Multi-row insert improves performance by batching a series of inserts. The following example inserts three rows into a four-column table using a single INSERT statement. This is still a small insert, shown simply to illustrate the syntax of a multi-row insert. For details about how to create a table, see Creating and Managing Tables.

To insert multiple rows of data to the table customer_t1, run the following command:

openGauss=# INSERT INTO customer_t1 VALUES 
(68, 'a1', 'zhou','wang'),
(43, 'b1', 'wu', 'zhao'),
(95, 'c1', 'zheng', 'qian');

For more details and examples, see INSERT.
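
When rows are produced by an application rather than typed by hand, the same batching can be done in client code. The following Python sketch groups rows into multi-row INSERT statements with placeholders. The helper name multi_row_inserts, the batch size, and the "%s" placeholder style (used by some Python database drivers) are assumptions for illustration. Adapt them to the driver you actually use.

```python
def multi_row_inserts(table, rows, batch_size=1000):
    """Yield (sql, params) pairs, each inserting up to batch_size rows.

    `table` is interpolated into the SQL text, so it must be a trusted
    identifier. The row values travel separately as parameters.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        width = len(batch[0])
        if any(len(row) != width for row in batch):
            raise ValueError("all rows in a batch must have the same number of columns")
        one_row = "(" + ", ".join(["%s"] * width) + ")"
        sql = f"INSERT INTO {table} VALUES " + ", ".join([one_row] * len(batch))
        # Flatten the batch so the parameters line up with the placeholders.
        params = [value for row in batch for value in row]
        yield sql, params


rows = [
    (68, "a1", "zhou", "wang"),
    (43, "b1", "wu", "zhao"),
    (95, "c1", "zheng", "qian"),
]
for sql, params in multi_row_inserts("customer_t1", rows, batch_size=2):
    print(sql, params)
```

Each yielded pair can be passed to a driver's execute call, so that one round trip inserts a whole batch instead of a single row.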

Using COPY to Import Data

The COPY statement imports data from local and remote databases in parallel. It imports large amounts of data more efficiently than INSERT statements.

For details about how to use the COPY statement, see Running the COPY FROM STDIN Statement to Import Data.
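
Data sent to COPY FROM STDIN has to be serialized by the client first. The following Python sketch formats one row as a line of tab-delimited text, writing NULL as \N and escaping backslashes and control characters. These conventions follow the PostgreSQL-style COPY text format. That GaussDB uses the same defaults is an assumption here, so check the COPY reference before relying on it. The helper name to_copy_text_line is hypothetical.

```python
def to_copy_text_line(row, delimiter="\t", null="\\N"):
    """Serialize one row as a line of COPY text-format input.

    Assumes PostgreSQL-style conventions: tab delimiter, \\N for NULL,
    and backslash escapes for special characters.
    """
    def encode(value):
        if value is None:
            return null
        text = str(value)
        # Escape the backslash first so later escapes are not doubled.
        text = text.replace("\\", "\\\\")
        text = text.replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t")
        if delimiter != "\t":
            text = text.replace(delimiter, "\\" + delimiter)
        return text

    return delimiter.join(encode(value) for value in row) + "\n"


# One row with a NULL first name, using the columns of customer_t1.
print(repr(to_copy_text_line((68, "a1", None, "wang"))))
```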

Using a gsql Meta-Command to Import Data

The \copy command can be used to import data after you log in to a database through a gsql client. Unlike the COPY statement, the \copy command reads from or writes to a file on the client side.

Data read or written using the \copy command is transferred through the connection between the server and the client and may not be efficient. The COPY statement is recommended when the amount of data is large.

For details about how to use the \copy command, see Using a gsql Meta-Command to Import Data.

\copy is suitable only for importing small amounts of well-formatted data. It neither preprocesses invalid characters nor provides error tolerance, so it cannot be used when the data contains abnormal records. GDS or COPY is preferred for data import.
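
Because \copy offers no error tolerance, it can help to check a file before importing it. The following Python sketch reports the line numbers of rows whose column count differs from what the target table expects. The function name find_malformed_rows and the comma delimiter are assumptions for illustration. The check covers only column counts, not invalid characters.

```python
import csv
import io


def find_malformed_rows(lines, expected_columns, delimiter=","):
    """Return the line numbers of rows that do not have expected_columns fields.

    `lines` is any iterable of text lines, such as an open file.
    Blank lines are reported too, because they parse as zero fields.
    """
    bad_lines = []
    reader = csv.reader(lines, delimiter=delimiter)
    for row in reader:
        if len(row) != expected_columns:
            bad_lines.append(reader.line_num)
    return bad_lines


# Hypothetical sample data: the second line is missing its last column.
sample = io.StringIO("68,a1,zhou,wang\n43,b1,wu\n95,c1,zheng,qian\n")
print(find_malformed_rows(sample, expected_columns=4))
```

If the check reports any lines, fix the file first or import it with GDS or COPY instead.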

Using INSERT for Bulk Insert

Use a bulk insert operation with a SELECT clause for high-performance data insertion.

Use the INSERT and CREATE TABLE AS statements when you need to move data or a subset of data from one table into another.

Assume that you want to create a backup table customer_t2 for table customer_t1. To create customer_t2 and insert the data from customer_t1 into it, run the following statements:

openGauss=# CREATE TABLE customer_t2
(
    c_customer_sk             integer,
    c_customer_id             char(5),
    c_first_name              char(6),
    c_last_name               char(8)
);
openGauss=# INSERT INTO customer_t2 SELECT * FROM customer_t1;

The preceding example is equivalent to:

openGauss=# CREATE TABLE customer_t2 AS SELECT * FROM customer_t1;
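
To move only a subset of the data, add a WHERE clause to the SELECT. The following Python sketch runs both forms against an in-memory SQLite database, which is used here purely as a stand-in so that the example is self-contained. It is not GaussDB, and storage options such as distribution keys are not represented. The filter value and the table name customer_t3 are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite stand-in, not a GaussDB connection

columns = (
    "c_customer_sk integer, c_customer_id char(5), "
    "c_first_name char(6), c_last_name char(8)"
)
conn.execute(f"CREATE TABLE customer_t1 ({columns})")
conn.executemany(
    "INSERT INTO customer_t1 VALUES (?, ?, ?, ?)",
    [
        (68, "a1", "zhou", "wang"),
        (43, "b1", "wu", "zhao"),
        (95, "c1", "zheng", "qian"),
    ],
)

# Form 1: create the target table first, then bulk insert a subset.
conn.execute(f"CREATE TABLE customer_t2 ({columns})")
conn.execute("INSERT INTO customer_t2 SELECT * FROM customer_t1 WHERE c_customer_sk > 50")

# Form 2: create and fill the target table in a single statement.
conn.execute("CREATE TABLE customer_t3 AS SELECT * FROM customer_t1 WHERE c_customer_sk > 50")

copied_t2 = conn.execute("SELECT count(*) FROM customer_t2").fetchone()[0]
copied_t3 = conn.execute("SELECT count(*) FROM customer_t3").fetchone()[0]
print(copied_t2, copied_t3)
```

Both forms copy the same rows. The difference is whether the target table is defined explicitly beforehand or derived from the query.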