Help Center/ GaussDB/ User Guide/ Database Migration/ Using gs_loader to Import Data
Updated on 2024-12-25 GMT+08:00

Using gs_loader to Import Data

Scenarios

You can use gs_loader to import the files exported by using the copy to command. The gs_loader tool converts the syntax supported by control files into \copy syntax, then leverages the existing \copy function to import main data. At the same time, gs_loader logs the results of the \copy operations to a log file.

Precautions

  • gs_loader does not support M-compatible databases.
  • Before using gs_loader, ensure that the gs_loader version is consistent with the gsql version and database version.
  • Currently, gs_loader is only available for centralized instances.
  • When you use gs_loader to import data, if transcoding is not required, the size of a single row of data (including tuple metadata, same as mentioned below) is less than 1 GB to 1 B. If transcoding is required, the size of a single row of data is less than 256 MB to 1 B. Special handling has been applied to the following transcoding scenarios: the size for UTF-8 -> GB18030/GB18030_2022 transcoding is less than 512 MB to 1 B, and the size for UTF-8 -> GBK transcoding is less than 1 GB to 1 B.
  • It is recommended that the size of a single file to be imported be less than or equal to 1 GB. gs_loader has no limit on the size of a single file to be imported. However, importing a large file is time-consuming. Therefore, you are advised to split a large file, and start multiple gs_loader processes to write data to the table in append mode. (If there is a need to truncate data, it should be done by performing a separate TRUNCATE operation, rather than writing the TRUNCATE statement into the control file.) When the CPU resources are sufficient, this method can effectively improve the import speed.

Preparing for Data Migration

  1. Prepare an ECS or a device that can access the GaussDB instance over EIP.
    • To connect to a GaussDB instance through an ECS, you must first create an ECS.

      For details on how to create and log in to an ECS, see Purchasing an ECS and Logging In to an ECS in Elastic Cloud Server Getting Started.

    • To connect to a GaussDB instance through a device that can access the GaussDB instance over EIP, you must:
      1. Bind an EIP to the GaussDB instance. For details, see Binding an EIP.
      2. Ensure that the local device can access the EIP that has been bound to the GaussDB instance.
  2. Install the gsql client on the prepared ECS or device that can access the GaussDB database, and connect it to the GaussDB database instance. For details, see Using gsql to Connect to a Database.

Procedure

  1. Create a control file and prepare a data file.

    1. Create a control file, for example, /tmp/data/loader.ctl, and import data to the loader_tbl table. WHEN requires that the second character in each row be a comma (,).
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      LOAD DATA
      truncate into table loader_tbl
      WHEN (2:2) = ',' 
      fields terminated by ','
      trailing nullcols
      (
          id integer external,
          name char(32),
          con ":id || '-' || :name",
          dt date
      )
      
    2. Create a GUC parameter file, for example, /tmp/data/guc.txt.
      1
      set a_format_copy_version='s1';
      
    3. Create a data file, for example, /tmp/data/data.csv.
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      1,OK,,2007-07-8
      2,OK,,2008-07-8
      3,OK,,2009-07-8
      4,OK,,2007-07-8
      43,DISCARD,,2007-07-8
      ,,,
      32,DISCARD,,2007-07-8
      a,ERROR int,,2007-07-8
      8,ERROR date,,2007-37-8
      ,,,,
       ,
      8,ERROR fields,,2007-37-8
      ,,,
      5,OK,,2021-07-30
      

  2. Create a user and grant privileges to the user.

    CREATE USER load_user WITH PASSWORD '************';
    GRANT ALL ON SCHEMA public TO load_user;
    SELECT copy_summary_create() WHERE NOT EXISTS(SELECT * FROM pg_tables WHERE schemaname='public' AND tablename='gs_copy_summary');
    GRANT ALL PRIVILEGES ON  public.gs_copy_summary To load_user;
    SELECT copy_error_log_create() WHERE NOT EXISTS(SELECT * FROM pg_tables WHERE schemaname='public' AND tablename='pgxc_copy_error_log');
    GRANT ALL PRIVILEGES ON  public.pgxc_copy_error_log To load_user;
    \c - load_user
    Password for user load_user:

  3. Create a target table to import data. For example, you can create the loader_tbl table in the gs_example database.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    \c gs_example
    Password for user load_user:
    CREATE TABLE loader_tbl
    (
        ID   NUMBER,
        NAME VARCHAR2(20),
        CON  VARCHAR2(20),
        DT   DATE
    );
    

  4. Import the data.

    Exit the current login connection.

    1
    \q
    

    Before importing data, ensure that gs_loader has the required privileges. Ensure that the current directory has write privileges (gs_loader generates some temporary files when importing data. The files are automatically deleted after the import is completed).

    1
    gs_loader control=/tmp/data/loader.ctl data=/tmp/data/data.csv db=gs_example bad=/tmp/data/loader.bad guc_param=/tmp/data/guc.txt errors=5 port=8000 passwd=************ user=load_user -h 192.*.*.139;
    
    The output is shown as follows:
    1
    2
    3
    4
    5
    6
    gs_loader: version 0.1
    
     5 Rows successfully loaded. 
    
    log file is: 
     /tmp/data/loader.log
    

    In the /tmp/data/data.csv file, the first four rows and the last row are imported. Rows 5 and 7 were ignored because the second character was not a comma (,). Rows 6, 10, and 13 were skipped because all fields were empty. Rows 8, 9, and 12 were not imported due to incorrect field values. The file defined as bad=/tmp/data/loader.bad can be used to view the erroneous lines, and the execution result is in /tmp/data/loader.log, which records the imported log information.

Table 1 gs_loader parameters

Parameter

Description

Parameter Type/Value Range

help

Help information.

-

user

Database connection user (equivalent to -U).

Character string

-U

Database connection user (equivalent to user).

Character string

passwd

User password (equivalent to -W).

Character string

-W

User password (equivalent to passwd).

Character string

db

(Required) Database name. This parameter is equivalent to -d.

Character string

-d

(Required) Database name. This parameter is equivalent to db.

Character string

host

Host name of the running server, the path of the Unix domain socket, or the domain name. You can specify multiple host addresses by using character strings separated by commas (,). This parameter is equivalent to -h.

If multiple host addresses are specified, the primary node is connected by default.

See the gsql --host parameter.

-h

Host name of the running server, the path of the Unix domain socket, or the domain name. You can specify multiple host addresses by using character strings separated by commas (,). This parameter is equivalent to host.

If multiple host addresses are specified, the primary node is connected by default.

See the gsql --host parameter.

port

Port number of the database server. One or more port numbers can be configured. When one port number is configured, all IP addresses use the same port for connection. When multiple port numbers are configured, the sequence is the same as the IP address sequence, and the number of port numbers must be the same as the number of IP addresses. If they are different, an error is reported. This parameter is equivalent to -p.

See the gsql --port parameter.

-p

Port number of the database server. One or more port numbers can be configured. When one port number is configured, all IP addresses use the same port for connection. When multiple port numbers are configured, the sequence is the same as the IP address sequence, and the number of port numbers must be the same as the number of IP addresses. If they are different, an error is reported. This parameter is equivalent to port.

See the gsql --port parameter.

create

Specifies whether to create the pgxc_copy_error_log and gs_copy_summary tables. In the current version, the two tables are created by default. Therefore, this parameter is meaningless. This parameter is reserved only for compatibility.

The value can be true or false. The default value is true.

data

(Required) Data file. You can specify multiple data files or use wildcards (*) and question marks (?) to represent multiple data files.

Character string

control

(Required) Name of a control file.

Character string

log

Name of a log file.

Character string

bad

Name of the file that records the error rows and details. You can also specify a directory. If you do not specify a directory, the file is generated based on the data file name.

Character string

errors

Maximum number of error rows in a data file.

Integer. The default value is 0.

limit

Maximum number of rows that can be imported.

Integer. By default, the value is infinite.

  • All parameters are in lowercase and are compatible with the gsql login mode, including -p port number, -h host, -d database, -U username, and -W password.
  • When the rows parameter is specified, the number of commit times cannot exceed 1,000. Otherwise, the performance will be affected. The number of commit times is approximately equal to the number of data rows in the data file divided by the value of rows. If the rows parameter is not specified, there is no default value for rows. In this case, the transaction is committed only once after all data is imported to the table.
  • Frequent commit of a small amount of data affects the data import performance. You are advised to set the rows parameter properly to ensure that the amount of data committed each time is greater than 5 MB. For common servers with 16 vCPUs | 128 GB specifications, in the scenario where one primary node and two standby nodes are deployed and 13 GB of data is imported to a table with five columns, the rate of multiple commits is about 10 MB/s, which is basically same as that of a single commit (5 MB data is committed each time; network impacts are not considered).
  • Currently, gs_loader are only compatible with NUL characters in data files. It does not support NUL characters in CTL files. If the .ctl file contains the nul character, unexpected problems may occur.

For details about other parameters and control file syntax, see gs_loader in Tool Reference for Centralized Instances.