
gs_loader

Overview

gs_loader is used to import data. It converts the syntax supported in a control file to \COPY syntax, imports the data through the existing \COPY function, and records the \COPY results in logs.

Before using gs_loader, ensure that the gs_loader version is consistent with the gsql version and database version.

gs_loader does not support M-compatible databases.

Installation and Deployment

Install and configure the gs_loader client tool on the server where source data files are stored so that you can use the gs_loader tool to import data.

  1. Create a directory for storing the gs_loader tool package.

    mkdir -p /opt/bin

  2. Upload the gsql tool package to the created directory.

    For example, upload the gsql tool package GaussDB-Kernel_Database version number_OS version number_64bit_gsql.tar.gz in the software installation package to the created directory.

  3. Go to the directory where the tool package is located and decompress the package.

    cd /opt/bin
    tar -zxvf GaussDB-Kernel_Database version number_OS version number_64bit_gsql.tar.gz
    source gsql_env.sh

  4. Verify the tool location.

    which gs_loader

  5. Verify the client version information.

    The gs_loader version number matches the gsql version number, so you can verify the client version by querying the gsql client version directly.
    gsql -V

  6. Verify that the database version is the same as the client tool version.

    Use gsql to connect to the database and run the following command:
    select version();
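
    For example, the full check might look as follows (a minimal sketch; testdb, port 8000, and load_user are placeholder connection values, and the exact version banner varies by release):

    # Confirm the tool location, the client version, and the server version.
    which gs_loader
    gsql -V
    gsql -d testdb -p 8000 -U load_user -c "select version();"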

Log Level Configuration

Set the log level so that developers can view the tool's run-time information. After the setting, the running information is printed to the console. Use one of the following levels:

export gs_loader_log_level=debug
export gs_loader_log_level=info
export gs_loader_log_level=warning
export gs_loader_log_level=error

Permission

The application scenarios are classified into separation-of-duties and non-separation-of-duties scenarios. You can set enableSeparationOfDuty to on or off to enable or disable the separation of duties function.

The enable_copy_error_log parameter specifies whether to use the error table pgxc_copy_error_log. The default value is off, indicating that the error table is not used and error records are directly recorded in the .bad file of gs_loader. If this parameter is set to on, the error table pgxc_copy_error_log is used and error records are inserted into the error table.

If enableSeparationOfDuty is set to off (the default), the user can be a common database user or an administrator. An administrator account can be used directly; for a common user, the administrator must first grant the required permissions. The GUC parameter enable_copy_error_log determines whether the error table pgxc_copy_error_log is used; it is disabled by default.

  1. Create a user as an administrator.
    CREATE USER load_user WITH PASSWORD '************';
  2. Grant the public schema permission to the new user.
    GRANT ALL ON SCHEMA public TO load_user;
  3. Create the gs_copy_summary table and grant permissions to the new user.

    The gs_copy_summary table cannot contain objects that may cause privilege escalation, such as RULE, TRIGGER, index functions, row-level security, CHECK constraints, GENERATED columns, DEFAULT columns, and ON UPDATE columns; otherwise, the system considers that the table was created by a malicious user, reports an error, and exits.

    SELECT copy_summary_create() WHERE NOT EXISTS(SELECT * FROM pg_tables WHERE schemaname='public' AND tablename='gs_copy_summary');
    GRANT INSERT,SELECT ON public.gs_copy_summary TO load_user;
  4. (Optional) Create an error table pgxc_copy_error_log and grant permissions to the new user.
    • If the GUC parameter enable_copy_error_log is not set (off by default) or is set to off, you do not need to use the error table and do not need to create it. Otherwise, you need to create the error table.
    • The pgxc_copy_error_log table cannot contain objects that may cause privilege escalation, such as RULE, TRIGGER, index functions, row-level security, CHECK constraints, GENERATED columns, DEFAULT columns, and ON UPDATE columns; otherwise, the system considers that the table was created by a malicious user, reports an error, and exits.
    SELECT copy_error_log_create() WHERE NOT EXISTS(SELECT * FROM pg_tables WHERE schemaname='public' AND tablename='pgxc_copy_error_log');
    GRANT INSERT,SELECT,DELETE ON public.pgxc_copy_error_log TO load_user;
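
    For example, if your release allows setting it at session level, the error table could be enabled through the file passed via guc_param (a sketch; the file name guc.txt follows the examples below, and whether enable_copy_error_log is session-settable is an assumption to verify for your release):

    -- guc.txt, passed to gs_loader as guc_param=guc.txt
    set enable_copy_error_log=on;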

If enableSeparationOfDuty is set to on, the user can be a common database user or an administrator. Create the pgxc_copy_error_log and gs_copy_summary tables in their respective schemas and add indexes. No permission granting is required.

  1. Create a user as the initial user.
    CREATE USER load_user WITH PASSWORD '********';
  2. Switch to the new user as the initial user.
    \c - load_user
  3. Create a gs_copy_summary table and add an index.
    CREATE TABLE load_user.gs_copy_summary(relname varchar, begintime timestamptz, endtime timestamptz, id bigint, pid bigint, readrows bigint, skiprows bigint, loadrows bigint, errorrows bigint, whenrows bigint, allnullrows bigint, detail text);
    CREATE INDEX gs_copy_summary_idx ON load_user.gs_copy_summary(id);
  4. (Optional) Create a pgxc_copy_error_log table and add an index.
    1. If the GUC parameter enable_copy_error_log is not set (off by default) or is set to off, you do not need to use the error table and do not need to create it. Otherwise, you need to create the error table.
    CREATE TABLE load_user.pgxc_copy_error_log (relname varchar, begintime timestamptz, filename varchar, lineno int8, rawrecord text, detail text);
    CREATE INDEX copy_error_log_relname_idx ON load_user.pgxc_copy_error_log(relname);

Usage Environment

Add the tool path to the PATH environment variable. gs_loader supports SSL encrypted communication; its usage is the same as that of gsql.
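
For example (a minimal sketch, assuming the tool was unpacked under /opt/bin as in the installation steps above):

export PATH=/opt/bin:$PATH
source /opt/bin/gsql_env.sh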

Adding System Catalogs

The gs_copy_summary table is added to record the COPY execution result summary, including the number of successful rows, number of error rows, number of ignored rows, and number of empty rows.

The copy_summary_create function is added to create the gs_copy_summary table.

The format of the gs_copy_summary table is as follows:

relname     | public.sqlldr_tbl
begintime   | 2021-09-03 16:00:11.7129-04
endtime     | 2021-09-03 16:00:15.259908-04
id          | 21870
pid         | 47582725060352
readrows    | 100000
skiprows    | 0
loadrows    | 111
errorrows   | 0
whenrows    | 99889
allnullrows | 0
detail      | 111 Rows successfully loaded.
            | 0 Rows not loaded due to data errors.
            | 99889 Rows not loaded because all WHEN clauses were failed.
            | 0 Rows not loaded because all fields were null.
            | 

Columns in the gs_copy_summary System Catalog

Table 1 gs_copy_summary columns

Column        Description
relname       Name of the target table to be imported.
begintime     Start time of the import task.
endtime       End time of the import task.
id            ID of the import transaction.
pid           ID of the worker thread for the current import.
readrows      Total number of data rows read by the import task.
skiprows      Total number of data rows skipped in the import task.
loadrows      Number of data rows successfully imported in the current import task.
errorrows     Number of error data rows in the current import task.
whenrows      Number of data rows that violate the WHEN filter criterion in the current import task.
allnullrows   Number of data rows in which all columns are null.
detail        Summary of the import task, including the number of successfully imported rows, error rows, rows that violate the WHEN condition, and all-null rows.
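
For example, to check the outcome of the most recent import recorded in this catalog (a minimal sketch; ordering by endtime is just one way to find the latest run):

SELECT relname, loadrows, errorrows, whenrows, detail
FROM public.gs_copy_summary
ORDER BY endtime DESC
LIMIT 1;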

Usage Guidelines

  1. (If the separation of duties function is disabled) For common users only:

    1. Create a user (as an administrator).
      CREATE USER load_user WITH PASSWORD '************';
    2. Grant the public schema permission to the user (as an administrator).
      GRANT ALL ON SCHEMA public TO load_user;
    3. Create the gs_copy_summary table and grant table permissions to the user (as an administrator).
      SELECT copy_summary_create() WHERE NOT EXISTS(SELECT * FROM pg_tables WHERE schemaname='public' AND tablename='gs_copy_summary');
      GRANT ALL PRIVILEGES ON public.gs_copy_summary TO load_user;
    4. (Optional) Create the pgxc_copy_error_log table and grant table permissions to the user (as an administrator).

      If the GUC parameter enable_copy_error_log is not set (off by default) or is set to off, you do not need to use the error table and do not need to create it. Otherwise, you need to create the error table.

      SELECT copy_error_log_create() WHERE NOT EXISTS(SELECT * FROM pg_tables WHERE schemaname='public' AND tablename='pgxc_copy_error_log');
      GRANT ALL PRIVILEGES ON public.pgxc_copy_error_log TO load_user;
    5. Switch to the load_user user.
      \c - load_user

  2. (If the separation of duties function is enabled) For common users and administrators:

    1. Create a user (as the initial user).
      CREATE USER load_user WITH PASSWORD '********';
    2. Switch to the load_user user (as the initial user).
      \c - load_user
    3. Create a gs_copy_summary table and add an index.
      CREATE TABLE load_user.gs_copy_summary(relname varchar, begintime timestamptz, endtime timestamptz, id bigint, pid bigint, readrows bigint, skiprows bigint, loadrows bigint, errorrows bigint, whenrows bigint, allnullrows bigint, detail text);
      CREATE INDEX gs_copy_summary_idx ON load_user.gs_copy_summary(id);
    4. (Optional) Create a pgxc_copy_error_log table and add an index.

      If the GUC parameter enable_copy_error_log is not set (off by default) or is set to off, you do not need to use the error table and do not need to create it. Otherwise, you need to create the error table.

      CREATE TABLE load_user.pgxc_copy_error_log (relname varchar, begintime timestamptz, filename varchar, lineno int8, rawrecord text, detail text);
      CREATE INDEX copy_error_log_relname_idx ON load_user.pgxc_copy_error_log(relname);

  3. Create a table and a control file, and prepare a data file.

    Create the loader_tbl table.

    CREATE TABLE  loader_tbl
    (
        ID   NUMBER,
        NAME VARCHAR2(20),
        CON  VARCHAR2(20),
        DT   DATE
    );
    

    (On the gs_loader client) Create the control file loader.ctl.

    LOAD DATA
    truncate into table loader_tbl
    WHEN (2:2) = ',' 
    fields terminated by ','
    trailing nullcols
    (
        id integer external,
        name char(32),
        con ":id || '-' || :name",
        dt date
    )
    

    (On the gs_loader client) Create the GUC parameter file guc.txt.

    set a_format_copy_version='s1';
    

    (On the gs_loader client) Create the data file data.csv.

    1,OK,,2007-07-8
    2,OK,,2008-07-8
    3,OK,,2009-07-8
    4,OK,,2007-07-8
    43,DISCARD,,2007-07-8
    ,,,
    32,DISCARD,,2007-07-8
    a,ERROR int,,2007-07-8
    8,ERROR date,,2007-37-8
    ,,,,
     ,
    8,ERROR fields,,2007-37-8
    ,,,
    5,OK,,2021-07-30
    

  4. Import the data.

    (On the gs_loader client) Before importing data, ensure that you have the execute permission on the gs_loader tool and the write permission on the current directory. (gs_loader generates temporary files during processing and automatically deletes them after the import is complete.)

    gs_loader control=loader.ctl data=data.csv db=testdb bad=loader.bad guc_param=guc.txt errors=5 port=8000 passwd=************ user=load_user
    

    Execution result:

    gs_loader: version 0.1
     
     5 Rows successfully loaded. 
     
    log file is: 
     loader.log
    

gs_copy_summary records the COPY syntax that was called and its details. [badfile]_bad.log records error data and its details. To prevent the error data and details recorded during the last import from being overwritten, you are advised to use a different bad parameter for each import. To record error data and details in the error table pgxc_copy_error_log instead, enable the GUC parameter enable_copy_error_log. To delete data from a table, perform a TRUNCATE or DELETE operation on it.
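
For example, to keep the error records of every run, the bad file name can be made unique per invocation (a sketch; connection values are placeholders):

gs_loader control=loader.ctl data=data.csv db=testdb bad=loader_$(date +%Y%m%d%H%M%S).bad guc_param=guc.txt errors=5 port=8000 user=load_user passwd=********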

Parameters

Table 2 gs_loader parameters

help
  Displays help information.
  Value range: -

user, -U
  Database connection user.
  Value range: a string

passwd, -W
  User password.
  Value range: a string

db, -d
  Database name. This parameter is required.
  Value range: a string

host, -h
  Host name of the running server, the path of the Unix domain socket, or the domain name. You can specify multiple host addresses, separated by commas (,). If multiple host addresses are specified, the primary node is connected by default.
  Value range: see the gsql --host parameter.

port, -p
  Port number of the database server. One or more port numbers can be configured. When one port number is configured, all IP addresses use the same port for connection. When multiple port numbers are configured, their sequence must match the IP address sequence and the number of ports must equal the number of IP addresses; otherwise, an error is reported.
  Value range: see the gsql --port parameter.

create
  Specifies whether to create the pgxc_copy_error_log and gs_copy_summary tables.
  Value range: true or false. Default value: true.

clean
  Specifies whether to clear the error records.
  Value range: true or false. Default value: false.

data
  (Required) Data file. You can specify multiple data files, or use wildcards (*) and question marks (?) to match multiple data files.
  Value range: a string

control
  (Required) Name of the control file.
  Value range: a string

log
  Name of the log file.
  Value range: a string

bad
  Name of the file that records the error lines and details. You can also specify a directory; if you do not, the file name is generated from the data file name.
  Value range: a string

discard
  Name of the file that records the lines filtered out by the WHEN clause. You can also specify a directory; if you do not, the file name is generated from the data file name.
  Value range: a string

errors
  Maximum number of error lines allowed in a data file.
  Value range: an integer. Default value: 0.

skip
  Number of lines to skip at the beginning of a data file.
  Value range: an integer. Default value: 0.

bindsize
  Only syntax compatibility is implemented; the function itself is not.
  Value range: -

rows
  Number of rows of data to import before each commit.
  Value range: an integer in the range [1,2147483647].

compatible_nul
  Specifies whether to enable compatibility with null characters (0x00) in data. When enabled, null characters in a data file are converted to space characters (0x20) before the file is processed and imported.
  Value range: true or false. Default value: true.

binary
  Specifies whether the data file is a binary file exported in \COPY binary mode.
  Value range: true or false. Default value: false.

  • All parameters are in lowercase and are compatible with the gsql login mode, including -p port number, -h host, -d database, -U username, and -W password.
  • When the rows parameter is specified, the number of commit times cannot exceed 1000. Otherwise, the performance is affected. The number of commit times is approximately equal to the number of data rows in the data file divided by the value of rows. If the rows parameter is not specified, there is no default value for rows. In this case, the transaction is committed only once after all data is imported to the table.
  • Frequent commits of small amounts of data affect import performance. You are advised to set the rows parameter so that each commit covers more than 5 MB of data. On a common 16-vCPU, 128 GB server with one primary node and two standby nodes, importing 13 GB of data into a five-column table with multiple commits (5 MB per commit) runs at about 10 MB/s, which is basically the same as a single commit.
  • The compatible_nul parameter controls the value of the GUC parameter loader_support_nul_character.
    • compatible_nul=true corresponds to session-level set loader_support_nul_character='s2'.
    • compatible_nul=false corresponds to session-level set loader_support_nul_character='s1'.

      You are advised to set this parameter on the command line. Setting it through compatible_nul takes precedence over setting it through guc_param.

  • Currently, gs_loader supports compatibility only when data files contain nul characters. It does not support nul characters in .ctl control files. If the .ctl file contains the nul character, unexpected problems may occur.
  • After the binary parameter is set to true, the following requirements must be met (see the sketch after this list):
    • The data file must be a binary file exported in \COPY binary mode. However, the data file exported in this mode has poor compatibility and portability. You are advised to use \COPY to import the data file.
    • gs_loader converts the syntax in the control file to the simplest syntax in \COPY binary mode, that is, \COPY table_name FROM 'binary_file_path' BINARY;. Only the import mode, table name, as well as control, data, binary, guc_param, and database connection parameters in the control file are parsed. Other parameters are not parsed and do not take effect.
    • The command lines and control files of gs_loader must meet the following requirements:
      • Character set configuration is not supported.
      • The WHEN filter and DISCARD operation are not supported.
      • Error data cannot be directly written to BAD files when enable_copy_error_log is set to off. The default value of errors is unlimited, indicating that encoding exception data is recorded by default.
      • The CSV mode is not supported, delimiters and wrappers cannot be specified, and the TRAILING NULLCOLS syntax is not supported.
      • Data type configuration, POSITION configuration, and column expression usage are not supported.
      • The FILLER, CONSTANT, SEQUENCE, and NULLIF parameters are not supported.
      • The skip, rows, and compatible_nul parameters are not supported.
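
    For example, a binary round trip might look as follows (a minimal sketch; paths and connection values are placeholders, and the control file only needs the parts that are parsed in binary mode):

    # Export the table in \COPY binary mode from gsql.
    gsql -d testdb -p 8000 -U load_user -c "\copy loader_tbl TO '/tmp/loader_tbl.bin' BINARY"
    # Reload the file; gs_loader converts the control file to \COPY loader_tbl FROM 'binary_file_path' BINARY;
    gs_loader control=loader.ctl data=/tmp/loader_tbl.bin db=testdb binary=true port=8000 user=load_user passwd=********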

Control Files

  • Syntax
    LOAD [ DATA ]
    [CHARACTERSET char_set_name]
    [INFILE [directory_path] [filename ] ]
    [BADFILE [directory_path] [filename ] ]
    [OPTIONS(name=value)]
    [{ INSERT | APPEND | REPLACE | TRUNCATE }] 
    INTO TABLE table_name
    [{ INSERT | APPEND | REPLACE | TRUNCATE }] 
    [FIELDS CSV]
    [TERMINATED [BY] { 'string' }]
    [OPTIONALLY ENCLOSED BY { 'string' }]
    [TRAILING NULLCOLS]
    [ WHEN { (start:end) | column_name } {= | !=} 'string' ]
    [(
    col_name [ [ POSITION ({ start:end }) ]  ["sql_string"] ] | [ FILLER [column_type [external] ] ] | [ CONSTANT "string" ] | [ SEQUENCE ( { COUNT | MAX | integer } [, incr] ) ]|[NULLIF (COL=BLANKS)]
    [, ...]
    )]
    
  • Parameter description:
    • CHARACTERSET

      Character set.

      Value range: a string. Currently, the value can be 'AL32UTF8', 'zhs16gbk', or 'zhs32gb18030'.

      Note: The character set specified by CHARACTERSET in the control file must be the same as the encoding format of the file. Otherwise, an error is reported or garbled characters are displayed in the imported data.

    • INFILE

      The current keyword is invalid and needs to occupy a separate line in the control file. The keyword is ignored during running. You need to specify the corresponding data file in the gs_loader command line parameters.

    • BADFILE

      The current keyword is invalid and will be ignored during running. If no .bad file is specified in the gs_loader command, a .bad file will be generated based on the name of the corresponding control file.

    • OPTIONS

      Only the skip and rows parameters take effect. skip=n indicates that the first n records are skipped during import, and rows=n indicates the number of rows to be imported before a commit. If both the command line and control file are specified, the command line has a higher priority.
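
      For example, a control file that skips one header line and commits every 1000 rows might begin as follows (a sketch based on the loader_tbl example above):

      LOAD DATA
      OPTIONS(skip=1, rows=1000)
      truncate into table loader_tbl
      fields terminated by ','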

    • INSERT | APPEND | REPLACE | TRUNCATE

      Import mode.

      INSERT: If the table contains data, an error is reported.

      APPEND: Data is inserted directly.

      REPLACE: If the table contains data, all data is deleted and then inserted.

      TRUNCATE: If the table contains data, all data is deleted and then inserted.

      • When writing a control file (.ctl), you can specify the import mode (INSERT | APPEND | REPLACE | TRUNCATE) both before and after the INTO TABLE table_name clause. If both are specified, the mode specified after the clause takes precedence and overwrites the one specified before it.
      • When multiple gs_loader sessions are started to concurrently import data to the same table, you are advised to use the APPEND mode. If you use the INSERT, REPLACE, or TRUNCATE mode, an import error may occur or the imported data may be incomplete.
    • FIELDS CSV

      Specifies that the CSV mode of COPY is used. In CSV mode, the default separator is a comma (,), and the default quotation mark is a double quotation mark (").

      In the current CSV mode, newlines that are enclosed with double quotation marks are considered as part of the column data.
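
      For example, the following two physical lines form a single CSV record; the quoted line break becomes part of the second column (illustrative data):

      1,"first line
      second line",2007-07-08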

    • table_name

      Specifies the name (possibly schema-qualified) of an existing table.

      Value range: an existing table name

    • TERMINATED [BY] { 'string' }

      The string that separates columns within each row (line) of the file. It cannot be longer than 10 bytes.

      Value range: The separator cannot include any of the following characters: \.abcdefghijklmnopqrstuvwxyz0123456789. The nul character cannot be used as a separator.

      Default value: a tab character in text format and a comma in CSV format.

      After enabling nul character compatibility (compatible_nul=true), if the specified separator is a space character (0x20), note that the separator is the space character that exists in the data file instead of the space character converted from the nul character.

    • OPTIONALLY ENCLOSED BY { 'string' }

      Specifies a quoted character string for a CSV file.

      The default value is double quotation marks (") only in CSV mode that is explicitly specified by the FIELDS CSV parameter.

      In other modes, there is no default value.

      • When you set OPTIONALLY ENCLOSED BY { 'string' }, there should be no quotation mark on the left of the data; otherwise, the number of quotation marks on either left or right must be an odd number but they do not have to be equal.
      • Currently, OPTIONALLY ENCLOSED BY { 'string' } is supported only in CSV mode. If OPTIONALLY ENCLOSED BY { 'string' } is specified, the system enters the CSV mode by default.
    • TRAILING NULLCOLS

      Specifies how rows in the source data file whose trailing columns are missing are handled during import.

      If one or more columns at the end of a row are null, the columns are imported to the table as null values. If this parameter is not set, an error message is displayed, indicating that the error column is null. In this case, the data in this row is processed as an error.
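
      For example, with fields terminated by ',' and TRAILING NULLCOLS, the following input line loads into the four-column loader_tbl above as (5, 'OK', NULL, NULL) instead of being treated as an error:

      5,OK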

    • WHEN { (start:end) | column_name } {= | !=}

      Filters rows by character string between start and end or by column name.

      Value range: a string.

      • When the GUC parameter enable_copy_when_filler is set to on (default value), data can be filtered based on the FILLER column. When the GUC parameter enable_copy_when_filler is set to off, this usage is not supported.
      • The WHEN condition cannot be followed by special characters such as '\0' and '\r'.
    • POSITION ({ start:end })

      Processes a column by extracting the character string between start and end.

    • "sql_string"

      Processes columns and calculates column values based on column expressions. For details, see Column expression below.

      Value range: a string.

    • FILLER

      Processes columns. If FILLER occurs, this column is skipped.

      Currently, FILLER and POSITION ({ start:end }) cannot be used at the same time.
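
      For example, to load the first and third fields of each record while skipping the second (a sketch):

      (
          id integer external,
          unused FILLER,
          dt date
      )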

    • column_type [external]

      Processes the imported data according to different data types. For details, see Data types below.

    • CONSTANT

      Processes columns and sets the inserted columns to constants.

      Value range: a string.

    • SEQUENCE ( { COUNT | MAX | integer } [, incr] )

      Processes columns to generate the corresponding sequence values.

      • COUNT: The count starts based on the number of rows in the table.
      • MAX: The count starts from the maximum value of this column in the table.
      • integer: The count starts from the specified value.
      • incr: indicates the increment each time.
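
      For example, a column definition that numbers rows starting from 100 with an increment of 1 (a sketch):

      (
          id SEQUENCE(100, 1),
          name char(32)
      )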
    • NULLIF

      Processes columns. In multi-row import scenarios, if a column name is not followed by sysdate, a constant, POSITION, or a column expression, and the NULLIF keyword is not specified for it, the column is left empty.

      Currently, only the COL POSITION() CHAR NULLIF (COL=BLANKS) syntax is supported. For details, see the NULLIF use cases below.

    • OPTIONS, INFILE, and BADFILE are not supported. Syntax errors are not reported only in specific scenarios.
    • When the GUC parameter enable_copy_error_log enables the error table, gs_loader fills the .bad file from the rawrecord column of the error table. If an error row cannot be decoded in the current encoding, the error table does not record rawrecord, and a blank line is written to the .bad file.
    • When gs_loader sets the GUC parameter a_format_load_with_constraints_violation to support non-rollback upon constraint conflicts, if a table has a BEFORE/AFTER ROW INSERT trigger, a maximum of 10,000,000 rows can be committed at a time.
    • gs_loader does not support statement-level triggers when the GUC parameter a_format_load_with_constraints_violation is set to support non-rollback upon constraint conflicts.
  • If the data in the .bad file is empty, refer to the source file and line number recorded in the error table. (When the encoding cannot be identified, no .bad file content is written and only a blank line is recorded.)
    loader=# select * from pgxc_copy_error_log;
           relname        |           begintime           | filename | lineno | rawrecord |                     detail                      
    ----------------------+-------------------------------+----------+--------+-----------+-------------------------------------------------
     public.test_gsloader | 2023-02-09 09:20:33.646843-05 | STDIN    |      1 |           | invalid byte sequence for encoding "UTF8": 0xb4
    (1 row)
    // In the preceding example, open the data file imported by the loader and go to line 1 of the data text to find the source data.
  • NULLIF use cases
    // Create a table.
    create table gsloader_test_nullif(
    col1   varchar2(100) not null enable,
    col2   number(5,0) not null enable,
    col3   varchar2(200) not null enable,
    col4   varchar2(34) not null enable,
    col5   varchar2(750),
    col6   number(20,0),
    col7   varchar2(4000),
    col8   varchar2(200)
    );
    // Data file test.csv
    6007 17060072021-09-0360070001102010000000230          1        600700010000218               0        1        1        229465        3
    6007 17060072021-09-0360070001102010000000299          1        600700010000282               0        1        1        230467        3
    6007 17060072021-09-0360070001102010000000242          1        600700010000255               0        1        1        226400        3
    6007 17060072021-09-0360070001102010000000202          1        600700010000288               0        1        1        219107        3
    6007 17060072021-09-0360070001102010000000294          1        600700010000243               0        1        1        204404        3
    6007 17060072021-09-0360070001102010000000217          1        600700010000270               0        1        1        226644        3
    // Control file test.ctl
    LOAD DATA  
    CHARACTERSET UTF8
    TRUNCATE
    INTO TABLE gsloader_test_nullif
    TRAILING NULLCOLS
    (COL1 POSITION(1:10) CHAR NULLIF (COL1 = BLANKS),
    COL2  POSITION(11:14) CHAR NULLIF (COL2 = BLANKS),
    COL3  POSITION(21:30) CHAR NULLIF (COL3 = BLANKS),
    COL4  POSITION(31:40) CHAR NULLIF (COL4 = BLANKS),
    COL5  sysdate,
    COL6,
    COL7, 
    COL8 POSITION(71:80) CHAR NULLIF (COL8 = BLANKS))
    // Import data.
    gs_loader -p xxx host=xxx control=test.ctl  data=test.csv -d testdb -W xxx 
    // Result: Imported.
    loader=# select * from gsloader_test_nullif;
        col1    | col2 |    col3    |    col4    |        col5         | col6 | col7 |   col8
    ------------+------+------------+------------+---------------------+------+------+-----------
     6007 17060 |  720 | 0360070001 | 1020100000 | 2023-05-17 20:32:03 |      |      | 010000218
     6007 17060 |  720 | 0360070001 | 1020100000 | 2023-05-17 20:32:03 |      |      | 010000282
     6007 17060 |  720 | 0360070001 | 1020100000 | 2023-05-17 20:32:03 |      |      | 010000255
     6007 17060 |  720 | 0360070001 | 1020100000 | 2023-05-17 20:32:03 |      |      | 010000288
     6007 17060 |  720 | 0360070001 | 1020100000 | 2023-05-17 20:32:03 |      |      | 010000243
     6007 17060 |  720 | 0360070001 | 1020100000 | 2023-05-17 20:32:03 |      |      | 010000270
    (6 rows)

    As the imported data shows, after the NULLIF keyword is used, the columns with NULLIF or sysdate processing are imported as specified, while the columns for which no processing is specified (col6 and col7) are left empty.

  • Column expression

    gs_loader supports expression conversion and scenario extension for specified columns.

    ({ column_name [ data_type ] [ AS transform_expr ] } [, ...])

    data_type specifies the data type of the column in the expression parameter. transform_expr specifies the target expression and returns the result value whose data type is the same as that of the target column in the table.

    Example:

    • The column type is not specified in the .ctl file, and the source data does not meet the column restrictions (data type and length restrictions) in the table.
      // Create a table.
      create table t_test(id int, text varchar(5));
      // Data file test.csv
      addf2,bbbbaaa,20220907,
      // Control file test.ctl
      Load Data
      TRUNCATE INTO TABLE t_test
      fields terminated by ','
      TRAILING NULLCOLS(
      id "length(trim(:id))",
      text "replace(trim(:text),'bbbb','aa')"
      )
      // guc_param file
      set a_format_copy_version='s1';
      // Import data.
      gs_loader -p xxx host=xxx control=test.ctl  data=test.csv -d testdb -W xxx guc_param=test_guc.txt
      // Result: Imported.
      select * from t_test;
       id | text  
      ----+-------
        5 | aaaaa
      (1 row)
    • The column type is not specified in the .ctl file, and the implicit type conversion is performed. (You are advised to add compatibility parameters because the implicit type conversion is involved.)
      // Create a table.
      create table test(mes int, mes1 text, mes2 float8, mes3 timestamp with time zone, mes4 INTEGER);
      // Data file
      cat load_support_transform.data
      1,mmoo,12.6789,Thu Jan 01 15:04:28 1970 PST,32767
      2,yyds,180.883,Thu Jun 21 19:00:00 2012 PDT,32768
      // Control file
      cat load_support_transform.ctl 
      Load Data
      TRUNCATE INTO TABLE test
      fields terminated by ','
      TRAILING NULLCOLS(
      mes,
      mes1 "mes1 || mes2",
      mes2 "mes2 + 1",
      mes3 "date_trunc('year', mes3)",
      mes4
      )
      // guc_param file
      cat test_guc.txt
      set a_format_copy_version='s1';
      set a_format_dev_version='s2';
      set a_format_version='10c';
      // Import data.
      gs_loader -p xxx host=xxx control=load_support_transform.ctl data=load_support_transform.data -d testdb -W xxx guc_param=test_guc.txt 
      // Result: Imported.
      select * from test;
       mes |    mes1     |  mes2   |          mes3          | mes4  
      -----+-------------+---------+------------------------+-------
         1 | mmoo12.6789 | 13.6789 | 1970-01-01 00:00:00+08 | 32767
         2 | yyds180.883 | 181.883 | 2012-01-01 00:00:00+08 | 32768
  • Data types

    Correspond to column_type [external] in the control file. During data loading, data is processed based on the data type. gs_loader classifies data types into common and special data types.

    • Common data types
      • CHAR [(length)]:

        Reads data based on a column separator and converts the value to the CHAR type. length indicates the maximum length of a single piece of data, in bytes. Generally, one character occupies one byte. The value can be left blank. The scenarios are as follows:

        • If a length is not declared, the value inherits the maximum length value declared by POSITION.
        • If a length is declared, it overwrites the maximum length declared by POSITION.
        • If neither length nor POSITION declares a length, the value is set based on the length between separators.
        • The priority of the length declaration is as follows: length > POSITION > separator.
        • If none of length, POSITION, or a separator is specified, data is read from the current position to the line terminator.
        • If the actual data length exceeds the maximum value declared by length, an error is reported.
      • INTEGER external [(length)]:

        Reads data based on a column separator and converts the value to the INTEGER type. The rules for using length are the same as those described in "CHAR [(length)]."

      • FLOAT external [(length)]:

        Reads data based on a column separator and converts the value to the FLOAT type. The rules for using length are the same as those described in "CHAR [(length)]."

      • DECIMAL external (length):

        Reads data based on a column separator and converts the value to the DECIMAL type. The rules for using length are the same as those described in "CHAR [(length)]."

      • TIMESTAMP:

        Reads data based on a column separator and converts the value to the TIMESTAMP type.

      • DATE:

        Reads data based on a column separator and converts the value to the DATE type.

      • DATE external:

        Reads data based on a column separator and converts the value to the DATE type.

      • SYSDATE:

        Obtains the system time when the corresponding insertion is performed in the database. This value cannot be referenced; if it is referenced, the content obtained is the literal string SYSDATE.

    • Special data types
      • INTEGER:

        Ignores the column separator, reads four-byte characters, saves them based on the little-endian storage logic, parses each character into a hexadecimal ASCII code value, and converts the value into a decimal number.

      • SMALLINT:

        Ignores the column separator, reads two-byte characters, saves them based on the little-endian storage logic, parses each character into a hexadecimal ASCII code value, and converts the value into a decimal number.

        Example:

        // Create a table.
        create table t_spec(col1 varchar(10), col2 varchar(10));
        // Data file
        cat t_spec.txt
        1234,5678,
        // Control file
        cat t_spec.ctl
        Load Data
        TRUNCATE INTO TABLE t_spec
        fields terminated by ','
        TRAILING NULLCOLS(
        col1 position(2:6) integer,
        col2 position(5:8) smallint
        )
        // guc_param file
        cat test_guc.txt
        set a_format_copy_version='s1';
        set a_format_dev_version='s2';
        set a_format_version='10c';
        // Import data.
        gs_loader -p xxx host=xxx control=t_spec.ctl data=t_spec.txt -d testdb -W xxx guc_param=test_guc.txt
        // Result: Imported.
        select * from t_spec;
           col1    | col2
        -----------+-------
         741618482 | 13612
        (1 row)
      • RAW:

        Parses each character into an ASCII code value. The backslash (\) is not used as an escape character.

        Restriction: Separators cannot be used in RAW data.

        Example:

        // Create a table.
        create table t_raw(col raw(50));
        // Data file
        cat t_raw.txt
        12\n\x78!<~?'k^(%s)>/c[$50]
        // Control file
        cat t_raw.ctl
        Load Data
        TRUNCATE INTO TABLE t_raw
        TRAILING NULLCOLS(
        col position(1:50) raw
        )
        // guc_param file
        cat test_guc.txt
        set a_format_copy_version='s1';
        set a_format_dev_version='s2';
        set a_format_version='10c';
        // Import data.
        gs_loader -p xxx host=xxx control=t_raw.ctl data=t_raw.txt -d testdb -W xxx guc_param=test_guc.txt
        // Result: Imported.
        select * from t_raw;
                                  col
        --------------------------------------------------------
         31325C6E5C783738213C7E3F276B5E282573293E2F635B2435305D
        (1 row)
      • In the multi-column import scenario, if the GUC parameter is not specified, some positions and separators cannot be used at the same time.
      • In the multi-column import scenario, the SYSDATE and CONSTANT operations cannot be used together with the POSITION operation.
      • When importing data of a specified data type, you need to use guc_param to set a_format_copy_version for common data types and use guc_param to set a_format_copy_version, a_format_dev_version, and a_format_version for special data types.
      • If a column expression involves a system function, you need to use guc_param to set a_format_dev_version and a_format_version based on the corresponding function.
      • If the data type contains length, the value of length must be set to an integer greater than 0. The special data type RAW(length) is used differently from common types. For example, if POSITION is not specified for the common type INTEGER EXTERNAL(length), an error is reported when the value of length is less than the data length of the corresponding column in a text file (such as .csv or .txt). If the value of length is greater than the data length of the corresponding column in a text file (such as .txt), the result of the INTEGER EXTERNAL type is output. If POSITION is not specified for the special data type RAW(length), the first length characters are read.
      • If POSITION(start:end) is specified, the value of start must be set to an integer greater than 0, and the value of end must be greater than or equal to the value of start.
      • When POSITION is specified, spaces at the end of a column are not omitted when the column content is processed. If POSITION is not specified, spaces at the end of the column content will be omitted. To retain spaces, ensure that a_format_version has been set and add set behavior_compat_options='char_coerce_compat'; to the file specified by guc_param. For details, see "behavior_compat_options" in the Administrator Guide.
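
        For example, to retain trailing spaces in this case, the guc_param file might contain the following (a sketch; see "behavior_compat_options" in the Administrator Guide):

        set a_format_version='10c';
        set behavior_compat_options='char_coerce_compat';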
      • During concurrent import, if multiple names of files specified by discard or bad point to files with the same name in the same directory, gs_loader stops importing the next file and reports an error. If a previous file has been imported, the file will be overwritten.

        The following error is reported:

        ERROR: An error occurred. Please check logfile.

        In the log file:

        …lock failed: Resource temporarily unavailable…
      • If a column in the control file is assigned a value whose expression does not use the column's own content, the column does not occupy a field position in the data file.

        For example, the control file is as follows:

        Load Data
        TRUNCATE INTO TABLE gsloader
        fields terminated by ','
        TRAILING NULLCOLS(
        id "trim(:id)",
        text "to_char(SYSDATE,'yyyymmdd')",
        gmt_create  "trim(:gmt_create)",
        create_str "trim(:create_str)"
        )

        The data file is as follows:

        11,22,33,

        The import result is as follows:

        loader=# select * from gsloader;
         id |        text         | gmt_create | create_str
        ----+---------------------+------------+------------
         11 | 2023-02-08 16:00:54 | 22         | 33