
gs_loader for Importing Data

Description

gs_loader is used to import data. gs_loader converts the syntax supported by the control file to the \COPY syntax, uses the existing \COPY function to import data, and records the \COPY result in logs.

gs_loader does not support M-compatible databases.

gs_loader supports PDBs.

During a rolling upgrade from an earlier version to this version, do not use the gs_loader tool before all nodes are upgraded.

Prerequisites

  • Before using gs_loader, ensure that the gs_loader version is consistent with the gsql version and database version.
  • After installing and deploying gs_loader, add the tool path to the PATH environment variable (see the example after this list). gs_loader supports SSL encrypted communication and is used in the same way as gsql.
  • Set the log level so that developers can view the tool's runtime information; after the setting, the information is printed on the console. Choose one of the following levels:
    export gs_loader_log_level=debug
    export gs_loader_log_level=info
    export gs_loader_log_level=warning
    export gs_loader_log_level=error
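
    For example, assuming the tool package was decompressed to /opt/bin as described under Installation and Deployment below, making the tools available on PATH might look like this:

    source /opt/bin/gsql_env.sh    # sets the environment for gsql and gs_loader
    which gs_loader                # confirm that the tool is now on PATH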

Installation and Deployment

Install and configure the gs_loader client tool on the server where source data files are stored so that you can use the gs_loader tool to import data.

  1. Create a directory for storing the gs_loader tool package.

    mkdir -p /opt/bin

  2. Upload the gsql tool package to the created directory.

    Taking the tool package for EulerOS as an example, upload the gsql tool package GaussDB-Kernel_Database version number_OS version number_64bit_gsql.tar.gz in the software installation package to the created directory.

  3. Go to the directory where the tool package is located and decompress the package.

    cd /opt/bin
    tar -zxvf GaussDB-Kernel_Database version number_OS version number_64bit_gsql.tar.gz
    source gsql_env.sh

  4. Verify the tool location.

    which gs_loader

  5. Verify the client version information.

    The gs_loader tool version number corresponds to the gsql tool version number. You can directly query the gsql client version number to verify the client version information.
    gsql -V

  6. Verify that the database version is the same as the client tool version.

    Use gsql to connect to the database and run the following command:
    select version();
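
    For example, with the tools on PATH, both version checks can be run from the shell (the connection values below are placeholders):

    gsql -V
    gsql -d testdb -p 8000 -U load_user -W ******** -c "select version();"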

Permission

The application scenarios of gs_loader are classified into separation-of-duties and non-separation-of-duties scenarios. You can set the GUC parameter enableSeparationOfDuty to on or off to enable or disable the separation-of-duties function.

The GUC parameter enable_copy_error_log controls whether the error table gs_copy_error_log and the log table gs_copy_summary_log are used. It is off by default, meaning that the tables are not used: error records are written directly to the gs_loader .bad file, and logs are written directly to the gs_loader log file. If the parameter is set to on, error records are inserted into the error table gs_copy_error_log and log records into the log table gs_copy_summary_log. For details about the error table, see Table 1; for details about the log table, see Table 2.

-- Enable separation of duties.
gs_guc set -Z datanode -N all -I all -c "enableSeparationOfDuty = on"

-- Disable separation of duties.
gs_guc set -Z datanode -N all -I all -c "enableSeparationOfDuty = off"

-- When enable_copy_error_log is set to on, the error table gs_copy_error_log is used.
gs_guc reload -Z datanode -N all -I all -c "enable_copy_error_log = on"
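
After the reload, you can confirm from any gsql session that the setting has taken effect:

gaussdb=# SHOW enable_copy_error_log;
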
Table 1 Error table pg_catalog.gs_copy_error_log

  Column    | Type                     | Description
  ----------+--------------------------+---------------------------------------------------------------------
  relname   | text                     | Table name, in the form Schema name.Table name.
  begintime | timestamp with time zone | Time when the data format error was reported.
  filename  | text                     | Name of the source data file in which the data format error occurs.
  lineno    | bigint                   | Number of the row in the source data file where the error occurs.
  rawrecord | text                     | Raw record of the data format error in the source data file.
  detail    | text                     | Error details.
  custom_id | text                     | Value of copy_custom_id in a copy.

Table 2 Columns in the pg_catalog.gs_copy_summary_log table

  Column      | Description
  ------------+--------------------------------------------------------------------------------
  relname     | Name of the target table to be imported.
  begintime   | Start time of the import task.
  endtime     | End time of the import task.
  id          | ID of the import transaction.
  pid         | ID of the worker thread for the current import.
  readrows    | Total number of data rows read by the import task.
  skiprows    | Total number of data rows skipped in the import task.
  loadrows    | Number of data rows successfully imported in the current import task.
  errorrows   | Number of error data rows in the current import task.
  whenrows    | Number of data rows that violate the WHEN filter criterion in the current task.
  allnullrows | Number of data rows in which all columns are empty.
  detail      | Summary of the import task: rows successfully imported, error rows, rows violating the WHEN condition, and blank rows.
  custom_id   | Value of copy_custom_id in a copy.

If enableSeparationOfDuty is set to off, the user can be a common database user or an administrator. If the user is a common user, an administrator needs to grant permissions to the common user; an administrator can use gs_loader directly.

  1. Create a user as an administrator.
    gaussdb=# CREATE USER load_user WITH PASSWORD '********';
  2. (Optional) Grant permissions on the pg_catalog.gs_copy_summary_log table to the new user.
    • The public.gs_copy_summary table previously used by gs_loader has been replaced by pg_catalog.gs_copy_summary_log, and the original public.gs_copy_summary table is discarded. For query compatibility, the system view pg_catalog.gs_copy_summary has been added to query data in pg_catalog.gs_copy_summary_log. Because public.gs_copy_summary may share its name with a user table, exercise caution when migrating data to pg_catalog.gs_copy_summary_log and make sure the source table is not a user table; after the migration is complete, delete public.gs_copy_summary. Since pg_catalog.gs_copy_summary exists, the \d(+) gsql meta-command preferentially finds it, although it is not displayed in the list output because it is a system view; you can still reference public.gs_copy_summary directly.
    • The pg_catalog.gs_copy_summary_log table is compatible only with the SELECT operation and is incompatible with the DELETE and TRUNCATE operations.
    gaussdb=# GRANT INSERT,SELECT,DELETE ON pg_catalog.gs_copy_summary_log TO load_user;
  3. (Optional) Grant permissions on the error table gs_copy_error_log to the new user.
    • The public.pgxc_copy_error_log table previously used by gs_loader has been replaced by pg_catalog.gs_copy_error_log, and the original public.pgxc_copy_error_log table is discarded. For query compatibility, the system view pg_catalog.pgxc_copy_error_log has been added to query data in pg_catalog.gs_copy_error_log. Because public.pgxc_copy_error_log may share its name with a user table, exercise caution when migrating data to pg_catalog.gs_copy_error_log and make sure the source table is not a user table; after the migration is complete, delete public.pgxc_copy_error_log. Since pg_catalog.pgxc_copy_error_log exists, the \d(+) gsql meta-command preferentially finds it, although it is not displayed in the list output because it is a system view; you can still reference public.pgxc_copy_error_log directly.
    • The pg_catalog.gs_copy_error_log table is compatible only with the SELECT operation and is incompatible with the DELETE and TRUNCATE operations.
    gaussdb=# GRANT INSERT,SELECT,DELETE ON pg_catalog.gs_copy_error_log TO load_user;
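
    A quick way to verify the grants is to query both tables as the new user, for example:

    gaussdb=# \c - load_user
    gaussdb=# SELECT count(*) FROM pg_catalog.gs_copy_error_log;
    gaussdb=# SELECT count(*) FROM pg_catalog.gs_copy_summary_log;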

If enableSeparationOfDuty is set to on, the user can be a common database user or an administrator. Create the pgxc_copy_error_log and gs_copy_summary tables in their respective schemas and add indexes. No permission granting is required.

The pgxc_copy_error_log and gs_copy_summary tables always exist in the pg_catalog schema. In the search_path setting, the pg_catalog schema takes precedence over user schemas. Therefore, to find the tables in a user schema with the \d(+) gsql meta-command in the separation-of-duties scenario, explicitly specify the user schema, as shown in the example after the steps below.

  1. Create a user as the initial user.
    gaussdb=# CREATE USER load_user WITH PASSWORD '********';
  2. Switch to the new user as the initial user.
    gaussdb=# \c - load_user
  3. (Optional) Create a gs_copy_summary table and add an index.

    If the GUC parameter enable_copy_error_log is not set (off by default) or is set to off, you do not need to use the log table and do not need to create it. Otherwise, you need to create the log table.

    gaussdb=# CREATE TABLE load_user.gs_copy_summary(relname varchar, begintime timestamptz, endtime timestamptz, id bigint, pid bigint, readrows bigint, skiprows bigint, loadrows bigint, errorrows bigint, whenrows bigint, allnullrows bigint, detail text);
    gaussdb=# CREATE INDEX gs_copy_summary_idx ON load_user.gs_copy_summary(id);
  4. (Optional) Create a pgxc_copy_error_log table and add an index.

    If the GUC parameter enable_copy_error_log is not set (off by default) or is set to off, you do not need to use the error table and do not need to create it. Otherwise, you need to create the error table.

    -- Create the pgxc_copy_error_log table.
    gaussdb=# CREATE TABLE load_user.pgxc_copy_error_log (relname varchar, begintime timestamptz, filename varchar, lineno int8, rawrecord text, detail text);
    
    -- Create an index.
    gaussdb=# CREATE INDEX copy_error_log_relname_idx ON load_user.pgxc_copy_error_log(relname);
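
    Because the pg_catalog schema takes precedence in search_path, qualify the user schema explicitly when inspecting the tables just created, for example:

    gaussdb=# \d load_user.pgxc_copy_error_log
    gaussdb=# \d load_user.gs_copy_summary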

Format

LOAD [ DATA ]
[CHARACTERSET char_set_name]
[INFILE [directory_path] [filename ] ]
[BADFILE [directory_path] [filename ] ]
[OPTIONS(name=value)]
[{ INSERT | APPEND | REPLACE | TRUNCATE }] 
INTO TABLE table_name
[{ INSERT | APPEND | REPLACE | TRUNCATE }] 
[FIELDS CSV]
[TERMINATED [BY] { 'string' }]
[OPTIONALLY ENCLOSED BY { 'string' }]
[TRAILING NULLCOLS]
[ WHEN { (start:end) | column_name } {= | !=} 'string' ]
[(
col_name [ [ POSITION ({ start:end }) ]  ["sql_string"] ] | [ FILLER [column_type [external] ] ] | [ CONSTANT "string" ] | [ SEQUENCE ( { COUNT | MAX | integer } [, incr] ) ]|[NULLIF (COL=BLANKS)]
[, ...]
)]
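
For example, a minimal control file that exercises this syntax might look as follows (table and column names are illustrative; see Examples for complete, runnable scenarios):

LOAD DATA
CHARACTERSET UTF8
OPTIONS(skip=1)
APPEND INTO TABLE my_schema.my_table
FIELDS CSV
TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
    id   integer external,
    name char(32),
    note FILLER
)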

Parameters

Table 3 gs_loader parameters

  • help: Displays help information.
  • user, -U: Database connection user. Value: a string.
  • passwd, -W: User password. Value: a string.
  • db, -d: (Required) Database name. Value: a string.
  • host, -h: Host name of the running server, the UDS path, or the domain name. You can specify multiple host addresses as a comma-separated string; in that case, the primary node is connected by default. See Table 9.
  • port, -p: Port number of the database server. One or more port numbers can be configured. When one port number is configured, all IP addresses use the same port for connection. When multiple port numbers are configured, their sequence must match the IP address sequence, and the number of ports must equal the number of IP addresses; otherwise, an error is reported. See Table 9.
  • create: Specifies whether to create the pgxc_copy_error_log and gs_copy_summary tables. In the current version, the two tables are created by default, so this parameter has no effect and is reserved only for compatibility. Value: true or false; default: true.
  • clean: Specifies whether to clear the error records. Value: true or false; default: false.
  • data: (Required) Data file. You can specify multiple data files, or use wildcards (*) and question marks (?) to match multiple data files. Value: a string.
  • control: (Required) Name of the control file. Value: a string.
  • log: Name of the log file. Value: a string.
  • bad: Name of the file that records the error lines and details. You can also specify a directory; if no file name is specified, it is generated based on the data file name. Value: a string.
  • discard: Name of the file that records the lines filtered out by WHEN. You can also specify a directory, in which case the file name is generated based on the data file name. Value: a string.
  • errors: Maximum number of error lines allowed in a data file. Value: an integer; default: 0.
  • skip: Number of lines to skip at the beginning of a data file. Value: an integer; default: 0.
  • limit: Maximum number of rows that can be imported. Value: an integer; unlimited by default.
  • bindsize: Syntax compatibility only; the function is not implemented.
  • rows: Number of rows of data to import before each commit. Value: an integer in the range [1, 2147483647].
  • guc_param: File containing session-level set statements to apply during the import (see the notes below and the Examples). Value: a string.
  • compatible_nul: Specifies whether to enable compatibility with the NUL character (0x00) in data. When enabled, NUL characters in a data file are converted to space characters (0x20) before the file is processed and imported. Value: true or false; default: true.
  • compatible_illegal_chars: Specifies whether to enable error tolerance for invalid characters. The error tolerance rules and restrictions are the same as those of COMPATIBLE_ILLEGAL_CHARS in the COPY syntax. For details, see the description of the COMPATIBLE_ILLEGAL_CHARS parameter in COPY_OPTION in "SQL Syntax > COPY" in Developer Guide. Value: true or false; default: false.
  • parallel: Degree of parallelism for data import. A value greater than 1 enables parallel import; 1 imports data serially. The maximum degree of parallelism cannot exceed twice the number of CPU cores on the client. When the client runs in a container, the CPU count obtained is that of the host, which may exceed what the container can actually use, so you are advised to keep the degree of parallelism below twice the number of CPUs available to the client. Parallel import is implemented as multiple concurrent transactions on multiple threads, so the effective parallelism is limited by the server's thread pool model, and a high degree of parallelism increases the pressure on the server; set this value based on the actual situation. Value: an integer from 1 to twice the number of CPU cores; default: 1.
  • binary: Specifies whether the data file is a binary file exported in \COPY binary mode. Value: true or false; default: false.
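
Putting several of these parameters together, a typical invocation might look like the following (host, port, and file names are placeholders; see Examples for complete scenarios):

gs_loader control=loader.ctl data=data.csv db=testdb host=127.0.0.1 port=8000 user=load_user passwd=******** bad=loader.bad log=loader.log errors=10 skip=1 rows=100000 parallel=2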

  • All parameters are in lowercase and are compatible with the gsql login mode, including -p port number, -h host, -d database, -U username, and -W password.
  • OPTIONS, INFILE, and BADFILE are not functionally supported; they are accepted for syntax compatibility only, and syntax errors are not reported for them except in specific scenarios.
  • If the GUC parameter enable_copy_error_log is set to enable the error table, gs_loader uses a .bad file to record the errors from the rawrecord column of the error table. If a raw record cannot be read because its encoding cannot be recognized, the record is written to the .bad file as a blank line.
  • When gs_loader sets the GUC parameter a_format_load_with_constraints_violation to support non-rollback upon constraint conflicts, if a table has a BEFORE/AFTER ROW INSERT trigger, a maximum of 10,000,000 rows can be committed at a time.
  • gs_loader does not support statement-level triggers when the GUC parameter a_format_load_with_constraints_violation is set to support non-rollback upon constraint conflicts.
  • When the rows parameter is specified, the number of commit times cannot exceed 1000. Otherwise, the performance is affected. The number of commit times is approximately equal to the number of data rows in the data file divided by the value of rows. If the rows parameter is not specified, there is no default value for rows. In this case, the transaction is committed only once after all data is imported to the table.
  • Frequent commits of small amounts of data affect the data import performance. You are advised to set the rows parameter so that the amount of data committed each time is greater than 5 MB. On a typical server with 16 vCPUs and 128 GB of memory, with one primary node and two standby nodes, importing 13 GB of data into a table with five columns runs at about 10 MB/s with multiple commits (5 MB committed each time), which is basically the same rate as a single commit.
  • The compatible_nul parameter controls the value of the GUC parameter loader_support_nul_character.
    • compatible_nul=true corresponds to session-level set loader_support_nul_character='s2'.
    • compatible_nul=false corresponds to session-level set loader_support_nul_character='s1'.

      You are advised to set this behavior on the command line: a setting made through compatible_nul takes precedence over a direct GUC parameter setting.

  • Currently, gs_loader supports compatibility only when data files contain NUL characters. It does not support NUL characters in .ctl control files. If the .ctl file contains the NUL character, unexpected problems may occur.
  • When gs_loader is used to import data, if transcoding is not required, the maximum size of a single row of data is 1 GB minus 1 byte. If transcoding is required, the maximum size of a single row of data is 256 MB minus 1 byte. If the data volume in a single row is too large and the value of max_process_memory is too small, an error message is displayed, indicating that the memory is insufficient. In this case, you need to adjust the value of max_process_memory and try again.
  • It is recommended that a single file to be imported be 1 GB or smaller. gs_loader imposes no limit on the size of a single file, but importing a large file is time-consuming. You are therefore advised to split a large data file and start multiple gs_loader processes to append data to the table. (If a truncate is required, run the TRUNCATE statement separately instead of writing the truncate operation into the control file.) When CPU resources are sufficient, this method can effectively improve the import speed.
  • In a basic data migration scenario, you are advised to delete the index on the table and disable the trigger on the table. After the data migration is complete, rebuild the index and restore the trigger on the table. This is helpful to improve the import performance.
  • After the binary parameter is set to true, the following requirements must be met:
    • The data file must be a binary file exported in \COPY binary mode. However, the data file exported in this mode has poor compatibility and portability. You are advised to use \COPY to import the data file.
    • gs_loader converts the syntax in the control file to the simplest syntax in \COPY binary mode, that is, \COPY table_name FROM 'binary_file_path' BINARY;. Only the import mode and table name in the control file, together with the control, data, binary, guc_param, and database connection parameters, are parsed; other parameters are not parsed and do not take effect.
    • The command lines and control files of gs_loader must meet the following requirements:
      • Character set configuration is not supported.
      • The WHEN filter and DISCARD operation are not supported.
      • If enable_copy_error_log is set to off, error data cannot be directly written into .bad files, and log data cannot be directly written into log files. If enable_copy_error_log is set to on, error data cannot be directly written into the error table, and log data cannot be directly written into the log table.
      • The CSV mode is not supported, delimiters and wrappers cannot be specified, and the TRAILING NULLCOLS syntax is not supported.
      • Data type configuration, POSITION configuration, and column expression usage are not supported.
      • The FILLER, CONSTANT, SEQUENCE, and NULLIF parameters are not supported.
      • The skip, rows, compatible_nul, compatible_illegal_chars, and parallel parameters are not supported.
  • If parallel is set to a value greater than 1:
    • If binary is set to true, the setting of parallel becomes invalid and data is imported in serial mode.
    • OPTIONALLY ENCLOSED BY or FIELDS CSV cannot be set in the control file.
    • The SEQUENCE column cannot be set in the control file.
    • Data may not be imported in the sequence specified in the data file. If a table contains auto-increment columns, the sequence of values in the auto-increment columns may be different from that in the data file after the table is imported.
    • If the errors parameter is also used, the errors parameter indicates the maximum number of error lines allowed for each subtask.
    • If the skip parameter is also used, the skip parameter indicates the number of rows to be skipped at the beginning of the entire data file.
    • If the rows parameter is also used, the subtasks committed each batch are calculated independently.
    • When the CPU and memory resources of the client and the CPU, memory, idle threads, and network bandwidth of the server are not bottlenecks, and the total proportion of bad and discarded records does not exceed 1%, the performance is improved by at least 1.5, 3, or 5 times compared with serial import when the degree of parallelism is 2, 4, or 8, respectively.
    • Each time the degree of parallelism increases by 1, the required client memory increases by about 10 MB and the required server memory increases by about 35 MB.
  • If the GUC parameter support_zero_character is set to on, character 0x00 can be written to and read from the database. When gs_loader imports data, 0x00 is imported in the original style instead of being converted to 0x20 due to other compatibility parameters. Regarding the processing of 0x00, the parameter priority is as follows: support_zero_character > copy_special_character_version > compatible_illegal_chars > loader_support_nul_character.
  • When copy_special_character_version is set to 'no_error', its setting takes precedence over that of compatible_illegal_chars, and invalid characters are imported without conversion.
  • You are advised to use digits, letters, Chinese characters, and file path separators ('/') for parameter values; other characters may lead to shell command parsing anomalies.
  • CHARACTERSET

    Character set.

    Value range: a string.

    Note: The character set specified by CHARACTERSET in the control file must be the same as the encoding format of the file. Otherwise, an error is reported or garbled characters are displayed in the imported data.

  • INFILE

    This keyword has no effect and must occupy a separate line in the control file; it is ignored at run time. Specify the corresponding data file in the gs_loader command-line parameters instead.

  • BADFILE

    This keyword has no effect and is ignored at run time. If no .bad file is specified on the gs_loader command line, a .bad file is generated based on the name of the corresponding control file.

  • OPTIONS

    Only the skip and rows parameters take effect. skip=n indicates that the first n records are skipped during import, and rows=n indicates the number of rows to be imported before a commit. If both the command line and control file are specified, the command line has a higher priority.

  • INSERT | APPEND | REPLACE | TRUNCATE

    Import mode.

    INSERT: If the table contains data, an error is reported.

    APPEND: Data is inserted directly.

    REPLACE: If the table contains data, all data is deleted and then inserted.

    TRUNCATE: If the table contains data, all data is deleted and then inserted.

    • When writing a control file (.ctl), you can specify the import mode (INSERT | APPEND | REPLACE | TRUNCATE) before and after the INTO TABLE table_name statement. The priority is as follows: The import mode specified after the statement takes precedence over and overwrites that specified before the statement.
    • When multiple gs_loader sessions are started to concurrently import data to the same table, you are advised to use the APPEND mode. If you use the INSERT, REPLACE, or TRUNCATE mode, an import error may occur or the imported data may be incomplete.
  • FIELDS CSV

    Specifies that the CSV mode of COPY is used. In CSV mode, the default separator is a comma (,), and the default quotation mark is a double quotation mark (").

    • In the current CSV mode, newlines that are enclosed with double quotation marks are considered as part of the column data.
    • In CSV mode, if GUC parameter a_format_copy_version is set to 's1', spaces at the beginning of column values are skipped. If the first non-space character in a column is not an enclosed character, the enclosure setting is ignored. If no enclosed character is matched and the end of the line is matched first, an error is reported.
    • In CSV mode, if the GUC parameter support_zero_character is disabled and compatible_nul or compatible_illegal_chars is used for compatibility with the 0x00 character, a 0x00 character at the beginning of a column value is converted to 0x20 and then deleted, because the conversion from 0x00 to 0x20 occurs before the spaces at the beginning of column values are skipped.
  • table_name

    Specifies the name (possibly schema-qualified) of an existing table.

    Value range: an existing table name

  • TERMINATED [BY] { 'string' }

    The string that separates columns within each row (line) of the file, and it cannot be larger than 10 bytes.

    Value range: The value cannot include any of the following characters: \.abcdefghijklmnopqrstuvwxyz0123456789. The NUL character cannot be set as a separator.

    Default value: a tab character in text format and a comma in CSV format.

    After enabling NUL character compatibility (compatible_nul set to true), if the specified separator is a space character (0x20), note that the separator is the space character that exists in the data file instead of the space character converted from the NUL character.

  • OPTIONALLY ENCLOSED BY { 'string' }

    Specifies a quoted character string for a CSV file.

    The default value is double quotation marks (") only in CSV mode that is explicitly specified by the FIELDS CSV parameter.

    In other modes, there is no default value.

    • When you set OPTIONALLY ENCLOSED BY { 'string' }, there should be no quotation mark on the left of the data; otherwise, the number of quotation marks on either left or right must be an odd number but they do not have to be equal.
    • Currently, OPTIONALLY ENCLOSED BY { 'string' } is supported only in CSV mode. If OPTIONALLY ENCLOSED BY { 'string' } is specified, the system enters the CSV mode by default.
  • TRAILING NULLCOLS

    Specifies how to handle rows in which one or more trailing columns are missing from the source data file.

    If one or more columns at the end of a row are null, the columns are imported to the table as null values. If this parameter is not set, an error message is displayed, indicating that the error column is null. In this case, the data in this row is processed as an error.

  • WHEN { (start:end) | column_name } {= | !=} 'string'

    Filters rows by character string between start and end or by column name.

    Value range: a string.

    • When the GUC parameter enable_copy_when_filler is set to on (default value), data can be filtered based on the FILLER column. When the GUC parameter enable_copy_when_filler is set to off, this usage is not supported.
    • The WHEN condition cannot be followed by special characters such as '\0' and '\r'.
  • POSITION ({ start:end })

    Processes columns and obtains the corresponding character string between start and end.

    If POSITION is specified, the converted \COPY statement uses a fixed formatter, and newline characters in columns cannot be properly processed.

  • "sql_string"

    Processes columns and calculates column values based on column expressions. For details, see Column expression below.

    Value range: a string.

  • FILLER

    Processes columns. If FILLER occurs, this column is skipped.

    Currently, FILLER and POSITION ({ start:end }) cannot be used at the same time.

  • column_type [external]

    Processes the imported data according to different data types. For details, see Data types below.

  • CONSTANT

    Processes columns and sets the inserted columns to constants.

    Value range: a string.

  • SEQUENCE ( { COUNT | MAX | integer } [, incr] )

    Processes columns to generate the corresponding sequence values.

    • COUNT: The count starts based on the number of rows in the table.
    • MAX: The count starts from the maximum value of this column in the table.
    • integer: The count starts from the specified value.
    • incr: indicates the increment each time.
  • NULLIF

    If a_format_copy_version is set to 's1', returns NULL when the data in the specified columns contains only whitespace characters; otherwise, returns trim(COL), which is equivalent to the column expression "nullif(trim(COL), '')".

    If a_format_copy_version is not set to 's1', in multi-row import scenarios where no sysdate, constant, position, or column expression is specified after the column names, columns without NULLIF specified are set to null.

    Currently, only the COL POSITION() CHAR NULLIF (COL=BLANKS) syntax is supported. For details, see Example 2 and Example 3.

  • Column expression

    gs_loader supports expression conversion and scenario extension for specified columns.

    ({ column_name [ data_type ] [ AS transform_expr ] } [, ...])

    data_type specifies the data type of the column in the expression parameter. transform_expr specifies the target expression and returns the result value whose data type is the same as that of the target column in the table. For details, see Example 4 and Example 5.

  • Data types

    Corresponds to column_type [external] in the control file. During data loading, data is processed based on the data type. gs_loader classifies data types into common and special data types.

    • Common data types
      • CHAR [(length)]:

        Reads data based on a column separator and converts the value to the CHAR type. length indicates the maximum length of a single piece of data, in bytes. Generally, one character occupies one byte. The value can be left blank. The scenarios are as follows:

        • If a length is not declared, the value inherits the maximum length value declared by POSITION.
        • If a length is declared, it overwrites the maximum length declared by POSITION.
        • If neither length nor POSITION declares a length, the value is set based on the length between separators.
        • The priority of the length declaration is as follows: length > POSITION > separator.
        • If no length, POSITION, or delimiter is specified, data is read from the current position to the line terminator.
        • If the actual data length exceeds the maximum value declared by length, an error is reported.
      • INTEGER external [(length)]:

        Reads data based on a column separator and converts the value to the INTEGER type. The rules for using length are the same as those described in "CHAR [(length)]."

      • FLOAT external [(length)]:

        Reads data based on a column separator and converts the value to the FLOAT type. The rules for using length are the same as those described in "CHAR [(length)]."

      • DECIMAL external (length):

        Reads data based on a column separator and converts the value to the DECIMAL type. The rules for using length are the same as those described in "CHAR [(length)]."

      • TIMESTAMP:

        Reads data based on a column separator and converts the value to the TIMESTAMP type.

      • DATE:

        Reads data based on a column separator and converts the value to the DATE type.

      • DATEA:

        Reads data based on a column separator and converts the value to the DATEA type.

      • DATE external:

        Reads data based on a column separator and converts the value to the DATE type.

      • SYSDATE:

        Obtains the system time at which the corresponding insertion is performed in the database. This column cannot be referenced by other columns; a reference yields the literal string SYSDATE.

    • Special data types
      • INTEGER:

        Ignores the column separator, reads four-byte characters, saves them based on the little-endian storage logic, parses each character into a hexadecimal ASCII code value, and converts the value into a decimal number.

      • SMALLINT:

        Ignores the column separator, reads two-byte characters, saves them based on the little-endian storage logic, parses each character into a hexadecimal ASCII code value, and converts the value into a decimal number. For details, see Example 6.

      • RAW:

        Parses each character into an ASCII code value. The backslash (\) is not used as an escape character. For details, see Example 7.

        Restriction: Separators cannot be used in RAW data.

      • In the multi-column import scenario, if the GUC parameter is not specified, some positions and separators cannot be used at the same time.
      • In the multi-column import scenario, the SYSDATE and CONSTANT operations cannot be used together with the POSITION operation.
      • When importing data of a specified data type, you need to use guc_param to set a_format_copy_version for common data types and use guc_param to set a_format_copy_version, a_format_dev_version, and a_format_version for special data types.
      • If a column expression involves a system function, you need to use guc_param to set a_format_dev_version and a_format_version based on the corresponding function.
      • If the data type contains length, the value of length must be set to an integer greater than 0. The special data type RAW(length) is used differently from common types. For example, if POSITION is not specified for the common type INTEGER EXTERNAL(length), an error is reported when the value of length is less than the data length of the corresponding column in a text file (such as .csv or .txt). If the value of length is greater than the data length of the corresponding column in a text file (such as .txt), the result of the INTEGER EXTERNAL type is output. If POSITION is not specified for the special data type RAW(length), the first length characters are read.
      • If POSITION(start:end) is specified, the value of start must be set to an integer greater than 0, and the value of end must be greater than or equal to the value of start.
      • When POSITION is specified, spaces at the end of a column are not omitted when the column content is processed. If POSITION is not specified, spaces at the end of the column content will be omitted. To retain spaces, ensure that a_format_version has been set and add set behavior_compat_options='char_coerce_compat'; to the file specified by guc_param. For details, see "behavior_compat_options" in the Administrator Guide.
      • During concurrent import, if multiple names of files specified by discard or bad point to files with the same name in the same directory, gs_loader stops importing the next file and reports an error. If a previous file has been imported, the file will be overwritten.

        The following error is reported:

        ERROR: An error occurred. Please check logfile.

        In the log file:

        …lock failed: Resource temporarily unavailable…
      • If a column in the control file is given a value that does not use the data file content (for example, a constant-like expression), that column does not consume a field position in the data file.

        For example, the control file is as follows:

        Load Data
        TRUNCATE INTO TABLE gsloader
        fields terminated by ','
        TRAILING NULLCOLS(
        id "trim(:id)",
        text "to_char(SYSDATE,'yyyymmdd')",
        gmt_create  "trim(:gmt_create)",
        create_str "trim(:create_str)"
        )

        The data file is as follows:

        11,22,33,

        The import result is as follows:

        loader=# select * from gsloader;
         id |        text         | gmt_create | create_str
        ----+---------------------+------------+------------
         11 | 2023-02-08 16:00:54 | 22         | 33
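
As a combined sketch of the keywords described above (table and column names are illustrative): the following control file selects CSV mode with an explicit enclosure character, and the TRUNCATE placed after INTO TABLE overrides the INSERT placed before it, as described under the import modes.

LOAD DATA
INSERT INTO TABLE csv_tbl TRUNCATE
FIELDS CSV
TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
    id   integer external,
    name char(32)
)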

Examples

Example 1

  1. (If the separation of duties function is disabled) For common users only:

    1. Create a user (as an administrator).
      gaussdb=# CREATE USER load_user WITH PASSWORD '********';
    2. (Optional) Grant permission on the gs_copy_summary_log table to the user (as an administrator).

      If the GUC parameter enable_copy_error_log is not set (off by default) or is set to off, the log table is not used and you do not need to grant permissions on it. Otherwise, grant the permissions on the log table.

      gaussdb=# GRANT INSERT,SELECT,DELETE ON  pg_catalog.gs_copy_summary_log To load_user;
    3. (Optional) Grant permission on the gs_copy_error_log table to the user (as an administrator).

      If the GUC parameter enable_copy_error_log is not set (off by default) or is set to off, the error table is not used and you do not need to grant permissions on it. Otherwise, grant the permissions on the error table.

      gaussdb=# GRANT INSERT,SELECT,DELETE ON  pg_catalog.gs_copy_error_log To load_user;
    4. Switch to the new user.
      gaussdb=# \c - load_user

  2. (If the separation of duties function is enabled) For common users and administrators:

    1. Create a user (as the initial user).
      gaussdb=# CREATE USER load_user WITH PASSWORD '********';
    2. Switch to the load_user user (as the initial user).
      gaussdb=# \c - load_user
    3. (Optional) Create a gs_copy_summary table and add an index.

      If the GUC parameter enable_copy_error_log is not set (off by default) or is set to off, you do not need to use the log table and do not need to create it. Otherwise, you need to create the log table.

      gaussdb=# CREATE TABLE load_user.gs_copy_summary(relname varchar, begintime timestamptz, endtime timestamptz, id bigint, pid bigint, readrows bigint, skiprows bigint, loadrows bigint, errorrows bigint, whenrows bigint, allnullrows bigint, detail text);
      gaussdb=# CREATE INDEX gs_copy_summary_idx ON load_user.gs_copy_summary(id);
    4. (Optional) Create a pgxc_copy_error_log table and add an index.

      If the GUC parameter enable_copy_error_log is not set (off by default) or is set to off, you do not need to use the error table and do not need to create it. Otherwise, you need to create the error table.

      gaussdb=# CREATE TABLE load_user.pgxc_copy_error_log (relname varchar, begintime timestamptz, filename varchar, lineno int8, rawrecord text, detail text);
      gaussdb=# CREATE INDEX copy_error_log_relname_idx ON load_user.pgxc_copy_error_log(relname);

  3. Create a table and a control file, and prepare a data file.

    Create the loader_tbl table.

    gaussdb=# CREATE TABLE  loader_tbl
    (
        ID   NUMBER,
        NAME VARCHAR2(20),
        CON  VARCHAR2(20),
        DT   DATE
    );
    

    (On the gs_loader client) Create the control file loader.ctl.

    LOAD DATA
    truncate into table loader_tbl
    WHEN (2:2) = ',' 
    fields terminated by ','
    trailing nullcols
    (
        id integer external,
        name char(32),
        con ":id || '-' || :name",
        dt date
    )
    

    (On the gs_loader client) Create the GUC parameter file guc.txt.

    set a_format_copy_version='s1';
    

    (On the gs_loader client) Create the data file data.csv.

    1,OK,,2007-07-8
    2,OK,,2008-07-8
    3,OK,,2009-07-8
    4,OK,,2007-07-8
    43,DISCARD,,2007-07-8
    ,,,
    32,DISCARD,,2007-07-8
    a,ERROR int,,2007-07-8
    8,ERROR date,,2007-37-8
    ,,,,
     ,
    8,ERROR fields,,2007-37-8
    ,,,
    5,OK,,2021-07-30
    

  4. Import the data.

    (On the gs_loader client) Before importing data, ensure that you have the execute permission on the gs_loader tool and that the current path has the file write permission. (gs_loader generates some temporary files during processing and automatically deletes them after the import is complete.)

    gs_loader control=loader.ctl data=data.csv db=testdb bad=loader.bad guc_param=guc.txt errors=5 port=8000 passwd=******** user=load_user
    

    Execution result:

    gs_loader: version 0.1
     
     5 Rows successfully loaded. 
     
    log file is: 
     loader.log
    

gs_copy_summary records the invoked COPY syntax and its details. [badfile]_bad.log records the error data and details. To prevent the error data and details recorded during the last import from being overwritten, you are advised to use a different bad parameter for each import. To record error data and details in the error table instead, enable the GUC parameter enable_copy_error_log. To delete data from the error and log tables, use TRUNCATE or DELETE.
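
For example, to review the most recent import summary through the compatibility view (assuming the GUC parameter enable_copy_error_log is on so that the log table is populated):

gaussdb=# SELECT relname, loadrows, errorrows, whenrows, detail FROM pg_catalog.gs_copy_summary ORDER BY begintime DESC LIMIT 1;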

Example 2

If the data in the .bad file is empty, refer to the source file and row number recorded in the error table. (When the encoding of the raw record cannot be identified, its content is not written to the .bad file and only a blank line is recorded.)

-- Use the filename and line number (lineno) recorded in the error table to locate the source data in the original file.
loader=# select * from pgxc_copy_error_log;
       relname        |           begintime           | filename | lineno | rawrecord |                     detail                      
----------------------+-------------------------------+----------+--------+-----------+-------------------------------------------------
 public.test_gsloader | 2023-02-09 09:20:33.646843-05 | STDIN    |      1 |           | invalid byte sequence for encoding "UTF8": 0xb4
(1 row)

Example 3

NULLIF use cases:

-- Create the gsloader_test_nullif table.
gaussdb=# create table gsloader_test_nullif(
col1   varchar2(100) not null enable,
col2   number(5,0) not null enable,
col3   varchar2(200) not null enable,
col4   varchar2(34) not null enable,
col5   varchar2(750),
col6   number(20,0),
col7   varchar2(4000),
col8   varchar2(200)
);

-- View the data file test.csv.
6007 17060072021-09-0360070001102010000000230          1        600700010000218               0        1        1        229465        3
6007 17060072021-09-0360070001102010000000299          1        600700010000282               0        1        1        230467        3
6007 17060072021-09-0360070001102010000000242          1        600700010000255               0        1        1        226400        3
6007 17060072021-09-0360070001102010000000202          1        600700010000288               0        1        1        219107        3
6007 17060072021-09-0360070001102010000000294          1        600700010000243               0        1        1        204404        3
6007 17060072021-09-0360070001102010000000217          1        600700010000270               0        1        1        226644        3

-- View the control file test.ctl.
LOAD DATA  
CHARACTERSET UTF8
TRUNCATE
INTO TABLE gsloader_test_nullif
TRAILING NULLCOLS
(COL1 POSITION(1:10) CHAR NULLIF (COL1 = BLANKS),
COL2  POSITION(11:14) CHAR NULLIF (COL2 = BLANKS),
COL3  POSITION(21:30) CHAR NULLIF (COL3 = BLANKS),
COL4  POSITION(31:40) CHAR NULLIF (COL4 = BLANKS),
COL5  sysdate,
COL6,
COL7, 
COL8 POSITION(71:80) CHAR NULLIF (COL8 = BLANKS))

-- Import data.
gs_loader -p xxx host=xxx control=test.ctl  data=test.csv -d testdb -W xxx 

-- View the import result. The import is successful.
loader=# select * from gsloader_test_nullif;
    col1    | col2 |    col3    |    col4    |        col5         | col6 | col7 |   col8
------------+------+------------+------------+---------------------+------+------+-----------
 6007 17060 |  720 | 0360070001 | 1020100000 | 2023-05-17 20:32:03 |      |      | 010000218
 6007 17060 |  720 | 0360070001 | 1020100000 | 2023-05-17 20:32:03 |      |      | 010000282
 6007 17060 |  720 | 0360070001 | 1020100000 | 2023-05-17 20:32:03 |      |      | 010000255
 6007 17060 |  720 | 0360070001 | 1020100000 | 2023-05-17 20:32:03 |      |      | 010000288
 6007 17060 |  720 | 0360070001 | 1020100000 | 2023-05-17 20:32:03 |      |      | 010000243
 6007 17060 |  720 | 0360070001 | 1020100000 | 2023-05-17 20:32:03 |      |      | 010000270
(6 rows)

As the imported data shows, after the NULLIF keyword is used, the columns with NULLIF or sysdate processing are imported normally, while the columns without any specified processing (col6 and col7) are empty.

Example 4

The column type is not specified in the .ctl file, and the source data does not meet the column restrictions (data type and length restrictions) in the table.

-- Create a table.
gaussdb=# create table t_test(id int, text varchar(5));

-- View the data file test.csv.
addf2,bbbbaaa,20220907,

-- View the control file test.ctl.
Load Data
TRUNCATE INTO TABLE t_test
fields terminated by ','
TRAILING NULLCOLS(
id "length(trim(:id))",
text "replace(trim(:text),'bbbb','aa')"
)

-- Use guc_param to set the a_format_copy_version parameter.
cat test_guc.txt
set a_format_copy_version='s1';

-- Import data.
gs_loader -p xxx host=xxx control=test.ctl  data=test.csv -d testdb -W xxx guc_param=test_guc.txt

-- View the import result. The import is successful.
gaussdb=# select * from t_test;
 id | text  
----+-------
  5 | aaaaa
(1 row)

Example 5

The column type is not specified in the .ctl file, and the implicit type conversion is performed. (You are advised to add compatibility parameters because the implicit type conversion is involved.)

-- Create a table.
gaussdb=# create table test(mes int, mes1 text, mes2 float8, mes3 timestamp with time zone, mes4 INTEGER);

-- View the data file.
cat load_support_transform.data
1,mmoo,12.6789,Thu Jan 01 15:04:28 1970 PST,32767
2,yyds,180.883,Thu Jun 21 19:00:00 2012 PDT,32768

-- View the control file.
cat load_support_transform.ctl 
Load Data
TRUNCATE INTO TABLE test
fields terminated by ','
TRAILING NULLCOLS(
mes,
mes1 "mes1 || mes2",
mes2 "mes2 + 1",
mes3 "date_trunc('year', mes3)",
mes4
)

-- Use guc_param to set the required compatibility parameters.
cat test_guc.txt
set a_format_copy_version='s1';
set a_format_dev_version='s2';
set a_format_version='10c';

-- Import data.
gs_loader -p xxx host=xxx control=load_support_transform.ctl data=load_support_transform.data -d testdb -W xxx guc_param=test_guc.txt 

-- View the import result. The import is successful.
gaussdb=# select * from test;
 mes |    mes1     |  mes2   |          mes3          | mes4  
-----+-------------+---------+------------------------+-------
   1 | mmoo12.6789 | 13.6789 | 1970-01-01 00:00:00+08 | 32767
   2 | yyds180.883 | 181.883 | 2012-01-01 00:00:00+08 | 32768

Example 6

-- Create a table.
gaussdb=# create table t_spec(col1 varchar(10), col2 varchar(10));

-- View the data file.
cat t_spec.txt
1234,5678,

-- View the control file.
cat t_spec.ctl
Load Data
TRUNCATE INTO TABLE t_spec
fields terminated by ','
TRAILING NULLCOLS(
col1 position(2:6) integer,
col2 position(5:8) smallint
)

-- Use guc_param to set the required compatibility parameters.
cat test_guc.txt
set a_format_copy_version='s1';
set a_format_dev_version='s2';
set a_format_version='10c';

-- Import data.
gs_loader -p xxx host=xxx control=t_spec.ctl data=t_spec.txt -d testdb -W xxx guc_param=test_guc.txt

-- View the import result. The import is successful.
gaussdb=# select * from t_spec;
   col1    | col2
-----------+-------
 741618482 | 13612
(1 row)

Example 7

-- Create a table.
gaussdb=# create table t_raw(col raw(50));

-- View the data file.
cat t_raw.txt
12\n\x78!<~?'k^(%s)>/c[$50]

-- View the control file.
cat t_raw.ctl
Load Data
TRUNCATE INTO TABLE t_raw
TRAILING NULLCOLS(
col position(1:50) raw
)

-- Use guc_param to set the required compatibility parameters.
cat test_guc.txt
set a_format_copy_version='s1';
set a_format_dev_version='s2';
set a_format_version='10c';

-- Import data.
gs_loader -p xxx host=xxx control=t_raw.ctl data=t_raw.txt -d testdb -W xxx guc_param=test_guc.txt

-- View the import result. The import is successful.
gaussdb=# select * from t_raw;
                          col
--------------------------------------------------------
 31325C6E5C783738213C7E3F276B5E282573293E2F635B2435305D
(1 row)