Guide to Importing Erroneous Data

The error tolerance mechanism during data import provides two modes.

Intelligent Correction Mode (Adaptive Import)

Principle: Prioritize data integrity and leverage intelligent correction to ensure that the highest possible volume of accurate data is imported.
Scenario: Apply this mode when the imported data contains column count exceptions, including extra, missing, or deprecated columns, as well as abnormal characters.
Procedure:
1. Rectify the column count exceptions by truncating extra columns, padding missing columns, and discarding deprecated columns.
2. Transcode the character sets and clean illegal characters.
3. Write all data into the target database.
Output: corrected data set, along with logs recorded by GaussDB for character exceptions

Example:

Extra columns: The number of data columns is greater than that of table columns. In this situation, you can specify the ignore_extra_data option in the COPY statement, and GaussDB will import data from the correct number of columns into the table while discarding the extra columns.

gaussdb=#create database db1 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'A';
CREATE DATABASE
gaussdb=#\c db1
Non-SSL connection (SSL connection is recommended when requiring high-security)
You are now connected to database "db1" as user "omm".
db1=# create table test_copy(id int, content text);
CREATE TABLE
db1=# copy test_copy from stdin delimiter ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1,Import,Export
>>\. 
ERROR:  extra data after last expected column
CONTEXT:  COPY test_copy, line 1: "1,Import,Export"
-- When ignore_extra_data is not specified, the import fails. However, specifying it ensures a successful import.
db1=# copy test_copy from stdin delimiter ',' ignore_extra_data;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1,Import,Export
>> \.
COPY 1
db1=# select * from test_copy;
 id | content 
----+---------
  1 | Import
(1 row)

Missing columns: The number of data columns is less than that of table columns. In this situation, you can specify the fill_missing_fields option in the COPY statement, and GaussDB will import data into the matched table columns. If columns do not match any of the imported data, they will be padded with their default values if available. However, if no default values are available, they will be padded with NULL instead.

        
         
           
           
             gaussdb=#create database db1 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'A';
CREATE DATABASE
gaussdb=#\c db1
db1=# create table test_copy(id int, content text, comment1 text, comment2 text);
CREATE TABLE
db1=# copy test_copy from stdin delimiter ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1,Import
>> \.
ERROR:  missing data for column "comment1"
CONTEXT:  COPY test_copy, line 1: "1,Import"
-- When fill_missing_fields is not specified, the import fails. However, specifying it ensures a successful import.
db1=# copy test_copy from stdin delimiter ',' fill_missing_fields 'multi';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1,Import
>> \.
COPY 1
db1=# select * from test_copy;
 id | content | comment1 | comment2 
----+---------+----------+----------
  1 | Import    |          | 
(1 row)

            

          

        
       

Deprecated columns: Some data columns do not need to be imported into the database. In this situation, you can specify the FILLER option in the COPY statement, and GaussDB will skip the specified columns. If the corresponding columns are present in the database, they will be padded with their default values if available. However, if no default values are available, they will be padded with NULL instead.

        
         
           
           
             gaussdb=#create database db1 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'A';
CREATE DATABASE
gaussdb=#\c db1
db1=# create table test_copy(id int, comment text default 'Export', content text);
CREATE TABLE
db1=# copy test_copy from stdin delimiter ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>>1,No need to import,Import
>\.    
COPY 1
db1=# select * from test_copy;
 id |  comment   | content 
----+------------+---------
  1 | No need to import | Import
(1 row)
-- When the filler(comment) option is not specified, the content in the second column will be imported into the comment column. However, when the option is specified, the content will not be imported. In this example, the option is specified, and the default value will be inserted into the comment column.
db1=# copy test_copy(id,comment,content) from stdin delimiter ',' filler (comment);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1,No need to import,Import
>> \.
COPY 1
db1=# select * from test_copy;
 id |  comment   | content 
----+------------+---------
  1 | No need to import | Import
  1 | Export       | Import
(2 rows)

            

          

        
       

Character exceptions: When dealing with character exceptions, take appropriate actions based on the consistency of server-side encoding with client-side encoding.

Consistent encoding: If the data being imported contains any characters encoded in an illegal format or zero characters, consider setting the GUC parameter copy_special_character_version to 'no_error' for error tolerance. In this setting, GaussDB will accept data that does not comply with the encoding format. Instead of reporting an error, it will directly insert the data into the table according to the original encoding format.

For details, see the example data file, /home/xy/encodings.txt.utf8, generated in Guide to Exporting Erroneous Data. To simulate the scenario where files with encoding exceptions are not transcoded during import, you can create a database with UTF-8 encoding.

          
           
             
             
               gaussdb=#create database db_utf8 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'A';
CREATE DATABASE
gaussdb=#\c db_utf8
Non-SSL connection (SSL connection is recommended when requiring high-security)
You are now connected to database "db_utf8" as user "omm".
db_utf8=# create table test_encodings(id int, content text);
CREATE TABLE
db_utf8=# show copy_special_character_version;
 copy_special_character_version 
--------------------------------

(1 row)
db_utf8=# copy test_encodings from '/home/omm/temp/encodings.txt.utf8';
ERROR:  invalid byte sequence for encoding "UTF8": 0x97
CONTEXT:  COPY test_encodings, line 2
db_utf8=# set copy_special_character_version = 'no_error';
SET
db_utf8=# copy test_encodings from '/home/xy/encodings.txt.utf8';
COPY 2
db_utf8=# select * from test_encodings;
 id |  content  
----+-----------
  1 | Import
  2 | "Import/Export
(2 rows)

              

            

          
         

Inconsistent encoding: Transcoding is required if the server-side encoding does not match the client-side encoding. COPY implements transcoding through the compatible_illegal_chars parameter. If illegal characters are imported into GaussDB, the error tolerance mechanism will convert them and store the resulting characters in GaussDB. The entire import process will proceed without any errors or interruptions. This ensures efficient and stable data import, even in complex environments with inconsistent encoding.

Strict Verification Mode (Precise Import)

Principle: Prioritize data accuracy and enforce strict compliance when importing data into the database.
Scenario: Apply this mode in fields that demand strict data accuracy, particularly for sensitive data like medical records and financial transactions. This mode helps alleviate concerns about intelligent correction potentially compromising data accuracy.
Procedure:
1. Conduct a series of verifications at multiple levels (column count exceptions, character exceptions, data type conversion exceptions, and constraint conflicts).
2. Generate an error diagnosis report (which includes row numbers, error types, and error data).
3. Create an erroneous data isolation area.
4. Only import the original data that successfully passes these verifications into the database.
Output: pure data set and error details report (further details available in gs_copy_error_log and gs_copy_summary)

Error tolerance levels:

Level 1 error tolerance: applies to all the exceptions processed by the intelligent correction mode, including extra data source columns, data type conversion errors, overlong columns, and transcoding exceptions. The process is as follows:

        
             -- When the number of data type errors during data import does not exceed 100, no error will be reported, and GaussDB will proceed to the next row. However, if the number exceeds 100, an error will be reported. The details and row number of the erroneous data will be recorded in the gs_copy_error_log table.
gaussdb=#copy test_copy from '/home/omm/temp/test.csv' log errors reject limit '100' csv;
-- In comparison to the previous statement, the next one includes an additional action: recording all data from the erroneous row in gs_copy_error_log. This action is recommended when there is no risk to data security. Note that SYSADMIN permissions are required.
gaussdb=#copy test_copy from '/home/omm/temp/test.csv' log errors data reject limit '100' csv;

Level 2 error tolerance: Building on Level 1, Level 2 extends support to address constraint conflicts in data, including NOT NULL constraints, conditional constraints, PRIMARY KEY constraints, UNIQUE constraints, and unique index constraints. The process is as follows:

        
             -- To support a new error type, constraint conflict, while keeping the same COPY statement and error tolerance logic as Level 1, set the GUC parameter as follows:
gaussdb=#set a_format_load_with_constraints_violation = 's2';
gaussdb=#copy test_copy from '/home/omm/temp/test.csv' log errors data reject limit '100' csv;

Example:

Level 1 error tolerance:

        
         
           
           
             gaussdb=#create database db1 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'A';
CREATE DATABASE
gaussdb=#\c db1
Non-SSL connection (SSL connection is recommended when requiring high-security)
You are now connected to database "db1" as user "omm".
db1=# create table t1(a int primary key, b char(2) not null, c text check(length(c) < 2), comment text);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "t1_pkey" for table "t1"
CREATE TABLE
db1=# copy t1 from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1	a	a	Expected normal import
>> a	b		Type error in column 1, Level 1 error tolerance, Record in error table
>> 3	ccc		Overlong column 2, Level 1 error tolerance, Record in error table
>> 4	d	d	Extra column 5, Level 1 error tolerance, Record in error table	e
>> \.
ERROR:  invalid input syntax for integer: "a"
CONTEXT:  COPY t1, line 2, column a: "a"
db1=# copy t1 from stdin log errors data reject limit '100';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1	a	a	Expected normal import
>> a	b		Type error in column 1, Level 1 error tolerance, Record in error table
>> 3	ccc		Overlong column 2, Level 1 error tolerance, Record in error table
>> 4	d	d	Extra column 5, Level 1 error tolerance, Record in error table	e
>> \.
db1=# select * from t1;
 a | b  | c |   comment    
---+----+---+--------------
 1 | a  | a | Expected normal import
(1 row)

db1=# select * from pgxc_copy_error_log;
  relname  |           begintime           | filename | lineno |                             rawrecord                             |       
         detail                 
-----------+-------------------------------+----------+--------+-------------------------------------------------------------------+-------
--------------------------------
 public.t1 | 2025-03-21 18:57:34.921818+08 | STDIN    |      2 | a       b               Type error in column 1, Level 1 error tolerance, Record in error table     | invali
d input syntax for integer: "a"
 public.t1 | 2025-03-21 18:57:34.921818+08 | STDIN    |      3 | 3       ccc             Overlong column 2, Level 1 error tolerance, Record in error table     | value 
too long for type character(2)
 public.t1 | 2025-03-21 18:57:34.921818+08 | STDIN    |      4 | 4       d       d       Extra column 5, Level 1 error tolerance, Record in error table     e | extra 
data after last expected column
(3 rows)

            

          

        
       

Level 2 error tolerance:

gaussdb=#create database db1 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'A';
CREATE DATABASE
gaussdb=#\c db1
Non-SSL connection (SSL connection is recommended when requiring high-security)
You are now connected to database "db1" as user "omm".
db1=# create table t1(a int primary key, b char(2) not null, c text check(length(c) < 2), comment text);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "t1_pkey" for table "t1"
CREATE TABLE
db1=# copy t1 from stdin log errors data reject limit '100';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1	a	a	Expected normal import
>> a	b		Type error in column 1, Level 1 error tolerance, Record in error table
>> 3	ccc		Overlong column 2, Level 1 error tolerance, Record in error table
>> 4	d	d	Extra column 5, Level 1 error tolerance, Record in error table	e
>> 1	a		PRIMARY KEY constraint conflict in column 1, Level 2 error tolerance, Record in error table
>> 1			NOT NULL constraint conflict in column 2, Level 2 error tolerance, Record in error table
>> 1	a	aaa	CHECK constraint conflict in column 3, Level 2 error tolerance, Record in error table
>> \.
ERROR:  The null value in column "b" violates the not-null constraint.
DETAIL:  Failing row contains (1, null, null, NOT NULL constraint conflict in column 2, Level 2 error tolerance, Record in error table).
CONTEXT:  COPY t1, line 6: "1			NOT NULL constraint conflict in column 2, Level 2 error tolerance, Record in error table"
db1=# set a_format_load_with_constraints_violation = 's2';
SET
db1=# copy t1 from stdin log errors data reject limit '100';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1	a	a	Expected normal import
>> a	b		Type error in column 1, Level 1 error tolerance, Record in error table
>> 3	ccc		Overlong column 2, Level 1 error tolerance, Record in error table
>> 4	d	d	Extra column 5, Level 1 error tolerance, Record in error table	e
>> 1	a		PRIMARY KEY constraint conflict in column 1, Level 2 error tolerance, Record in error table
>> 1			NOT NULL constraint conflict in column 2, Level 2 error tolerance, Record in error table
>> 1	a	aaa	CHECK constraint conflict in column 3, Level 2 error tolerance, Record in error table
>> \.
db1=# select * from t1;
 a | b  | c |   comment    
---+----+---+--------------
 1 | a  | a | Expected normal import
(1 row)

db1=# select * from pgxc_copy_error_log;
  relname  |           begintime           | filename | lineno |                             rawrecord                              |      
                        detail                              
-----------+-------------------------------+----------+--------+--------------------------------------------------------------------+------
------------------------------------------------------------
 public.t1 | 2025-03-21 18:57:34.921818+08 | STDIN    |      2 | a       b               Type error in column 1, Level 2 error tolerance, Record in error table      | inval
id input syntax for integer: "a"
 public.t1 | 2025-03-21 18:57:34.921818+08 | STDIN    |      3 | 3       ccc             Overlong column 2, Level 2 error tolerance, Record in error table      | value
 too long for type character(2)
 public.t1 | 2025-03-21 18:57:34.921818+08 | STDIN    |      4 | 4       d       d       Extra column 5, Level 2 error tolerance, Record in error table     e  | extra
 data after last expected column
 public.t1 | 2025-03-21 19:04:58.041069+08 | STDIN    |      2 | a       b               Type error in column 1, Level 1 error tolerance, Record in error table      | inval
id input syntax for integer: "a"
 public.t1 | 2025-03-21 19:04:58.041069+08 | STDIN    |      3 | 3       ccc             Overlong column 2, Level 1 error tolerance, Record in error table      | value
 too long for type character(2)
 public.t1 | 2025-03-21 19:04:58.041069+08 | STDIN    |      4 | 4       d       d       Extra column 5, Level 1 error tolerance, Record in error table     e  | extra
 data after last expected column
 public.t1 | 2025-03-21 19:04:58.041069+08 | STDIN    |      6 | 1                       NOT NULL constraint conflict in column 2, Level 2 error tolerance, Record in error table  | The n
ull value in column "b" violates the not-null constraint.
 public.t1 | 2025-03-21 19:04:58.041069+08 | STDIN    |      5 | 1                       NOT NULL constraint conflict in column 2, Level 2 error tolerance, Record in error table  | Dupli
cate key value violates unique constraint "t1".
 public.t1 | 2025-03-21 19:04:58.041069+08 | STDIN    |      7 | 1       a       aaa     CHECK constraint conflict in column 3, Level 2 error tolerance, Record in error table | New r
ow in relation "t1" violates check constraint "t1_c_check".
(9 rows)

Conclusion

To optimize data import processes, you can combine the intelligent correction mode with the strict verification mode, with the intelligent correction mode taking priority. When intelligent correction is enabled during a COPY import and it corrects a data item, strict verification will not be triggered for that specific row. Consequently, the error table will not record relevant data, nor will it update reject limit. Before importing data, carefully evaluate whether to automatically correct column and character exceptions or discard them, depending on specific requirements.

For strict verification mode, Level 1 is recommended by default, as it effectively identifies the most common errors without compromising import performance. However, if you are operating in centralized A-compatible mode, you can consider Level 2. Be mindful that Level 2 can degrade import performance and consume additional memory resources. Therefore, do not enable Level 2 by default and enable it only when there are constraint conflicts in your data. For more details, refer to the error tolerance capabilities of COPY.

Parent topic: Best Practices for Import and Export Using COPY

Previous topic: Guide to Exporting Erroneous Data

Next topic: Best Practices for Import and Export Using Tools