Updated on 2025-09-04 GMT+08:00

Guide to Importing Erroneous Data

The error tolerance mechanism during data import provides two modes.

Intelligent Correction Mode (Adaptive Import)

  • Principle: Prioritize data integrity and apply intelligent correction so that as much accurate data as possible is imported.
  • Scenario: Apply this mode when the imported data contains column count exceptions (extra columns) or character exceptions.
  • Procedure:
    1. Rectify the column count exceptions by truncating extra columns.
    2. Transcode the character sets and clean illegal characters.
    3. Write all data into the target database.
  • Output: corrected data set, along with logs recorded by GaussDB for character exceptions
  • Example:
    1. Extra columns: The data file contains more columns than the target table. In this case, specify the ignore_extra_data option in the COPY statement; GaussDB then imports data from the correct number of columns into the table and discards the extra columns.
      gaussdb=# create database db1 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'ORA';
      CREATE DATABASE
      gaussdb=# \c db1
      Non-SSL connection (SSL connection is recommended when requiring high-security)
      You are now connected to database "db1" as user "omm".
      db1=# create table test_copy(id int, content text);
      NOTICE:  The 'DISTRIBUTE BY' clause is not specified. Using 'id' as the distribution column by default.
      HINT:  Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column.
      CREATE TABLE
      db1=# copy test_copy from stdin delimiter ',';
      Enter data to be copied followed by a newline.
      End with a backslash and a period on a line by itself.
      >> 1,Import,Export
      >> \.
      ERROR:  extra data after last expected column
      CONTEXT:  COPY test_copy, line 1: "1,Import,Export"
      -- When ignore_extra_data is not specified, the import fails. However, specifying it ensures a successful import.
      db1=# copy test_copy from stdin delimiter ',' ignore_extra_data;
      Enter data to be copied followed by a newline.
      End with a backslash and a period on a line by itself.
      >> 1,Import,Export
      >> \.
      COPY 1
      db1=# select * from test_copy;
       id | content 
      ----+---------
        1 | Import
      (1 row)
    2. Character exceptions: Handle character exceptions based on whether the server-side encoding matches the client-side encoding.
      • Consistent encoding: If the data being imported contains characters in an illegal encoding format, consider setting the GUC parameter copy_special_character_version to 'no_error' for error tolerance. With this setting, GaussDB accepts data that does not comply with the encoding format and, instead of reporting an error, inserts it into the table in its original encoding.

        The example uses the data file /home/xy/encodings.txt.utf8 generated in Guide to Exporting Erroneous Data. To simulate importing a file with encoding exceptions without transcoding, create a database with UTF-8 encoding.

        gaussdb=# create database db_utf8 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'ORA';
        CREATE DATABASE
        gaussdb=# \c db_utf8
        Non-SSL connection (SSL connection is recommended when requiring high-security)
        You are now connected to database "db_utf8" as user "omm".
        db_utf8=# create table test_encodings(id int, content text);
        NOTICE:  The 'DISTRIBUTE BY' clause is not specified. Using 'id' as the distribution column by default.
        HINT:  Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column.
        CREATE TABLE
        db_utf8=# show copy_special_character_version;
         copy_special_character_version 
        --------------------------------
        
        (1 row)
        db_utf8=# copy test_encodings from '/home/xy/encodings.txt.utf8';
        ERROR:  invalid byte sequence for encoding "UTF8": 0x97
        CONTEXT:  COPY test_encodings, line 2
        db_utf8=# set copy_special_character_version = 'no_error';
        SET
        db_utf8=# copy test_encodings from '/home/xy/encodings.txt.utf8';
        COPY 2
        db_utf8=# select * from test_encodings;
         id |  content  
        ----+-----------
          1 | Import
          2 | "Import/Export
        (2 rows)
        
      • Inconsistent encoding: Transcoding is required when the server-side encoding does not match the client-side encoding. COPY performs transcoding through the compatible_illegal_chars parameter: illegal characters encountered during import are converted by the error tolerance mechanism, and the resulting characters are stored in GaussDB. The import proceeds without errors or interruptions, which keeps data import efficient and stable even in complex environments with inconsistent encoding.
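The two correction steps of this mode (truncating extra columns and tolerating illegal characters) can be sketched outside the database as follows. This is a simplified Python model of the behavior, not GaussDB code; the function name correct_row is illustrative, and decoding with errors="replace" is only a loose stand-in for GaussDB's encoding tolerance (copy_special_character_version keeps the original bytes rather than substituting them).

```python
def correct_row(fields, num_cols, encoding="utf-8"):
    """Illustrative sketch of the intelligent correction steps:
    1. truncate extra columns, as ignore_extra_data does;
    2. tolerate bytes that are illegal in the target encoding,
       loosely analogous to copy_special_character_version='no_error'.
    """
    # Step 1: keep only as many fields as the table has columns.
    fields = fields[:num_cols]

    corrected = []
    for f in fields:
        # Step 2: decode, substituting illegal byte sequences instead
        # of failing, so the row can still be imported.
        corrected.append(f.decode(encoding, errors="replace"))
    return corrected

# A row with one extra column: the extra field is dropped, not rejected.
print(correct_row([b"1", b"Import", b"Export"], num_cols=2))  # ['1', 'Import']

# A row containing the illegal UTF-8 byte 0x97 is still imported,
# with the bad byte substituted instead of raising an error.
print(correct_row([b"1", b"\x97Import"], num_cols=2))
```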

Strict Verification Mode (Precise Import)

  • Principle: Prioritize data accuracy and ensure the standardization of imported data.
  • Scenario: Apply this mode in fields that demand strict data accuracy, particularly for sensitive data like medical records and financial transactions. This mode helps alleviate concerns about intelligent correction potentially compromising data accuracy.
  • Procedure:
    1. Conduct a series of verifications at multiple levels (column count exceptions, character exceptions, data type conversion exceptions, and constraint conflicts).
    2. Generate an error diagnosis report (which includes row numbers, error types, and error data).
    3. Create an erroneous data isolation area.
    4. Only import the original data that successfully passes these verifications into the database.
  • Output: clean data set and error details report (for further details, see pg_catalog.pgxc_copy_error_log)
  • Error tolerance scope: covers all the exceptions handled by the intelligent correction mode, including extra columns, data type conversion errors, overlong columns, and transcoding exceptions. The process is as follows:
    -- If the number of data type errors during the import does not exceed 100, no error is reported and GaussDB proceeds to the next row; the details and row number of each erroneous row are recorded in the gs_copy_error_log table. Once the count exceeds 100, an error is reported.
    gaussdb=# copy test_copy from '/home/omm/temp/test.csv' log errors reject limit '100' csv;
    -- Compared with the previous statement, the next one additionally records the complete data of each erroneous row in gs_copy_error_log. This is recommended when there is no data security risk. Note that SYSADMIN permissions are required.
    gaussdb=# copy test_copy from '/home/omm/temp/test.csv' log errors data reject limit '100' csv;
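The LOG ERRORS ... REJECT LIMIT semantics shown above can be modeled in a few lines. This is a hedged sketch, not GaussDB internals: copy_with_reject_limit and parse_int_row are hypothetical names, and the in-memory error log stands in for gs_copy_error_log.

```python
def copy_with_reject_limit(rows, parse, reject_limit=100):
    """Illustrative model of LOG ERRORS ... REJECT LIMIT: rows that
    fail to parse are logged and skipped; once the number of errors
    exceeds the limit, the whole import aborts."""
    imported, error_log = [], []
    for lineno, raw in enumerate(rows, start=1):
        try:
            imported.append(parse(raw))
        except ValueError as exc:
            # Record the row number, error type, and raw data,
            # as the error table does.
            error_log.append((lineno, str(exc), raw))
            if len(error_log) > reject_limit:
                raise RuntimeError(
                    f"reject limit {reject_limit} exceeded at line {lineno}")
    return imported, error_log

def parse_int_row(raw):
    ident, content = raw.split(",", 1)
    return int(ident), content  # raises ValueError on a type error

rows = ["1,Import", "oops,Export", "2,Export"]
data, errors = copy_with_reject_limit(rows, parse_int_row, reject_limit=100)
print(data)    # [(1, 'Import'), (2, 'Export')]
print(errors)  # one logged error, from line 2
```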
    

Conclusion

To optimize data import, you can combine the intelligent correction mode with the strict verification mode, with intelligent correction taking priority. When intelligent correction is enabled during a COPY import and it corrects a data item, strict verification is not triggered for that row: the error table records nothing for it, and the reject limit counter is not updated. Before importing data, carefully evaluate whether column and character exceptions should be corrected automatically or discarded, depending on your specific requirements.
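The interaction between the two modes can be sketched as follows. This is a simplified Python model under the assumptions stated in the comments; import_row and strict_check are hypothetical names, not GaussDB APIs.

```python
def strict_check(fields, num_cols):
    """Strict verification (simplified): reject rows that fail checks."""
    if len(fields) != num_cols:
        raise ValueError("column count mismatch")
    return fields

def import_row(fields, num_cols, error_table):
    # Intelligent correction runs first: a row it can fix (extra
    # columns truncated) bypasses strict verification entirely, so
    # nothing is written to the error table for that row.
    if len(fields) > num_cols:
        return fields[:num_cols]
    try:
        return strict_check(fields, num_cols)
    except ValueError as exc:
        error_table.append((fields, str(exc)))
        return None

errors = []
# Corrected row: imported, and the error table stays empty.
print(import_row(["1", "Import", "Export"], 2, errors))  # ['1', 'Import']
# Uncorrectable row: rejected and logged by strict verification.
print(import_row(["1"], 2, errors))  # None
print(len(errors))  # 1
```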