Guide to Importing Erroneous Data

The error tolerance mechanism during data import provides two modes.

Intelligent Correction Mode (Adaptive Import)

Principle: Prioritize data integrity and leverage intelligent correction to ensure that the highest possible volume of accurate data is imported.
Scenario: Apply this mode when the imported data contains column count exceptions (extra columns) or character exceptions.
Procedure:
1. Rectify the column count exceptions by truncating extra columns.
2. Transcode the character sets and clean illegal characters.
3. Write all data into the target database.
Output: corrected data set, along with logs recorded by GaussDB for character exceptions

Strict Verification Mode (Precise Import)

Principle: Prioritize data accuracy and enforce strict compliance when importing data into the database.
Scenario: Apply this mode in fields that demand strict data accuracy, particularly for sensitive data like medical records and financial transactions. This mode helps alleviate concerns about intelligent correction potentially compromising data accuracy.
Procedure:
1. Conduct a series of verifications at multiple levels (column count exceptions, character exceptions, data type conversion exceptions, and constraint conflicts).
2. Generate an error diagnosis report (which includes row numbers, error types, and error data).
3. Create an erroneous data isolation area.
4. Only import the original data that successfully passes these verifications into the database.
Output: pure data set and error details report (further details available in gs_copy_error_log and gs_copy_summary)

Error tolerance level: applies to all the exceptions processed by the intelligent correction mode, including extra, missing, or deprecated columns, data type conversion errors, overlong columns, and transcoding exceptions. The process is as follows:

      
           -- When the number of data type errors during data import does not exceed 100, no error will be reported, and GaussDB will proceed to the next row. However, if the number exceeds 100, an error will be reported. The details and row number of the erroneous data will be recorded in the gs_copy_error_log table.
gaussdb=#COPY test_copy FROM '/home/omm/temp/test.csv' log errors reject LIMIT '100' CSV;
-- In comparison to the previous statement, the next one includes an additional action: recording all data from the erroneous row in gs_copy_error_log. This action is recommended when there is no risk to data security. Note that SYSADMIN permissions are required.
gaussdb=#copy test_copy from '/home/omm/temp/test.csv' log errors data reject limit '100' csv;

Conclusion

To optimize data import processes, you can combine the intelligent correction mode with the strict verification mode, with the intelligent correction mode taking priority. When intelligent correction is enabled during a COPY import and it corrects a data item, strict verification will not be triggered for that specific row. Consequently, the error table will not record relevant data, nor will it update reject limit. Before importing data, carefully evaluate whether to automatically correct column and character exceptions or discard them, depending on specific requirements.

Parent topic: Best Practices for Import and Export Using COPY

Previous topic: Guide to Exporting Erroneous Data

Next topic: Best Practices for Import and Export Using Tools