Updated on 2025-09-04 GMT+08:00

Guide to Exporting Erroneous Data

Data errors during export typically occur when character strings or binary data that do not match the server-side encoding have been inserted into the database. To avoid such errors, you are advised to keep the client-side encoding consistent with the server-side encoding, which eliminates the need for validity checks against the server-side encoding and for data transcoding.

Encoding Consistency Principle During Export

  1. When the client-side encoding is consistent with the server-side encoding:
    • The native data is exported as is.
    • Data integrity and originality are guaranteed.
    • Character set conversion is not required.
  2. When the client-side encoding is inconsistent with the server-side encoding:
    • The client-side encoding is used as the target encoding of the exported files.
    • The kernel first checks the existing data against the server-side encoding. If it detects any illegally encoded data, it reports an error.
    • The kernel then transcodes the data. If it encounters any characters that cannot be transcoded (code points that exist in the source character set but not in the target character set), it reports an error. Both cases are illustrated in the sketch after this list.
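
The following is a minimal sketch of the two cases above. It assumes the server encoding is UTF8 and that a table named t_demo exists and contains one illegally encoded row; the table name, file paths, and encodings are placeholders for your own environment.

    -- Case 1: the client-side encoding matches the server-side encoding.
    -- The data is exported as is, with no validity check and no transcoding.
    gaussdb=# show server_encoding;
    gaussdb=# set client_encoding = 'UTF8';
    gaussdb=# COPY t_demo TO '/data/t_demo.csv' CSV;

    -- Case 2: the client-side encoding differs from the server-side encoding.
    -- The kernel validates the data against the server-side encoding and then transcodes it,
    -- so illegally encoded bytes cause an error similar to the following.
    gaussdb=# set client_encoding = 'GB18030';
    gaussdb=# COPY t_demo TO '/data/t_demo_gb18030.csv' CSV;
    ERROR:  invalid byte sequence for encoding "UTF8": 0x97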

Solutions to Illegal Encoding

If your database contains any data encoded in an illegal format and you wish to export the data without triggering an error, consider the following methods:

Preferred method: Keep the client-side encoding consistent with the server-side encoding. Then export data using the server-side encoding without performing any transcoding.

  1. Query the database server-side encoding.

    gaussdb=# show server_encoding;

  2. Query the database client-side encoding.

    gaussdb=# show client_encoding;

  3. Keep the client-side encoding consistent with the server-side encoding.

    gaussdb=# set client_encoding = '{server_encoding}';

  4. Execute COPY to export data to a file in standard CSV format. A consolidated example session is sketched after these steps.

    gaussdb=# COPY test_copy TO '/data/test_copy.csv' CSV;
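
Putting the preferred method together, the following is a minimal example session. It assumes that the server encoding reported in step 1 is UTF8 and that the table test_copy already exists; substitute your own encoding, table name, and file path.

    gaussdb=# show server_encoding;
     server_encoding 
    -----------------
     UTF8
    (1 row)
    gaussdb=# set client_encoding = 'UTF8';
    SET
    gaussdb=# COPY test_copy TO '/data/test_copy.csv' CSV;
    -- The file is written in the server encoding; no validity check or transcoding is performed,
    -- so illegally encoded data is exported without triggering an error.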

Alternative solution: Use placeholders ('?') to replace any bytes encoded in an illegal format. This solution depends on the transcoding capability of the database kernel and will alter the content of the exported data.

  1. Query the database server-side encoding.

    gaussdb=# show server_encoding;

  2. Set the database client-side encoding as the target encoding.

    gaussdb=# set client_encoding = '{target_encoding}';

  3. Leverage the kernel's transcoding capability to replace any bytes encoded in an illegal format while exporting data.

    gaussdb=# COPY test_copy TO '/data/test_copy.csv' CSV COMPATIBLE_ILLEGAL_CHARS;

  • Enabling the COMPATIBLE_ILLEGAL_CHARS parameter corrects illegally encoded data in the exported file while keeping the data in the database unchanged (a verification sketch follows this list). Use this parameter only when necessary.
  • When enabled, the COMPATIBLE_ILLEGAL_CHARS parameter:
    • Replaces illegal characters with the character specified by the convert_illegal_char_mode parameter. The default replacement character is '?' (U+003F).
    • Replaces zero characters (U+0000) with spaces (U+0020). If you do not need this replacement, configure zero-character support in the corresponding compatibility mode.
  • For detailed limitations on the COMPATIBLE_ILLEGAL_CHARS parameter, refer to the COMPATIBLE_ILLEGAL_CHARS description in the COPY section.
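
If you want to confirm that COMPATIBLE_ILLEGAL_CHARS only adjusts the exported file and leaves the stored data untouched, you can compare the raw bytes of the column before and after the export. The following is a minimal sketch; it assumes an ORA-compatible database in which the dbe_raw package is available (as in the scenario example below) and a table test_copy with columns id and content.

    -- Raw bytes before the export
    gaussdb=# select id, dbe_raw.cast_from_varchar2_to_raw(content) from test_copy;
    -- Export with error tolerance enabled; replacements are applied only to the file content
    gaussdb=# COPY test_copy TO '/data/test_copy.csv' CSV COMPATIBLE_ILLEGAL_CHARS;
    -- Raw bytes after the export: identical to the values returned before the export
    gaussdb=# select id, dbe_raw.cast_from_varchar2_to_raw(content) from test_copy;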

Typical Scenario Examples

The processing logic for standalone zero characters is straightforward. This section therefore focuses on the error-tolerance workflow for more complex scenarios in which the data stream contains both zero characters (\0) and illegally encoded characters.

  1. Build UTF-8 zero characters and illegal characters.
    gaussdb=# create database db_utf8 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'ORA';
    CREATE DATABASE
    gaussdb=# \c db_utf8
    Non-SSL connection (SSL connection is recommended when requiring high-security)
    You are now connected to database "db_utf8" as user "omm".
    db_utf8=# create table test_encodings(id int, content text);
    NOTICE:  The 'DISTRIBUTE BY' clause is not specified. Using 'id' as the distribution column by default.
    HINT:  Please use 'DISTRIBUTE BY' clause to specify suitable data distribution column.
    CREATE TABLE
    db_utf8=# insert into test_encodings values(1, dbe_raw.cast_to_varchar2(dbe_raw.concat(hextoraw('2297'),
        dbe_raw.cast_from_varchar2_to_raw('Import/Export'))));
    INSERT 0 1
    db_utf8=# show client_encoding;
     client_encoding 
    -----------------
     UTF8
    (1 row)
    -- The content of this row contains the byte 0x97, which is not a valid UTF-8 encoding.
    db_utf8=# select *, dbe_raw.cast_from_varchar2_to_raw(content) from test_encodings;
     id |  content  |  cast_from_varchar2_to_raw   
    ----+-----------+------------------------------
      1 | "Import/Export | 2297E5AFBCE585A5E5AFBCE587BA
    (1 row)
  2. Export the data to files. If the server's character set is used as the target, the data is exported directly without transcoding. If a different character set is selected, transcoding is required, and the system reports an error when it encounters the illegal UTF-8 byte 0x97. In that case, enable the compatible_illegal_chars parameter so that the file can be exported successfully.
    db_utf8=# copy test_encodings to '/home/xy/encodings.txt.utf8' encoding 'utf-8';
    COPY 1
    db_utf8=# copy test_encodings to '/home/xy/encodings.txt.gb18030' encoding 'gb18030';
    ERROR:  invalid byte sequence for encoding "UTF8": 0x97
    db_utf8=# copy test_encodings to '/home/xy/encodings.txt.gb18030' encoding 'gb18030' compatible_illegal_chars;
    COPY 1
  3. Open the /home/xy/encodings.txt.utf8 file with UTF-8 encoding. In this export, the support_zero_character option and the compatible_illegal_chars parameter are disabled, and the second column of the first row contains garbled characters. No explicit error is reported, but the hexdump command reveals the garbled bytes. You can follow this example to reproduce the problem; the specific byte-level details are not listed here. A complementary zero-character sketch follows this example.
    1	"Import/Export
  4. Open the /home/xy/encodings.txt.gb18030 file with GB18030 encoding. You will notice that the illegal character in the second column of the first row has been replaced by a question mark (?).
    1	"?Import/Export
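
The example above exercises only the illegal byte 0x97. The following sketch outlines how the zero-character case could additionally be reproduced in the same db_utf8 database. It is hypothetical: the table test_zero is introduced only for illustration, and it assumes that zero-character storage is enabled for your compatibility mode (the support_zero_character option mentioned in step 3). According to the COMPATIBLE_ILLEGAL_CHARS notes above, the zero character is expected to appear as a space (U+0020) in the exported file.

    -- Hypothetical: a row whose content begins with a zero character (0x00)
    db_utf8=# create table test_zero(id int, content text);
    db_utf8=# insert into test_zero values(1, dbe_raw.cast_to_varchar2(dbe_raw.concat(hextoraw('00'),
        dbe_raw.cast_from_varchar2_to_raw('Import/Export'))));
    -- With compatible_illegal_chars enabled, the zero character is written to the file as a space
    db_utf8=# copy test_zero to '/home/xy/zero.txt.gb18030' encoding 'gb18030' compatible_illegal_chars;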