

Data Import and Export Faults with GDS

Updated on 2023-04-18 GMT+08:00

When GDS is used to import or export data, character set problems can occur, especially when data is migrated between databases of different types or different encodings. The import then fails, blocking data migration and other onsite operations.

Locale Support

Locale support means that applications respect cultural preferences regarding alphabets, sort order, number formats, and so on. A locale is initialized automatically when a database is created with initdb. By default, initdb uses the locale of its execution environment, as preset in the system. To choose a different locale, run initdb --locale=xx.

If you need to use the rules of multiple locales, you can use the following locale categories to control the localization rules. These categories are converted to initdb options, indicating locale selections of a particular category.

Table 1 Locale support

Category      Description
LC_COLLATE    Character-collation or string-collation order.
LC_CTYPE      Character classification, for example, what counts as a character, case conversion, and others.
LC_MESSAGES   The language used by messages.
LC_MONETARY   Formatting of monetary amounts.
LC_NUMERIC    Formatting of non-monetary numeric information.
LC_TIME       Formatting of time and date information.

If you want the system to behave as if it had no locale support, use the C locale or its equivalent, the POSIX locale. Non-C and non-POSIX locales may hurt performance: they slow character processing and prevent ordinary indexes from being used with the LIKE clause. Therefore, use non-C or non-POSIX locales only when they are actually required.
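For example, under a non-C locale an ordinary b-tree index cannot serve LIKE prefix searches; PostgreSQL-style databases work around this with a pattern operator class (a sketch; the table and index names are hypothetical):

```sql
-- Ordinary index: usable for LIKE 'abc%' prefix matching only under the C locale.
CREATE INDEX idx_name ON t (name);

-- Under other locales, an operator-class index restores LIKE prefix support:
CREATE INDEX idx_name_pattern ON t (name text_pattern_ops);
```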

Some locale category values must be fixed when a database is created. Different databases can use different settings, but once a database is created, these values cannot be changed in that database. LC_COLLATE and LC_CTYPE are such categories: they determine the sort order of indexes, so they must stay fixed, or indexes on text columns would become corrupted. The default values of these two categories are determined when initdb runs and are used for newly created databases unless the CREATE DATABASE command specifies otherwise. The other locale categories can be changed at any time by setting the server configuration parameter with the same name as the category. The values chosen by initdb are written to the postgresql.conf file and serve as defaults when the server starts. If you remove them from postgresql.conf, the server inherits the settings from its execution environment.
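For example (a PostgreSQL-style sketch; the database name and locale values are hypothetical), LC_COLLATE and LC_CTYPE can only be set at creation time, while other categories can be changed per session:

```sql
-- Fix collation and character classification at creation time;
-- TEMPLATE template0 is needed when the locales differ from the template's.
CREATE DATABASE mydb
    WITH LC_COLLATE = 'zh_CN.UTF-8'
         LC_CTYPE   = 'zh_CN.UTF-8'
         TEMPLATE   = template0;

-- Other locale categories map to same-named server parameters and can
-- be changed at any time:
SET lc_monetary = 'zh_CN.UTF-8';
```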

In particular, the locale setting affects the following SQL features:

  • Data collation in the process of queries that use ORDER BY or a standard comparison operator on text data
  • UPPER, LOWER, and INITCAP functions
  • Pattern-matching operators (such as LIKE, SIMILAR TO, and POSIX-style regular expressions), including case-insensitive matching and character classification in regular expressions
  • TO_CHAR functions

Therefore, inconsistent query result sets in these scenarios are probably caused by the character set.
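For instance, the same query can return different orderings under different collations (a sketch; the table, data, and the availability of the "en_US" collation are assumptions):

```sql
CREATE TABLE names (n text);
INSERT INTO names VALUES ('a'), ('B'), ('c');

-- Byte-order comparison: uppercase letters sort before lowercase.
SELECT n FROM names ORDER BY n COLLATE "C";      -- B, a, c

-- A language-aware locale typically interleaves cases: a, B, c.
SELECT n FROM names ORDER BY n COLLATE "en_US";
```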

Collation Support

This feature allows you to specify data collation and character classification behavior per column, or even per operation, which relaxes the restriction that the LC_COLLATE and LC_CTYPE settings cannot be changed after the database is created.

The collation of an expression can be the default rule, which matches the locale of the database, or it can be indeterminate. In the latter case, collation operations and other operations that depend on the collation rule will fail.

When the database system must perform collation or character classification, it uses the collation rule of the input expression. This happens when you use the ORDER BY clause and functions, or when you call an operator (such as <). The collation rule used by the ORDER BY clause is the collation rule of the sort key. The collation rule used by a function or an operator call is derived from their parameters. In addition to comparison operators, functions (such as LOWER, UPPER, and INITCAP) that convert between uppercase and lowercase letters will refer to the collation rule. Pattern matching operators and TO_CHAR related functions also need to refer to the collation rule.

For a function or operator call, the collation rule to use is derived from its arguments when the operation is performed. If the result of the call is of a collatable data type, that collation is also used as the collation of the function or operator expression, in case an enclosing expression requires the collation of its result.

The collation derivation of an expression can be implicit or explicit. This distinction determines how collations are organized when multiple different collations appear in an expression. An explicit collation derivation occurs when a COLLATE clause is used. All other collation derivations are implicit. When multiple collations need to be organized, for example in a function call, the following rules are used:

  1. If any input expression has an explicit collation derivation, all explicitly derived collations among the input expressions must be the same; otherwise, an error is raised. If any explicitly derived collation exists, it is the result of the collation combination.
  2. Otherwise, all input expressions must have the same implicit collation derivation or the default collation. If any non-default collation exists, that is the result of the collation combination. Otherwise, the result is the default collation.
  3. If there are conflicting non-default implicit collations among the input expressions, the combination is deemed to have indeterminate collation. This is not an error unless the particular function called requires knowledge of the collation it should apply. If it does, an error will be raised at runtime.
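These rules can be illustrated as follows (a sketch; the table, columns, and the availability of the named collations are assumptions):

```sql
-- Columns a and b carry different column-level (implicit) collations.
CREATE TABLE test1 (
    a text COLLATE "de_DE",
    b text COLLATE "es_ES"
);

-- Rule 1: an explicit COLLATE clause determines the result collation.
SELECT a < ('foo' COLLATE "fr_FR") FROM test1;

-- Rule 3: a and b have conflicting implicit collations, so the combination
-- is indeterminate; '<' needs a collation, so this raises a runtime error.
SELECT a < b FROM test1;

-- Rule 1 again: conflicting *explicit* collations are an immediate error.
SELECT a COLLATE "de_DE" < b COLLATE "es_ES" FROM test1;
```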

Character Set Support

The character set support in PostgreSQL enables you to store text in various character sets (also called encodings), including single-byte character sets such as the ISO 8859 series and multi-byte character sets such as EUC (Extended Unix Code), UTF-8, and Mule internal code. MPPDB mainly uses the GBK, UTF-8, and LATIN1 character sets. All supported character sets can be used transparently by clients, but a few are not supported as server-side encodings. (In PostgreSQL, GBK is a client-side encoding only; MPPDB additionally allows GBK as a server-side encoding, which is the root cause of many of these problems.) The default character set is selected when the PostgreSQL database is initialized with initdb. It can be overridden when you create a database, so you can have multiple databases, each with a different character set. An important restriction is that each database's character set must be compatible with its LC_CTYPE (character classification) and LC_COLLATE (string sort order) locale settings. For the C or POSIX locale, any character set is allowed, but for other locales there is only one character set that works correctly. (On Windows, however, UTF-8 encoding can be used with any locale.)

The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0-127 according to the ASCII standard, and byte values 128-255 are treated as uninterpreted characters. No encoding conversion is done under this setting. Therefore, SQL_ASCII is not a declaration that a specific encoding is in use, but rather a declaration of ignorance about the encoding. In most cases, if you are working with non-ASCII data, it is unwise to use SQL_ASCII because PostgreSQL is unable to help you convert or validate non-ASCII characters.

Which encoding is supported by a database system is determined by three aspects: database server support, database access interface support, and client support.

  • Database server support

    Database server support means that the server can receive and store characters in a certain encoding, return them (including identifiers and character field values) to the client, and convert them to other encodings, for example, from UTF-8 to GBK.

    You can specify the server encoding when creating a database: CREATE DATABASE ... ENCODING ... (ASCII, UTF-8, EUC_CN, and more are supported).

    You can check the database encoding: SHOW server_encoding;

  • Database access interface support

    Database access interface support means that the API must be able to correctly read and write characters of a certain encoding, without loss or distortion. Take the JDBC interface as an example:

    The JDBC driver sets client_encoding based on the JVM's file.encoding property: set client_encoding to file_encoding.

    It then converts each string to a byte stream in the client_encoding and sends the byte stream to the server. Prototype: String.getBytes(client_encoding).

    After receiving a byte stream from the server, it uses client_encoding to construct a String object and returns it to the application as the return value of getString. Prototype: String(byte[], ..., client_encoding).

  • Client support

    Client support means that the client can display the characters read from the database and can submit characters to the server.

    You can set the client encoding for a session: SET CLIENT_ENCODING TO 'value';

    You can check the client encoding: SHOW client_encoding;
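Putting the pieces together, a session can inspect and adjust both sides of the conversion (a sketch; the GBK value assumes the server encoding supports conversion to it):

```sql
SHOW server_encoding;            -- encoding the data is stored in
SHOW client_encoding;            -- encoding used on the wire for this client
SET client_encoding TO 'GBK';    -- ask the server to convert for this session
```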

Solutions to Character Set Problems During GDS Import and Export

Problem 1: 0x00 characters cannot be saved to the database. ERROR: invalid byte sequence for encoding "UTF8": 0x00

Cause: The PostgreSQL database does not allow 0x00 (NUL) characters in text data. This restriction comes from the PostgreSQL baseline; other databases do not have this problem.

Solution:

  1. Replace 0x00 characters.
  2. Use the compatible_illegal_chars option, which both the COPY command and GDS support (it can be set on the COPY command or altered on the GDS foreign table). When it is enabled, invalid single-byte or multi-byte characters are replaced with spaces or question marks (?). The data can then be imported successfully, but the original data is changed.
  3. Create a database whose encoding is SQL_ASCII and set client_encoding to SQL_ASCII as well (via the COPY command or in the GDS foreign table). This avoids special processing and conversion of the character set; however, all server-side sorting, comparison, and related processing then treats the data as single bytes.
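For example, solution 2 can be applied on a GDS foreign table roughly as follows (a sketch based on GaussDB(DWS) foreign-table syntax; the table name, GDS address, columns, and target table are hypothetical):

```sql
-- Hypothetical GDS foreign table; compatible_illegal_chars makes GDS
-- replace invalid characters instead of aborting the import.
CREATE FOREIGN TABLE ft_products
(
    id   integer,
    name text
)
SERVER gsmpp_server
OPTIONS (
    location 'gsfs://192.168.0.90:5000/products*',
    format   'text',
    encoding 'utf8',
    delimiter '|',
    compatible_illegal_chars 'true'
);

-- Or enable it on an existing foreign table (use ADD if the option
-- was not set before, SET if it was):
ALTER FOREIGN TABLE ft_products OPTIONS (SET compatible_illegal_chars 'true');

-- Then run the import:
INSERT INTO products SELECT * FROM ft_products;
```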
