

Data Import and Export Faults with GDS

Updated on 2023-04-18 GMT+08:00

When GDS is used to import or export data, character set problems can occur, especially when data is migrated between databases of different types or different encodings. The import then fails, blocking data migration and other onsite operations.

Locale Support

Locale support means that applications respect cultural preferences regarding alphabets, sort order, number formats, and so on. A locale is initialized automatically when a database is created with initdb. By default, initdb uses the locale of its execution environment, as preset in the system. To choose a different locale, run initdb --locale=xx.

If you need to use the rules of multiple locales, you can use the following locale categories to control the localization rules. These categories are converted to initdb options, indicating locale selections of a particular category.

Table 1 Locale support

Category      Description
LC_COLLATE    Character-collation or string-collation order.
LC_CTYPE      Character classification, for example, what counts as a character, case conversion, and others.
LC_MESSAGES   The language used by messages.
LC_MONETARY   Formatting of monetary amounts.
LC_NUMERIC    Formatting of non-monetary numeric information.
LC_TIME       Formatting of time and date information.

If you want the system to behave as if it had no locale support, use the C locale or its equivalent, the POSIX locale. Non-C and non-POSIX locales may hurt performance: they slow character processing and prevent ordinary indexes from being used with the LIKE clause. Therefore, use non-C or non-POSIX locales only when they are actually required.
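For example, under a non-C locale an ordinary b-tree index cannot serve LIKE prefix searches; PostgreSQL-style databases work around this with a pattern operator class (a sketch; the table and index names are hypothetical):

```sql
-- Ordinary index: usable for LIKE 'abc%' prefix matching only under the C locale.
CREATE INDEX idx_name ON t (name);

-- Under other locales, an operator-class index restores LIKE prefix support:
CREATE INDEX idx_name_pattern ON t (name text_pattern_ops);
```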

Some locale category values must be fixed when a database is created. Different databases can use different settings, but once a database is created, these values cannot be changed in that database. LC_COLLATE and LC_CTYPE are such categories: they determine the sort order of indexes, so they must stay fixed, or indexes on text columns would become corrupted. The default values of these two categories are determined when initdb runs and are used for newly created databases unless the CREATE DATABASE command specifies otherwise. The other locale categories can be changed at any time by setting the server configuration parameter with the same name as the category. The values chosen by initdb are written to the postgresql.conf file and serve as defaults when the server starts. If you remove them from postgresql.conf, the server inherits the settings from its execution environment.
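For example (a PostgreSQL-style sketch; the database name and locale values are hypothetical), LC_COLLATE and LC_CTYPE can only be set at creation time, while other categories can be changed per session:

```sql
-- Fix collation and character classification at creation time;
-- TEMPLATE template0 is needed when the locales differ from the template's.
CREATE DATABASE mydb
    WITH LC_COLLATE = 'zh_CN.UTF-8'
         LC_CTYPE   = 'zh_CN.UTF-8'
         TEMPLATE   = template0;

-- Other locale categories map to same-named server parameters and can
-- be changed at any time:
SET lc_monetary = 'zh_CN.UTF-8';
```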

In particular, the locale setting affects the following SQL features:

  • Data collation in the process of queries that use ORDER BY or a standard comparison operator on text data
  • UPPER, LOWER, and INITCAP functions
  • Pattern-matching operators (such as LIKE, SIMILAR TO, and POSIX-style regular expressions), including case-insensitive matching and character classification in regular expressions
  • TO_CHAR functions

Therefore, inconsistent query result sets in these scenarios are probably caused by the character set.
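For instance, the same query can return different orderings under different collations (a sketch; the table, data, and the availability of the "en_US" collation are assumptions):

```sql
CREATE TABLE names (n text);
INSERT INTO names VALUES ('a'), ('B'), ('c');

-- Byte-order comparison: uppercase letters sort before lowercase.
SELECT n FROM names ORDER BY n COLLATE "C";      -- B, a, c

-- A language-aware locale typically interleaves cases: a, B, c.
SELECT n FROM names ORDER BY n COLLATE "en_US";
```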

Collation Support

This feature allows you to specify data collation and character classification behavior per column, or even per operation, which relaxes the restriction that the LC_COLLATE and LC_CTYPE settings cannot be changed after the database is created.

The collation of an expression can be the default rule, which matches the locale of the database, or it can be indeterminate. In the latter case, collation operations and other operations that depend on the collation rule will fail.

When the database system must perform collation or character classification, it uses the collation rule of the input expression. This happens when you use the ORDER BY clause and functions, or when you call an operator (such as <). The collation rule used by the ORDER BY clause is the collation rule of the sort key. The collation rule used by a function or an operator call is derived from their parameters. In addition to comparison operators, functions (such as LOWER, UPPER, and INITCAP) that convert between uppercase and lowercase letters will refer to the collation rule. Pattern matching operators and TO_CHAR related functions also need to refer to the collation rule.

For a function or operator call, the collation rule to use is derived from its arguments when the operation is performed. If the result of the call is of a collatable data type, that collation is also used as the collation of the function or operator expression, in case an enclosing expression requires the collation of its result.

The collation derivation of an expression can be implicit or explicit. This distinction determines how collations are organized when multiple different collations appear in an expression. An explicit collation derivation occurs when a COLLATE clause is used. All other collation derivations are implicit. When multiple collations need to be organized, for example in a function call, the following rules are used:

  1. If any input expression has an explicit collation derivation, all explicitly derived collations among the input expressions must be the same; otherwise, an error is raised. If any explicitly derived collation exists, it is the result of the collation combination.
  2. Otherwise, all input expressions must have the same implicit collation derivation or the default collation. If any non-default collation exists, that is the result of the collation combination. Otherwise, the result is the default collation.
  3. If there are conflicting non-default implicit collations among the input expressions, the combination is deemed to have indeterminate collation. This is not an error unless the particular function called requires knowledge of the collation it should apply. If it does, an error will be raised at runtime.
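These rules can be illustrated as follows (a sketch; the table, columns, and the availability of the named collations are assumptions):

```sql
-- Columns a and b carry different column-level (implicit) collations.
CREATE TABLE test1 (
    a text COLLATE "de_DE",
    b text COLLATE "es_ES"
);

-- Rule 1: an explicit COLLATE clause determines the result collation.
SELECT a < ('foo' COLLATE "fr_FR") FROM test1;

-- Rule 3: a and b have conflicting implicit collations, so the combination
-- is indeterminate; '<' needs a collation, so this raises a runtime error.
SELECT a < b FROM test1;

-- Rule 1 again: conflicting *explicit* collations are an immediate error.
SELECT a COLLATE "de_DE" < b COLLATE "es_ES" FROM test1;
```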

Character Set Support

The character set support in PostgreSQL enables you to store text in various character sets (also called encodings), including single-byte character sets such as the ISO 8859 series and multi-byte character sets such as EUC (Extended Unix Code), UTF-8, and Mule internal code. MPPDB mainly uses the GBK, UTF-8, and LATIN1 character sets. All supported character sets can be used transparently by clients, but a few are not supported as server-side encodings. (In PostgreSQL, GBK is a client-side encoding only; MPPDB additionally allows GBK as a server-side encoding, which is the root cause of many of these problems.) The default character set is selected when the PostgreSQL database is initialized with initdb. It can be overridden when you create a database, so you can have multiple databases, each with a different character set. An important restriction is that each database's character set must be compatible with its LC_CTYPE (character classification) and LC_COLLATE (string sort order) locale settings. For the C or POSIX locale, any character set is allowed, but for other locales there is only one character set that works correctly. (On Windows, however, UTF-8 encoding can be used with any locale.)

The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0-127 according to the ASCII standard, and byte values 128-255 are treated as uninterpreted characters. No encoding conversion is done under this setting. Therefore, SQL_ASCII is not a declaration that a specific encoding is in use, but rather a declaration of ignorance about the encoding. In most cases, if you are working with non-ASCII data, it is unwise to use SQL_ASCII because PostgreSQL is unable to help you convert or validate non-ASCII characters.

Which encoding is supported by a database system is determined by three aspects: database server support, database access interface support, and client support.

  • Database server support

    Database server support means that the server can receive and store characters in a certain encoding, return them (including identifiers and character field values) to the client, and convert them to other encodings, for example, from UTF-8 to GBK.

    You can specify the server encoding when creating a database: CREATE DATABASE ... ENCODING ... (ASCII, UTF-8, EUC_CN, and more are supported).

    You can check the database encoding: SHOW server_encoding;

  • Database access interface support

    Database access interface support means that the API must be able to correctly read and write characters of a certain encoding, without loss or distortion. Take the JDBC interface as an example:

    The JDBC driver sets client_encoding based on the JVM's file.encoding property: set client_encoding to file_encoding.

    It then converts each string to a byte stream in the client_encoding and sends the byte stream to the server. Prototype: String.getBytes(client_encoding).

    After receiving a byte stream from the server, it uses client_encoding to construct a String object and returns it to the application as the return value of getString. Prototype: String(byte[], ..., client_encoding).

  • Client support

    Client support means that the client can display the characters read from the database and can submit characters to the server.

    You can set the client encoding for a session: SET CLIENT_ENCODING TO 'value';

    You can check the client encoding: SHOW client_encoding;
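Putting the pieces together, a session can inspect and adjust both sides of the conversion (a sketch; the GBK value assumes the server encoding supports conversion to it):

```sql
SHOW server_encoding;            -- encoding the data is stored in
SHOW client_encoding;            -- encoding used on the wire for this client
SET client_encoding TO 'GBK';    -- ask the server to convert for this session
```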

Solutions to Character Set Problems During GDS Import and Export

Problem 1: 0x00 characters cannot be saved to the database. ERROR: invalid byte sequence for encoding "UTF8": 0x00

Cause: The PostgreSQL database does not allow 0x00 (NUL) characters in text data. This restriction comes from the PostgreSQL baseline; other databases do not have this problem.

Solution:

  1. Replace 0x00 characters.
  2. Use the compatible_illegal_chars option, which both the COPY command and GDS support (it can be set on the COPY command or altered on the GDS foreign table). When it is enabled, invalid single-byte or multi-byte characters are replaced with spaces or question marks (?). The data can then be imported successfully, but the original data is changed.
  3. Create a database whose encoding is SQL_ASCII and set client_encoding to SQL_ASCII as well (via the COPY command or in the GDS foreign table). This avoids special processing and conversion of the character set; however, all server-side sorting, comparison, and related processing then treats the data as single bytes.
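For example, solution 2 can be applied on a GDS foreign table roughly as follows (a sketch based on GaussDB(DWS) foreign-table syntax; the table name, GDS address, columns, and target table are hypothetical):

```sql
-- Hypothetical GDS foreign table; compatible_illegal_chars makes GDS
-- replace invalid characters instead of aborting the import.
CREATE FOREIGN TABLE ft_products
(
    id   integer,
    name text
)
SERVER gsmpp_server
OPTIONS (
    location 'gsfs://192.168.0.90:5000/products*',
    format   'text',
    encoding 'utf8',
    delimiter '|',
    compatible_illegal_chars 'true'
);

-- Or enable it on an existing foreign table (use ADD if the option
-- was not set before, SET if it was):
ALTER FOREIGN TABLE ft_products OPTIONS (SET compatible_illegal_chars 'true');

-- Then run the import:
INSERT INTO products SELECT * FROM ft_products;
```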
