数据存在错误时的导入操作指南

数据导入过程中的容错处理机制提供两种可选模式。

智能修正模式（自适应入库）

处理原则：以数据完整性优先，通过智能修正确保最大程度的数据入库。
适用场景：导入数据中出现列数异常（多列/缺失列/废弃列）以及字符异常的情况。
处理流程：
1. 处理列数异常（冗余列截断/缺失列补位/废弃列丢弃）。
2. 进行字符集转码与非法字符清洗。
3. 完整写入目标数据库。
输出结果：修正后的数据集。字符异常的情况会有GaussDB日志记录。

使用示例：

多列：为数据列与表列多的情况。此时执行COPY导入时并指定ignore_extra_data选项，GaussDB会把前面正确列的数据正常导入表中，而多余列的数据将被舍弃。

gaussdb=# create database db1 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'A';
CREATE DATABASE
gaussdb=# \c db1
Non-SSL connection (SSL connection is recommended when requiring high-security)
You are now connected to database "db1" as user "omm".
db1=# create table test_copy(id int, content text);
CREATE TABLE
db1=# copy test_copy from stdin delimiter ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1,导入,导出
>>\. 
ERROR:  extra data after last expected column
CONTEXT:  COPY test_copy, line 1: "1,导入,导出"
--未指定ignore_extra_data时导入失败，指定后导入成功。
db1=# copy test_copy from stdin delimiter ',' ignore_extra_data;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1,导入,导出
>> \.
COPY 1
db1=# select * from test_copy;
 id | content 
----+---------
  1 | 导入
(1 row)

缺失列：当数据列少于表列时，执行COPY导入并指定fill_missing_fields选项，GaussDB会按顺序将数据与表字段进行匹配导入。对于没有对应数据的字段，若其有默认值，则设置为默认值；否则，将其设置为NULL。

        
         
           
           
             gaussdb=# create database db1 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'A';
CREATE DATABASE
gaussdb=# \c db1
db1=# create table test_copy(id int, content text, comment1 text, comment2 text);
CREATE TABLE
db1=# copy test_copy from stdin delimiter ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1,导入
>> \.
ERROR:  missing data for column "comment1"
CONTEXT:  COPY test_copy, line 1: "1,导入"
--未指定fill_missing_fields时导入失败，指定后导入成功。
db1=# copy test_copy from stdin delimiter ',' fill_missing_fields 'multi';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1,导入
>> \.
COPY 1
db1=# select * from test_copy;
 id | content | comment1 | comment2 
----+---------+----------+----------
  1 | 导入    |          | 
(1 row)

            

          

        
       

废弃列：当数据中的某些列不需要入库时，执行COPY导入操作并指定FILLER选项，GaussDB会跳过指定列的字段。对于数据库中对应的字段，若其有默认值，则设置为默认值；否则，将其设置为NULL。

        
         
           
           
             gaussdb=# create database db1 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'A';
CREATE DATABASE
gaussdb=# \c db1
db1=# create table test_copy(id int, comment text default '导出', content text);
CREATE TABLE
db1=# copy test_copy from stdin delimiter ',';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>>1,不需要导入,导入
>\.    
COPY 1
db1=# select * from test_copy;
 id |  comment   | content 
----+------------+---------
  1 | 不需要导入 | 导入
(1 row)
--若未指定filler(comment)，导入时第2列内容会被导入comment字段；指定该选项后则不会导入。在示例里，指定该选项时，comment列会插入默认值。
db1=# copy test_copy(id,comment,content) from stdin delimiter ',' filler (comment);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1,不需要导入,导入
>> \.
COPY 1
db1=# select * from test_copy;
 id |  comment   | content 
----+------------+---------
  1 | 不需要导入 | 导入
  1 | 导出       | 导入
(2 rows)

            

          

        
       

字符异常：在处理字符异常问题时，需要根据服务端与客户端编码是否一致，采用不同的应对策略。

编码一致：当服务端与客户端编码一致时，若待导入的数据中存在非法编码字符或零字符，可以通过设置GUC参数copy_special_character_version='no_error'来实现特定的导入处理。在这种设置下，GaussDB会对不符合编码信息的数据进行容错处理，不会抛出错误信息，而是直接按照原始编码将数据插入到表中。

数据文件可参考数据存在错误时的导出操作指南中示例生成的/home/xy/encodings.txt.utf8，同时创建UTF-8编码的数据库来模拟编码异常的文件在导入时不需要转码的场景。

          
           
             
             
               gaussdb=# create database db_utf8 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'A';
CREATE DATABASE
gaussdb=# \c db_utf8
Non-SSL connection (SSL connection is recommended when requiring high-security)
You are now connected to database "db_utf8" as user "omm".
db_utf8=# create table test_encodings(id int, content text);
CREATE TABLE
db_utf8=# show copy_special_character_version;
 copy_special_character_version 
--------------------------------

(1 row)
db_utf8=# copy test_encodings from '/home/omm/temp/encodings.txt.utf8';
ERROR:  invalid byte sequence for encoding "UTF8": 0x97
CONTEXT:  COPY test_encodings, line 2
db_utf8=# set copy_special_character_version = 'no_error';
SET
db_utf8=# copy test_encodings from '/home/xy/encodings.txt.utf8';
COPY 2
db_utf8=# select * from test_encodings;
 id |  content  
----+-----------
  1 | 导入
  2 | "导入导出
(2 rows)

              

            

          
         

编码不一致：当服务端与客户端编码不一致时，数据转码成为必要步骤。COPY为此提供了compatible_illegal_chars参数来支持转码操作。在数据导入过程中，当GaussDB遇到非法字符时，会启用容错机制，对这些非法字符进行转换，将转换后的字符存入数据库，并且不会报错，也不会中断整个导入流程。这使得在编码不一致的复杂环境下，数据导入工作依然能够高效、稳定地完成。

严格校验模式（精准入库）

处理原则：以数据准确性优先，确保入库数据的绝对合规性。
适用场景：在医疗记录、金融交易等对数据精度要求极高的领域，若担心智能修正模式在导入数据时自动修正会影响数据准确性，可采用此模式。
处理流程：
1. 执行多级校验（列数异常、字符异常、数据类型转换异常及约束冲突异常）。
2. 生成错误诊断报告（含行号、错误类型、错误数据）。
3. 建立错误数据隔离区。
4. 仅通过校验的原始数据直接入库。
输出结果：纯净数据集以及错误明细报告（通过gs_copy_error_log与gs_copy_summary查看）。

容错级别：

Level1容错：适用于智能修正模式处理的所有异常，如数据源列数过多、数据类型转换错误、字段超长、转码异常等。具体使用方法如下：

        
             --当导入数据过程中出现数据类型错误的次数不超过 100 次时，导入不会报错，会继续导入下一行。若超过 100 次，则正常报错。错误数据的详情及行号会记录在gs_copy_error_log表中。
gaussdb=# copy test_copy from '/home/omm/temp/test.csv' log errors reject limit '100' csv;
--相较于上条语句，下面这条会在gs_copy_error_log中额外记录错误行的完整数据，在无数据安全风险的场景下推荐使用。该语句需要系统管理员权限。
gaussdb=# copy test_copy from '/home/omm/temp/test.csv' log errors data reject limit '100' csv;

Level2容错：在Level1基础上，还支持处理数据中的约束冲突错误，包括非空约束、条件约束、主键约束、唯一性约束和唯一性索引等问题。具体使用方法如下：

        
             --设置如下GUC后，采用与Level1容错相同的COPY语句，容错逻辑也与Level1一致，只是额外支持约束冲突类型的错误。
gaussdb=# set a_format_load_with_constraints_violation = 's2';
gaussdb=# copy test_copy from '/home/omm/temp/test.csv' log errors data reject limit '100' csv;

使用示例：

Level1容错：

        
         
           
           
             gaussdb=# create database db1 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'A';
CREATE DATABASE
gaussdb=# \c db1
Non-SSL connection (SSL connection is recommended when requiring high-security)
You are now connected to database "db1" as user "omm".
db1=# create table t1(a int primary key, b char(2) not null, c text check(length(c) < 2), comment text);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "t1_pkey" for table "t1"
CREATE TABLE
db1=# copy t1 from stdin;
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1	a	a	期望正常导入
>> a	b		列1类型错误，Level1容错，记录入错误表
>> 3	ccc		列2长度超长，Level1容错，记录入错误表
>> 4	d	d	额外多列5，Level1容错，记录入错误表	e
>> \.
ERROR:  invalid input syntax for integer: "a"
CONTEXT:  COPY t1, line 2, column a: "a"
db1=# copy t1 from stdin log errors data reject limit '100';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1	a	a	期望正常导入
>> a	b		列1类型错误，Level1容错，记录入错误表
>> 3	ccc		列2长度超长，Level1容错，记录入错误表
>> 4	d	d	额外多列5，Level1容错，记录入错误表	e
>> \.
db1=# select * from t1;
 a | b  | c |   comment    
---+----+---+--------------
 1 | a  | a | 期望正常导入
(1 row)

db1=# select * from pgxc_copy_error_log;
  relname  |           begintime           | filename | lineno |                             rawrecord                             |       
         detail                 
-----------+-------------------------------+----------+--------+-------------------------------------------------------------------+-------
--------------------------------
 public.t1 | 2025-03-21 18:57:34.921818+08 | STDIN    |      2 | a       b               列1类型错误，Level1容错，记录入错误表     | invali
d input syntax for integer: "a"
 public.t1 | 2025-03-21 18:57:34.921818+08 | STDIN    |      3 | 3       ccc             列2长度超长，Level1容错，记录入错误表     | value 
too long for type character(2)
 public.t1 | 2025-03-21 18:57:34.921818+08 | STDIN    |      4 | 4       d       d       额外多列5，Level1容错，记录入错误表     e | extra 
data after last expected column
(3 rows)

            

          

        
       

Level2容错：

gaussdb=# create database db1 encoding='UTF-8' LC_COLLATE='en_US.UTF-8' LC_CTYPE ='en_US.UTF-8' dbcompatibility = 'A';
CREATE DATABASE
gaussdb=# \c db1
Non-SSL connection (SSL connection is recommended when requiring high-security)
You are now connected to database "db1" as user "omm".
db1=# create table t1(a int primary key, b char(2) not null, c text check(length(c) < 2), comment text);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "t1_pkey" for table "t1"
CREATE TABLE
db1=# copy t1 from stdin log errors data reject limit '100';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1	a	a	期望正常导入
>> a	b		列1类型错误，Level1容错，记录入错误表
>> 3	ccc		列2长度超长，Level1容错，记录入错误表
>> 4	d	d	额外多列5，Level1容错，记录入错误表	e
>> 1	a		列1主键约束冲突，Level2容错，记录入错误表
>> 1			列2非空约束冲突，Level2容错，记录入错误表
>> 1	a	aaa	列3check约束冲突，Level2容错，记录入错误表
>> \.
ERROR:  The null value in column "b" violates the not-null constraint.
DETAIL:  Failing row contains (1, null, null, 列2非空约束冲突，Level2容错，记录入错误表).
CONTEXT:  COPY t1, line 6: "1			列2非空约束冲突，Level2容错，记录入错误表"
db1=# set a_format_load_with_constraints_violation = 's2';
SET
db1=# copy t1 from stdin log errors data reject limit '100';
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> 1	a	a	期望正常导入
>> a	b		列1类型错误，Level1容错，记录入错误表
>> 3	ccc		列2长度超长，Level1容错，记录入错误表
>> 4	d	d	额外多列5，Level1容错，记录入错误表	e
>> 1	a		列1主键约束冲突，Level2容错，记录入错误表
>> 1			列2非空约束冲突，Level2容错，记录入错误表
>> 1	a	aaa	列3check约束冲突，Level2容错，记录入错误表
>> \.
db1=# select * from t1;
 a | b  | c |   comment    
---+----+---+--------------
 1 | a  | a | 期望正常导入
(1 row)

db1=# select * from pgxc_copy_error_log;
  relname  |           begintime           | filename | lineno |                             rawrecord                              |      
                        detail                              
-----------+-------------------------------+----------+--------+--------------------------------------------------------------------+------
------------------------------------------------------------
 public.t1 | 2025-03-21 18:57:34.921818+08 | STDIN    |      2 | a       b               列1类型错误，Level2容错，记录入错误表      | inval
id input syntax for integer: "a"
 public.t1 | 2025-03-21 18:57:34.921818+08 | STDIN    |      3 | 3       ccc             列2长度超长，Level2容错，记录入错误表      | value
 too long for type character(2)
 public.t1 | 2025-03-21 18:57:34.921818+08 | STDIN    |      4 | 4       d       d       额外多列5，Level2容错，记录入错误表     e  | extra
 data after last expected column
 public.t1 | 2025-03-21 19:04:58.041069+08 | STDIN    |      2 | a       b               列1类型错误，Level1容错，记录入错误表      | inval
id input syntax for integer: "a"
 public.t1 | 2025-03-21 19:04:58.041069+08 | STDIN    |      3 | 3       ccc             列2长度超长，Level1容错，记录入错误表      | value
 too long for type character(2)
 public.t1 | 2025-03-21 19:04:58.041069+08 | STDIN    |      4 | 4       d       d       额外多列5，Level1容错，记录入错误表     e  | extra
 data after last expected column
 public.t1 | 2025-03-21 19:04:58.041069+08 | STDIN    |      6 | 1                       列2非空约束冲突，Level2容错，记录入错误表  | The n
ull value in column "b" violates the not-null constraint.
 public.t1 | 2025-03-21 19:04:58.041069+08 | STDIN    |      5 | 1                       列2非空约束冲突，Level2容错，记录入错误表  | Dupli
cate key value violates unique constraint "t1".
 public.t1 | 2025-03-21 19:04:58.041069+08 | STDIN    |      7 | 1       a       aaa     列3check约束冲突，Level2容错，记录入错误表 | New r
ow in relation "t1" violates check constraint "t1_c_check".
(9 rows)

总结

智能修正模式与严格校验模式可以结合使用，且智能修正模式具有优先级。在进行COPY导入时，若已明确指定对数据异常采用智能修正，那么该行数据的处理将不会触发严格校验模式。这意味着错误表不会记录相应数据，同时也不会扣除reject limit次数。建议用户根据自身实际情况，权衡是否自动修正列异常与字符异常后入库，还是直接舍弃。

对于严格校验模式的两个级别，推荐用户默认选择Level1。这是因为Level1所支持的错误类型较为常见，并且不会对导入性能产生任何影响。而Level2目前仅在集中式A兼容环境下支持，开启该特性会额外消耗导入性能和内存资源。具体信息请参见COPY支持约束冲突容错能力。因此，不建议用户默认使用Level2，仅在明确数据存在约束类型冲突时再开启。

父主题： COPY导入导出最佳实践

上一篇：数据存在错误时的导出操作指南

下一篇：工具导入导出最佳实践