CREATE FOREIGN TABLE (SQL on Hadoop or OBS)
Function
CREATE FOREIGN TABLE (SQL on Hadoop) creates an HDFS foreign table in the current database to access Hadoop structured data stored on HDFS. An HDFS foreign table is read-only. It can only be queried using SELECT.
Syntax
Create an HDFS foreign table.
1 2 3 4 5 6 7 8 9 10 11 | CREATE FOREIGN TABLE [ IF NOT EXISTS ] table_name
( [ { column_name type_name
[ { [CONSTRAINT constraint_name] NULL |
[CONSTRAINT constraint_name] NOT NULL |
column_constraint [...]} ] |
table_constraint [, ...]} [, ...] ] )
SERVER server_name
OPTIONS ( { option_name ' value ' } [, ...] )
DISTRIBUTE BY {ROUNDROBIN | REPLICATION}
[ PARTITION BY ( column_name ) [ AUTOMAPPED ] ] ;
|
- column_constraint is as follows:
1 2 3
[CONSTRAINT constraint_name] {PRIMARY KEY | UNIQUE} [NOT ENFORCED [ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION] | ENFORCED]
- table_constraint is as follows:
1 2 3
[CONSTRAINT constraint_name] {PRIMARY KEY | UNIQUE} (column_name) [NOT ENFORCED [ENABLE QUERY OPTIMIZATION | DISABLE QUERY OPTIMIZATION] | ENFORCED]
Parameter Description
- IF NOT EXISTS
Does not throw an error if a table with the same name exists. A notice is issued in this case.
- table_name
Specifies the name of the foreign table to be created.
Value range: a string. It must comply with the naming convention.
- column_name
Specifies the name of a column in the foreign table. Columns are separated by commas (,).
Value range: a string. It must comply with the naming convention.
- type_name
Specifies the data type of the column.
Data types supported by tables in ORC format include: SMALLINT, INTEGER, and BIGINT, FLOAT4 (REAL), FLOAT8(DOUBLE PRECISION), DECIMAL[p(,s)] (maximum precision: 38 decimal points), DATE, TIMESTAMP, BOOLEAN, CHAR(n), VARCHAR(n), TEXT(CLOB).
The data types supported by TXT table are the same as those in row-store tables.
- constraint_name
Specifies the name of a constraint for the foreign table.
- { NULL | NOT NULL }
Specifies whether the column allows NULL.
When you create a table, whether the data in HDFS is NULL or NOT NULL cannot be guaranteed. The consistency of data is guaranteed by users. Users must decide whether the column is NULL or NOT NULL. (The optimizer optimizes the NULL/NOT NULL and generates a better plan.)
- SERVER server_name
Specifies the server name of the foreign table. Users can customize its name.
Value range: a string indicating an existing server. It must comply with the naming convention.
- OPTIONS ( { option_name ' value ' } [, ...] ) Specifies the following parameters for a foreign table:
- header
Specifies whether a data file contains a table header. header is available only for CSV files.
If header is on, the first row of the data file will be identified as the header and ignored during export. If header is off, the first row will be identified as a data row.
Value range: true, on, false, and off. The default value is false or off.
- quote
Specifies the quotation mark for the CSV format. The default value is a double quotation mark (").
The quote value cannot be the same as the delimiter or null value.
The quote value must be a single-byte character.
Invisible characters are recommended as quote values, such as 0x07, 0x08, and 0x1b.
- escape
Specifies an escape character for a CSV file. The value must be a single-byte character.
The default value is a double quotation mark ("). If the value is the same as the quote value, it will be replaced with \0.
- location
Specifies the file path on OBS. This is an OBS foreign table parameter. The data sources of multiple buckets are separated by vertical bars (|), for example, LOCATION 'obs://bucket1/folder/ | obs://bucket2/'. The database scans all objects in the specified folders.
- format: format of the data source file in the foreign table. ORC, TEXT, and CSV formats are supported.
- foldername: directory of the data source file in the foreign table, that is, the corresponding file directory in HDFS.
- encoding: encoding of data source files in foreign tables. The default value is utf8. This parameter is optional.
- totalrows: (Optional) estimated number of rows in a table. This parameter is used only for OBS foreign tables. Because OBS may store many files, it is slow to analyze data. This parameter allows you to set an estimated value so that the optimizer can estimate the table size according to the value. Generally, query efficiency is high when the estimated value is close to the actual value.
- filenames: data source files specified in the foreign table. Multiple files are separated by commas (,).
- You are advised to use foldername to specify the locations of data sources.
- An absolute path in foldername should be enclosed with slashes (/). Multiple paths are separated by commas (,).
- When you query a partitioned table, data is pruned based on partition information, and data files that meet the requirement are queried. Pruning involves scanning HDFS directory contents many times. Therefore, do not use columns with low repetition as partition column.
- An OBS foreign table is not supported.
- delimiter
Specifies the column delimiter of data, and uses the default delimiter if it is not set. The default delimiter of TEXT is a tab.
- A delimiter cannot be \r or \n.
- A delimiter cannot be the same as the null parameter.
- A delimiter cannot contain the following characters: \.abcdefghijklmnopqrstuvwxyz0123456789
- The data length of a single row should be less than 1 GB. A row that has many columns using long delimiters cannot contain much valid data.
- You are advised to use a multi-character, such as the combination of the dollar sign ($), caret (^), ampersand (&), or invisible characters, such as 0x07, 0x08, and 0x1b as the delimiter.
- delimiter is available only for TEXT and CSV source data files.
Valid value:
The value of delimiter can be a multi-character delimiter whose length is less than or equal to 10 bytes.
- null Specifies the string that represents a null value.
- The null value cannot be \r or \n. The maximum length is 100 characters.
- The null parameter cannot be the same as the delimiter.
- null is available only for TEXT and CSV source data files.
Valid value:
The default value is \N for the TEXT format.
- noescaping
Specifies in TEXT format, whether to escape the backslash (\) and its following characters.
noescaping is available only for TEXT source data files.
Value range: true, on, false, and off. The default value is false or off.
- fill_missing_fields
Specifies whether to generate an error message when the last column in a row in the source file is lost during data loading.
Value range: true, on, false, and off. The default value is false or off.
- If this parameter is set to true or on and the last column of a data row in a data source file is lost, the column is replaced with NULL and no error message will be generated.
- If this parameter is set to false or off and the last column is missing, the following error information will be displayed:
missing data for column "tt"
- Because SELECT COUNT(*) does not parse columns in TEXT format, it does not report missing columns.
- fill_missing_fields is available only for TEXT and CSV source data files.
- ignore_extra_data
Specifies whether to ignore excessive columns when the number of data source files exceeds the number of foreign table columns. This parameter is available during data import.
Value range: true, on, false, and off. The default value is false or off.
- If this parameter is set to true or on and the number of data source files exceeds the number of foreign table columns, excessive columns will be ignored.
- If this parameter is set to false or off and the number of data source files exceeds the number of foreign table columns, the following error information will be displayed:
extra data after last expected column
- If the newline character at the end of the row is lost, setting the parameter to true will ignore data in the next row.
- Because SELECT COUNT(*) does not parse columns in TEXT format, it does not report missing columns.
- ignore_extra_data is available only for TEXT and CSV source data files.
- date_format
Specifies the DATE format for data import. This syntax is available only for READ ONLY foreign tables.
Value range: any valid DATE value. For details, see Date and Time Processing Functions and Operators.
- If ORACLE is specified as the compatible database, the DATE format is TIMESTAMP. For details, see timestamp_format below.
- date_format is available only for TEXT and CSV source data files.
- time_format
Specifies the TIME format for data import. This syntax is available only for READ ONLY foreign tables.
Value range: a valid TIME value. Time zones cannot be used. For details, see Date and Time Processing Functions and Operators.
time_format is available only for TEXT and CSV source data files.
- timestamp_format
Specifies the TIMESTAMP format for data import. This syntax is available only for READ ONLY foreign tables.
Value range: any valid TIMESTAMP value. Time zones are not supported. For details, see Date and Time Processing Functions and Operators.
timestamp_format is available only for TEXT and CSV source data files.
- smalldatetime_format
Specifies the SMALLDATETIME format for data import. This syntax is available only for READ ONLY foreign tables.
Value range: a valid SMALLDATETIME value. For details, see Date and Time Processing Functions and Operators.
smalldatetime_format is available only for TEXT and CSV source data files.
- checkencoding
Specifies whether to check the character encoding.
Value range: low, high The default value is low.
In TEXT format, the rule of error tolerance for invalid characters imported is as follows:
- \0 is converted to a space.
- Other invalid characters are converted to question marks.
- Setting checkencoding to low enables invalid characters toleration. If NULL and DELIMITER are set to spaces or question marks (?), errors like "illegal chars conversion may confuse null 0x20" will be displayed, prompting you to modify parameters that may cause confusion and preventing importing errors.
In ORC format, the rule of error tolerance for invalid characters imported is as follows:
- If checkencoding is low, an imported field containing invalid characters will be replaced with a quotation mark string of the same length.
- If checkencoding is high, data import stops when an invalid character is detected.
Table 1 Support for TEXT, CSV, and ORC formats Parameter Name
OBS
HDFS
-
TEXT
CSV
ORC
TEXT
CSV
ORC
location
Supported
Supported
Supported
Not supported
Not supported
Not supported
format
Supported
Supported
Supported
Supported
Supported
Supported
header
Not supported
Supported
Not supported
Not supported
Supported
Not supported
delimiter
Supported
Supported
Not supported
Supported
Supported
Not supported
quote
Not supported
Supported
Not supported
Not supported
Supported
Not supported
escape
Not supported
Supported
Not supported
Not supported
Supported
Not supported
null
Supported
Supported
Not supported
Supported
Supported
Not supported
noescaping
Supported
Not supported
Not supported
Supported
Not supported
Not supported
encoding
Supported
Supported
Supported
Supported
Supported
Supported
fill_missing_fields
Supported
Supported
Not supported
Supported
Supported
Not supported
ignore_extra_data
Supported
Supported
Not supported
Supported
Supported
Not supported
date_format
Supported
Supported
Not supported
Supported
Supported
Not supported
time_format
Supported
Supported
Not supported
Supported
Supported
Not supported
timestamp_format
Supported
Supported
Not supported
Supported
Supported
Not supported
smalldatetime_format
Supported
Supported
Not supported
Supported
Supported
Not supported
chunksize
Supported
Supported
Not supported
Supported
Supported
Not supported
filenames
Not supported
Not supported
Not supported
Supported
Supported
Supported
foldername
Supported
Supported
Supported
Supported
Supported
Supported
checkencoding
Supported
Supported
Supported
Supported
Supported
Supported
totalrows
Supported
Supported
Supported
Not supported
Not supported
Not supported
- header
- DISTRIBUTE BY ROUNDROBIN
Specifies ROUNDROBIN as the distribution mode for the HDFS foreign table.
- DISTRIBUTE BY REPLICATION
Specifies REPLICATION as the distribution mode for the HDFS foreign table.
- PARTITION BY ( column_name ) AUTOMAPPED
column_name specifies the partition column. AUTOMAPPED means the partition column specified by the HDFS partitioned foreign table is automatically mapped with the partition directory information in HDFS. The prerequisite is that the sequences of partition columns specified in the HDFS foreign table and in the directory are the same.
Partitioned tables can be used as foreign tables for HDFS, but not for OBS.
- CONSTRAINT constraint_name
Specifies the name of informational constraint of the foreign table.
Value range: a string. It must comply with the naming convention.
- PRIMARY KEY
The primary key constraint specifies that one or more columns of a table must contain unique (non-duplicate) and non-null values. Only one primary key can be specified for a table.
- UNIQUE
Specifies that a group of one or more columns of a table must contain unique values. For the purpose of a unique constraint, NULL is not considered equal.
- NOT ENFORCED
Specifies the constraint to be an informational constraint. This constraint is guaranteed by the user instead of the database.
- ENFORCED
The default value is ENFORCED. ENFORCED is a reserved parameter and is currently not supported.
- PRIMARY KEY (column_name)
Specifies the informational constraint on column_name.
Value range: a string. It must comply with the naming convention, and the value of column_name must exist.
- ENABLE QUERY OPTIMIZATION
Optimizes an execution plan using an informational constraint.
- DISABLE QUERY OPTIMIZATION
Disables the optimization of an execution plan using an informational constraint.
Informational Constraint
In GaussDB(DWS), data is stored in HDFS. GaussDB(DWS) does not support writing data to HDFS. It is the user's responsibility to ensure enforcement of constraints. If the source data is compliant with certain informational constraint requirements, the query of such data can achieve higher efficiency. HDFS foreign tables do not support indexes. Informational constraints are used to optimize query plans.
The constraints of creating informational constraints for an HDFS foreign table are as follows:
- You can create an informational constraint only if the values in a NOT NULL column in your table are unique. Otherwise, the query result will be different from expected.
- Currently, the informational constraints of GaussDB(DWS) support only PRIMARY KEY and UNIQUE constraints.
- The informational constraints of GaussDB(DWS) support only the NOT ENFORCED attribute.
- Both an HDFS foreign table and an HDFS partitioned foreign table supports informational constraint, which is also established in a partitioned column).
- UNIQUE informational constraints can be created for multiple columns in a table, but only one PRIMARY KEY constraint can be created in a table.
- Multiple informational constraints can be established in a column of a table (because the function that establishing a column or multiple constraints in a column is the same.) Therefore, you are not advised to set up multiple informational constraints in a column, and only one Primary Key type can be set up.
- Multi-column combination constraints are not supported.
Examples
Example 1: In HDFS, import the TPC-H benchmark test tables part and region using Hive. The path of the part table is /user/hive/warehouse/partition.db/part_4, and that of the region table is /user/hive/warehouse/mppdb.db/region_orc11_64stripe/.
- Establish HDFS_Server, with HDFS_FDW or DFS_FDW as the foreign data wrapper.
1 2
-- Create HDFS_Server: CREATE SERVER hdfs_server FOREIGN DATA WRAPPER HDFS_FDW OPTIONS (address '10.10.0.100:25000,10.10.0.101:25000',hdfscfgpath '/opt/hadoop_client/HDFS/hadoop/etc/hadoop',type'HDFS');
The IP addresses and port numbers of HDFS NameNodes are specified in OPTIONS. 10.10.0.100:25000,10.10.0.101:25000 indicates the IP addresses and port numbers of the primary and standby HDFS NameNodes. It is the recommended format. Two groups of parameter values are separated by commas (,). Take '10.10.0.100:25000' as an example. In this example, the IP address is 10.10.0.100, and the port number is 25000.
- Create an HDFS foreign table.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
-- Create an HDFS foreign table that does not contain any partition columns. The HDFS server associated with the table is hdfs_server, the corresponding file format of region on the HDFS server is 'orc', and the file directory in the HDFS file system is '/user/hive/warehouse/mppdb. db/region_orc11_64stripe/'. CREATE FOREIGN TABLE region ( R_REGIONKEY INT4, R_NAME TEXT, R_COMMENT TEXT ) SERVER hdfs_server OPTIONS ( FORMAT 'orc', encoding 'utf8', FOLDERNAME '/user/hive/warehouse/mppdb.db/region_orc11_64stripe/' ) DISTRIBUTE BY roundrobin; -- Create an HDFS foreign table that contains partition columns. CREATE FOREIGN TABLE part ( p_partkey int, p_name text, p_mfgr text, p_brand text, p_type text, p_size int, p_container text, p_retailprice float8, p_comment text ) SERVER hdfs_server OPTIONS ( FORMAT 'orc', encoding 'utf8', FOLDERNAME '/user/hive/warehouse/partition.db/part_4' ) DISTRIBUTE BY roundrobin PARTITION BY (p_mfgr) AUTOMAPPED;
GaussDB(DWS) allows you to specify files using the keyword filenames or foldername. The latter is recommended. The key word distribute specifies the storage distribution mode of the region table.
- View the created foreign table.
1 2 3 4 5 6 7 8 9 10 11 12
-- View the foreign table: SELECT * FROM pg_foreign_table WHERE ftrelid='region'::regclass; ftrelid | ftserver | ftwriteonly | ftoptions ---------+----------+-------------+------------------------------------------------------------------------------ 16510 | 16509 | f | {format=orc,foldername=/user/hive/warehouse/mppdb.db/region_orc11_64stripe/} (1 row) select * from pg_foreign_table where ftrelid='part'::regclass; ftrelid | ftserver | ftwriteonly | ftoptions ---------+----------+-------------+------------------------------------------------------------------ 16513 | 16509 | f | {format=orc,foldername=/user/hive/warehouse/partition.db/part_4} (1 row)
- Modify and delete the foreign table.
1 2 3 4 5 6 7 8
-- Modify a foreign table: ALTER FOREIGN TABLE region ALTER r_name TYPE TEXT; ALTER FOREIGN TABLE ALTER FOREIGN TABLE region ALTER r_name SET NOT NULL; ALTER FOREIGN TABLE -- Delete the foreign table: DROP FOREIGN TABLE region; DROP FOREIGN TABLE
Example 2: Operations on an HDFS foreign table that includes informational constraints
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 | -- Create an HDFS foreign table with informational constraints
CREATE FOREIGN TABLE region (
R_REGIONKEY int,
R_NAME TEXT,
R_COMMENT TEXT
, primary key (R_REGIONKEY) not enforced)
SERVER hdfs_server
OPTIONS(format 'orc',
encoding 'utf8',
foldername '/user/hive/warehouse/mppdb.db/region_orc11_64stripe')
DISTRIBUTE BY roundrobin;
-- Check whether the region table has an informational constraint index:
SELECT relname,relhasindex FROM pg_class WHERE oid='region'::regclass;
relname | relhasindex
------------------------+-------------
region | f
(1 row)
SELECT conname, contype, consoft, conopt, conindid, conkey FROM pg_constraint WHERE conname ='region_pkey';
conname | contype | consoft | conopt | conindid | conkey
-------------+---------+---------+--------+----------+--------
region_pkey | p | t | t | 0 | {1}
(1 row)
-- Delete the informational constraint:
ALTER FOREIGN TABLE region DROP CONSTRAINT region_pkey RESTRICT;
SELECT conname, contype, consoft, conindid, conkey FROM pg_constraint WHERE conname ='region_pkey';
conname | contype | consoft | conindid | conkey
---------+---------+---------+----------+--------
(0 rows)
-- Add a unique informational constraint for the foreign table:
ALTER FOREIGN TABLE region ADD CONSTRAINT constr_unique UNIQUE(R_REGIONKEY) NOT ENFORCED;
-- Delete the informational constraint:
ALTER FOREIGN TABLE region DROP CONSTRAINT constr_unique RESTRICT;
SELECT conname, contype, consoft, conindid, conkey FROM pg_constraint WHERE conname ='constr_unique';
conname | contype | consoft | conindid | conkey
---------+---------+---------+----------+--------
(0 rows)
-- Add a unique informational constraint for the foreign table:
ALTER FOREIGN TABLE region ADD CONSTRAINT constr_unique UNIQUE(R_REGIONKEY) NOT ENFORCED disable query optimization;
SELECT relname,relhasindex FROM pg_class WHERE oid='region'::regclass;
relname | relhasindex
------------------------+-------------
region | f
(1 row)
-- Delete the informational constraint:
ALTER FOREIGN TABLE region DROP CONSTRAINT constr_unique CASCADE;
-- Delete the region table:
DROP FOREIGN TABLE region;
-- Delete the hdfs_server server:
DROP SERVER hdfs_server;
|
Example 3: Read data in OBS through a foreign table.
- Create obs_server, with DFS_FDW as the foreign data wrapper.
1 2 3 4 5 6 7
-- Create obs_server: CREATE SERVER obs_server FOREIGN DATA WRAPPER DFS_FDW OPTIONS ( ADDRESS 'obs.abc.com', ACCESS_KEY 'xxxxxxxxx', SECRET_ACCESS_KEY 'yyyyyyyyyyyyy', TYPE 'OBS' );
- ADDRESS indicates the IP address or domain name of OBS. Replace the value as needed. You can find the domain name by searching for the value of regionCode in the region_map file.
- ACCESS_KEY and SECRET_ACCESS_KEY are access keys for the cloud account system. Replace the values as needed.
- TYPE indicates the server type. Retain the value OBS.
- Create an OBS foreign table.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
--- Create the customer_address foreign table that does not contain partition columns. obs_server is the associated OBS server. Files on this server are in .orc format and stored in the user/hive/warehouse/mppdb.db/region_orc11_64stripe1/ directory. CREATE FOREIGN TABLE customer_address ( ca_address_sk integer not null, ca_address_id char(16) not null, ca_street_number char(10) , ca_street_name varchar(60) , ca_street_type char(15) , ca_suite_number char(10) , ca_city varchar(60) , ca_county varchar(30) , ca_state char(2) , ca_zip char(10) , ca_country varchar(20) , ca_gmt_offset decimal(36,33) , ca_location_type char(20) ) SERVER obs_server OPTIONS ( FOLDERNAME '/user/hive/warehouse/mppdb.db/region_orc11_64stripe1/', FORMAT 'ORC', ENCODING 'utf8', TOTALROWS '20' ) DISTRIBUTE BY roundrobin;
- Query data stored in OBS using a foreign table.
1 2 3 4 5 6
-- View the foreign table: SELECT COUNT(*) FROM customer_address; count ------- 20 (1 row)
- Delete the foreign table.
1 2 3
-- Delete the foreign table: DROP FOREIGN TABLE customer_address; DROP FOREIGN TABLE
Helpful Links
Last Article: CREATE FOREIGN TABLE (for OBS Import and Export)
Next Article: CREATE FUNCTION
Did this article solve your problem?
Thank you for your score!Your feedback would help us improve the website.