Creating a Hive Catalog
Introduction
Catalogs provide metadata, such as databases, tables, partitions, views, and functions, as well as the information needed to access data stored in a database or other external systems.
One of the most crucial aspects of data processing is managing metadata. It may be transient metadata, like temporary tables or UDFs registered against the table environment, or permanent metadata, like that in a Hive Metastore. Catalogs provide a unified API for managing metadata and making it accessible from the Table API and SQL queries. For details, see Apache Flink Catalogs.
Function
The HiveCatalog serves two purposes: as persistent storage for pure Flink metadata, and as an interface for reading and writing existing Hive metadata.
Flink's Hive documentation provides full details on setting up the catalog and interfacing with an existing Hive installation. For details, see Apache Flink Hive Catalog.
HiveCatalog can be used to handle two kinds of tables: Hive-compatible tables and generic tables.
- Hive-compatible tables are those stored in a Hive-compatible way, in terms of both metadata and data in the storage layer. Therefore, Hive-compatible tables created via Flink can be queried from the Hive side.
- Generic tables, on the other hand, are specific to Flink. When creating generic tables with HiveCatalog, Flink only uses the Hive Metastore (HMS) to persist the metadata. While these tables are visible to Hive, Hive is unlikely to understand the metadata, so using such tables in Hive leads to undefined behavior.
You are advised to switch to the Hive dialect to create Hive-compatible tables. If you want to create Hive-compatible tables with the default dialect, make sure to set 'connector'='hive' in your table properties; otherwise, a table is considered generic by default in HiveCatalog. Note that the connector property is not required if you use the Hive dialect. Refer to Hive Dialect.
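For illustration, a minimal sketch of both approaches (the table and column names are hypothetical; the dialect switch syntax shown is for the SQL client):

```sql
-- Default dialect: mark the table as Hive-compatible explicitly.
CREATE TABLE hive_compatible_tbl (
  user_id STRING,
  amount INT
) WITH (
  'connector' = 'hive'
);

-- Hive dialect: tables are Hive-compatible by default, so no connector property is needed.
SET 'table.sql-dialect' = 'hive';
CREATE TABLE hive_dialect_tbl (
  user_id STRING,
  amount INT
);
SET 'table.sql-dialect' = 'default';
```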
Caveats
- Warning: The Hive Metastore stores all meta-object names in lower case.
- If a directory with the same name already exists, an exception is thrown.
- To use Hudi tables, you need to use the Hudi catalog, which is not compatible with the Hive catalog.
- When you create a Flink OpenSource SQL job, set Flink Version to 1.15 in the Running Parameters tab. Select Save Job Log, and specify the OBS bucket for saving job logs.
Syntax
```sql
CREATE CATALOG myhive WITH (
  'type' = 'hive',
  'default-database' = 'default',
  'hive-conf-dir' = '/opt/flink/conf'
);
USE CATALOG myhive;
```
Parameter Description
Parameter | Mandatory | Default Value | Data Type | Description
---|---|---|---|---
type | Yes | None | String | Catalog type. Set this parameter to hive when creating a HiveCatalog.
hive-conf-dir | Yes | None | String | URI of the directory containing the hive-site.xml file. The value is fixed at '/opt/flink/conf'.
default-database | No | default | String | The database to use as the current database when the catalog is set as the current catalog.
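For example, a hedged sketch of how default-database takes effect (demo_db is a hypothetical database that must already exist in the Hive Metastore):

```sql
CREATE CATALOG myhive WITH (
  'type' = 'hive',
  'default-database' = 'demo_db',  -- hypothetical; must already exist in the Hive Metastore
  'hive-conf-dir' = '/opt/flink/conf'
);
USE CATALOG myhive;
SHOW CURRENT DATABASE;             -- expected to print demo_db
```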
Supported Types
The HiveCatalog supports all Flink data types for generic tables.
For Hive-compatible tables, the HiveCatalog needs to map Flink data types to their corresponding Hive types.
Flink SQL Type | Hive Data Type
---|---
CHAR(p) | CHAR(p)
VARCHAR(p) | VARCHAR(p)
STRING | STRING
BOOLEAN | BOOLEAN
TINYINT | TINYINT
SMALLINT | SMALLINT
INT | INT
BIGINT | LONG
FLOAT | FLOAT
DOUBLE | DOUBLE
DECIMAL(p, s) | DECIMAL(p, s)
DATE | DATE
TIMESTAMP(9) | TIMESTAMP
BYTES | BINARY
ARRAY<T> | LIST<T>
MAP | MAP
ROW | STRUCT
- The maximum length for Hive's CHAR(p) is 255.
- The maximum length for Hive's VARCHAR(p) is 65535.
- Hive's MAP only supports primitive types as keys, while Flink's MAP can be any data type.
- Hive does not support UNION types.
- Hive's TIMESTAMP always has a precision of 9 and does not support other precisions. However, Hive UDFs can handle TIMESTAMP values with precision <= 9.
- Hive does not support Flink's TIMESTAMP_WITH_TIME_ZONE, TIMESTAMP_WITH_LOCAL_TIME_ZONE, and MULTISET.
- Flink's INTERVAL type cannot yet be mapped to Hive's INTERVAL type.
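As an illustration of this mapping, a minimal sketch (the table and column names are hypothetical) that creates a Hive-compatible table; the comments show the Hive type each Flink type maps to per the table above:

```sql
CREATE TABLE type_mapping_demo (   -- hypothetical table name
  id      BIGINT,                  -- maps to Hive LONG
  name    VARCHAR(100),            -- maps to Hive VARCHAR(100); Hive's limit is 65535
  balance DECIMAL(10, 2),          -- maps to Hive DECIMAL(10, 2)
  tags    ARRAY<STRING>,           -- maps to Hive LIST<STRING>
  attrs   MAP<STRING, INT>,        -- Hive MAP keys must be primitive types
  created TIMESTAMP(9)             -- maps to Hive TIMESTAMP (precision is always 9)
) WITH (
  'connector' = 'hive'             -- makes the table Hive-compatible under the default dialect
);
```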
Example
- Create a catalog named myhive in the Flink OpenSource SQL job and use it to manage metadata.
```sql
CREATE CATALOG myhive WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/opt/flink/conf'
);
USE CATALOG myhive;

CREATE TABLE dataGenSource (
  user_id STRING,
  amount INT
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '1',          -- Generates one record per second.
  'fields.user_id.kind' = 'random', -- Specifies a random generator for the user_id field.
  'fields.user_id.length' = '3'     -- Limits the length of user_id to 3.
);

CREATE TABLE printSink (
  user_id STRING,
  amount INT
) WITH (
  'connector' = 'print'
);

INSERT INTO printSink SELECT * FROM dataGenSource;
```
- Check if the dataGenSource and printSink tables exist in the default database.
The Hive Metastore stores all meta-object names in lower case.
Figure 1 Checking the default database
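You can also check from a Flink SQL job; a minimal sketch, assuming the myhive catalog from the previous step:

```sql
USE CATALOG myhive;
SHOW TABLES;  -- expected to list datagensource and printsink (HMS lower-cases the names)
```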
- Create a new Flink OpenSource SQL job using the metadata in the catalog named myhive.
```sql
CREATE CATALOG myhive WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/opt/flink/conf'
);
USE CATALOG myhive;

INSERT INTO printSink SELECT * FROM dataGenSource;
```