Configuring a Co-deployed Hive Data Source

Scenario

On HSConsole, add a Hive data source that resides in the same Hadoop cluster as HetuEngine.

Currently, HetuEngine supports data sources of the following data formats: AVRO, TEXT, RCTEXT, ORC, Parquet, and SequenceFile.

When HetuEngine interconnects with Hive, you cannot specify multiple delimiters during table creation. However, if the MultiDelimitSerDe class is specified as the serialization class for a Hive data source to create a multi-delimiter table in text format, you can query the table using HetuEngine.
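
As a rough illustration of the multi-delimiter case, the following Python sketch assembles a Hive DDL statement of the kind that would be run on the Hive side (for example, in Beeline) rather than in HetuEngine. The table name, columns, and delimiter are hypothetical, and the SerDe class path is an assumption based on the Hive contrib package; it may differ in your Hive version.

```python
# Sketch only: builds a Hive DDL string; it does not connect to any cluster.
# Assumption: the SerDe class path below matches your Hive version.
MULTI_DELIMIT_SERDE = "org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe"


def build_multi_delimiter_ddl(table, columns, delimiter):
    """Return a CREATE TABLE statement for a text table with a multi-character delimiter."""
    if len(delimiter) < 2:
        raise ValueError("a multi-delimiter table needs a delimiter of two or more characters")
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} ({column_list}) "
        f"ROW FORMAT SERDE '{MULTI_DELIMIT_SERDE}' "
        f"WITH SERDEPROPERTIES ('field.delim'='{delimiter}') "
        "STORED AS TEXTFILE"
    )


# Hypothetical table and columns, for illustration only.
ddl = build_multi_delimiter_ddl("demo_multi", [("id", "INT"), ("name", "STRING")], "|@|")
print(ddl)
```

After a table like this has been created in Hive, it can be queried through HetuEngine like any other text-format table.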

The Hive data source interconnected with HetuEngine supports Hudi table redirection. This function is available in MRS 3.3.0 or later. Hudi table access requests are redirected to the Hudi connector, so the advanced functions of the Hudi connector are available. To use this function, configure the target Hudi data source, ensure that the Metastore URL of the Hudi data source is the same as that of the current Hive data source, and enable Hudi redirection for the Hive data source.

Prerequisites

A HetuEngine compute instance has been created.

The HetuEngine service is interconnected to its co-deployed Hive data source by default during its installation. The data source name is hive and cannot be deleted. Some default configurations cannot be modified. You need to restart the HetuEngine service to automatically synchronize these unmodifiable configurations once they are updated.

To use the isolation function of Hive Metastore, you need to configure HIVE_METASTORE_URI_HETU on Hive, restart the HSBroker instance in the HetuEngine service, and update the Hive Metastore URI.

Procedure

Log in to FusionInsight Manager as a HetuEngine administrator and choose Cluster > Services > HetuEngine.
In the Basic Information area on the Dashboard page, click the link next to HSConsole WebUI.

On HSConsole, choose Data Source. Locate the row that contains the target Hive data source, click Edit in the Operation column, and modify the configurations. The following table describes data source configurations that can be modified.

Parameter	Description	Example Value
Enable Data Source Authentication	Whether to use the permission policy of the Hive data source for authentication. If Ranger is disabled for the HetuEngine service, select Yes. If Ranger is enabled, select No.	No
yarn-site File	Obtain the file from the Yarn/config directory on the data source client. Upload this file only when the Hudi data source is connected. Do not modify this parameter when configuring the Hive data source.	-
Hudi Redirection (available for MRS 3.3.0 or later)	This parameter is available only when the Metastore URL of the target Hudi data source is the same as that of the current Hive data source. This function redirects Hudi table access requests to the Hudi connector, so the advanced functions of the Hudi connector can be used.	No
Hudi Data Source (available for MRS 3.3.0 or later)	This parameter is required for Hudi redirection. All configured Hudi data sources are displayed in the drop-down list box. Select only the Hudi data source that has the same Metastore URL.	-
Enable Connection Pool	Whether to enable the connection pool when accessing Hive MetaStore. The default value is Yes.	Yes
Maximum Connections	Maximum number of connections in the connection pool when accessing Hive MetaStore.	50 (Value range: 20–200)

(Optional) To add Custom Configuration items, complete them by referring to 6.g, and then click OK to save the settings.
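
The constraints in the parameter table above can be sketched as a small pre-check that you might run on planned values before entering them on HSConsole. This is an illustrative sketch only: the dictionary keys, the Metastore URLs, and the data source names are hypothetical and are not HSConsole field identifiers or an HSConsole API.

```python
# Sketch only: mirrors the constraints stated in the parameter table.
# All key names below are hypothetical, chosen for this example.
MIN_CONNECTIONS, MAX_CONNECTIONS = 20, 200


def check_hive_source_settings(settings, hudi_metastore_urls):
    """Return a list of problems found in the planned settings (empty if none)."""
    problems = []
    max_conn = settings.get("maximum_connections")
    if max_conn is not None and not MIN_CONNECTIONS <= max_conn <= MAX_CONNECTIONS:
        problems.append(
            f"maximum_connections {max_conn} is outside {MIN_CONNECTIONS} to {MAX_CONNECTIONS}"
        )
    if settings.get("hudi_redirection", False):
        hudi_source = settings.get("hudi_data_source")
        if hudi_source is None:
            problems.append("hudi_data_source is required when hudi_redirection is enabled")
        elif hudi_metastore_urls.get(hudi_source) != settings.get("metastore_url"):
            problems.append("the Hudi data source must use the same Metastore URL as the Hive data source")
    return problems


# Hypothetical values, for illustration only.
hudi_sources = {"hudi_a": "thrift://metastore-1:9083", "hudi_b": "thrift://metastore-2:9083"}
good = {
    "metastore_url": "thrift://metastore-1:9083",
    "maximum_connections": 50,
    "hudi_redirection": True,
    "hudi_data_source": "hudi_a",
}
bad = dict(good, maximum_connections=500, hudi_data_source="hudi_b")

print(check_hive_source_settings(good, hudi_sources))
print(check_hive_source_settings(bad, hudi_sources))
```

The first call reports no problems; the second reports both the out-of-range connection count and the mismatched Metastore URL.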

Data Type Mapping

Currently, Hive data sources support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, REAL, DOUBLE, DECIMAL, NUMERIC, DEC, VARCHAR, VARCHAR (X), CHAR, CHAR (X), STRING, DATE, TIMESTAMP, TIME WITH TIME ZONE, TIMESTAMP WITH TIME ZONE, TIME, ARRAY, MAP, UNIONTYPE, STRUCT, and ROW.

Performance Optimization

Metadata caching
Hive connectors support metadata caching so that metadata requests for various operations are served faster. For details, see Adjusting Metadata Cache.
Cost-based Optimization (CBO)
Periodically running the ANALYZE command to collect table statistics gives the optimizer the information it needs to perform cost-based optimization for Hive connectors.
Dynamic filtering
Enabling dynamic filtering helps optimize the calculation of the Join operator of Hive connectors. For details, see Enabling Dynamic Filtering.
Query with partition conditions
Creating a partitioned table and querying it with partition filter conditions lets HetuEngine skip partitions that do not match, which improves performance.
INSERT statement optimization
You can improve insert performance by setting task.writer-count to 1 and choosing a larger value for hive.max-partitions-per-writers. For details, see Optimizing INSERT Statements.
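
Two of the items in the list above, statistics collection and partition filtering, can be sketched as the SQL text a client might submit. This is a minimal sketch: the table and column names are hypothetical, and the statements assume the plain ANALYZE and SELECT forms; check the HetuEngine SQL syntax reference for the exact options your version accepts.

```python
# Sketch only: builds SQL strings; it does not connect to HetuEngine.
def analyze_statement(table):
    """Return a statement that collects statistics for the whole table."""
    return f"ANALYZE {table}"


def partition_query(table, columns, partition_column, partition_value):
    """Return a SELECT that filters on a partition column so other partitions can be skipped."""
    column_list = ", ".join(columns)
    escaped = str(partition_value).replace("'", "''")  # escape single quotes in the literal
    return f"SELECT {column_list} FROM {table} WHERE {partition_column} = '{escaped}'"


# Hypothetical table partitioned by dt, for illustration only.
print(analyze_statement("hive.demo.sales"))
print(partition_query("hive.demo.sales", ["id", "amount"], "dt", "2024-01-01"))
```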

Constraints

The DELETE syntax can be used to delete data from an entire table or a specified partition in a partitioned table.
The Hive metastore does not support schema renaming; that is, the ALTER SCHEMA RENAME syntax is not supported.
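
The DELETE constraint can be illustrated with a small helper that only produces the two forms described above: a whole-table delete or a delete limited to a partition. This sketch assumes the constraint means that DELETE predicates are restricted to partition columns; the table, column names, and helper function are hypothetical.

```python
# Sketch only: builds DELETE statements; it does not connect to HetuEngine.
def build_delete(table, partition_spec=None, partition_columns=()):
    """Return a DELETE for the whole table, or for one partition if partition_spec is given."""
    if not partition_spec:
        return f"DELETE FROM {table}"
    unknown = [col for col in partition_spec if col not in partition_columns]
    if unknown:
        # Assumption: predicates on non-partition columns are outside the supported forms.
        raise ValueError(f"not partition columns: {', '.join(unknown)}")
    predicate = " AND ".join(f"{col} = '{val}'" for col, val in partition_spec.items())
    return f"DELETE FROM {table} WHERE {predicate}"


# Hypothetical table partitioned by dt, for illustration only.
print(build_delete("hive.demo.sales"))
print(build_delete("hive.demo.sales", {"dt": "2024-01-01"}, partition_columns=("dt",)))
```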