Updated on 2023-04-28 GMT+08:00

Configuring a Co-deployed Hive Data Source

Scenario

On HSConsole, add a Hive data source that is in the same Hadoop cluster as HetuEngine.

Currently, HetuEngine supports data sources of the following data formats: AVRO, TEXT, RCTEXT, ORC, Parquet, and SequenceFile.

When HetuEngine interconnects with Hive, you cannot specify multiple delimiters during table creation. However, if the MultiDelimitSerDe class is specified as the serialization class for a Hive data source to create a multi-delimiter table in text format, you can query the table using HetuEngine.
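
As a rough illustration of the multi-delimiter case, the sketch below builds the kind of Hive DDL statement the paragraph above refers to. The table name, columns, and delimiter are hypothetical, and the SerDe class path is an assumption based on common Hive 3.x distributions, so verify it against your Hive version. The generated statement would be run on the Hive side (not in HetuEngine), after which the table could be queried from HetuEngine.

```python
# Hypothetical helper: builds a Hive DDL string for a text table whose
# fields are separated by a multi-character delimiter.
# Assumption: the SerDe class path below matches your Hive distribution.
MULTI_DELIMIT_SERDE = "org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe"


def build_multi_delimiter_ddl(table, columns, delimiter):
    """Return a CREATE TABLE statement to be executed in Hive, not HetuEngine."""
    column_list = ", ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE TABLE {table} ({column_list}) "
        f"ROW FORMAT SERDE '{MULTI_DELIMIT_SERDE}' "
        f"WITH SERDEPROPERTIES ('field.delim'='{delimiter}') "
        "STORED AS TEXTFILE"
    )


ddl = build_multi_delimiter_ddl(
    "demo_multi_delim",                   # hypothetical table name
    [("id", "INT"), ("name", "STRING")],  # hypothetical columns
    "|#|",                                # hypothetical multi-character delimiter
)
print(ddl)
```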

Prerequisites

A HetuEngine compute instance has been created.

The HetuEngine service is interconnected to its co-deployed Hive data source by default during its installation. The data source name is hive and cannot be deleted. Some default configurations cannot be modified. You need to restart the HetuEngine service to automatically synchronize these unmodifiable configurations once they are updated.

To use the isolation function of Hive Metastore, you need to configure HIVE_METASTORE_URI_HETU on Hive, restart the HSBroker instance in the HetuEngine service, and update the Hive Metastore URI.

Procedure

  1. Log in to FusionInsight Manager as a HetuEngine administrator and choose Cluster > Services > HetuEngine.
  2. In the Basic Information area on the Dashboard page, click the link next to HSConsole WebUI.
  3. On HSConsole, choose Data Source. Locate the row that contains the target Hive data source, click Edit in the Operation column, and modify the configurations. The following table describes data source configurations that can be modified.

    | Parameter | Description | Example Value |
    |---|---|---|
    | Enable Data Source Authentication | Whether to use the permission policy of the Hive data source for authentication. If Ranger is disabled for the HetuEngine service, select Yes. If Ranger is enabled, select No. | No |
    | yarn-site File | Obtain the file from the Yarn/config directory on the data source client. Upload this file only when the Hudi data source is connected. Do not modify this parameter when configuring the Hive data source. | - |
    | Enable Connection Pool | Whether to enable the connection pool when accessing Hive MetaStore. The default value is Yes. | Yes |
    | Maximum Connections | Maximum number of connections in the connection pool when accessing Hive MetaStore. | 50 (Value range: 20–200) |

  4. (Optional) If you need to add Custom Configuration, complete the configurations by referring to 6.g and click OK to save the configurations.
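
The rules in the configuration table of step 3 can be checked before the values are entered on HSConsole. The following is a minimal sketch under stated assumptions: the function and argument names are hypothetical and do not correspond to any HSConsole API; only the Ranger rule and the 20–200 range come from the table.

```python
# Hypothetical pre-check of values before entering them on HSConsole.
# The argument names are illustrative labels, not an HSConsole API.
MAX_CONNECTIONS_RANGE = (20, 200)  # value range stated in the table


def check_data_source_settings(ranger_enabled, enable_auth, enable_pool, max_connections):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Data source authentication should be Yes only when Ranger is disabled.
    if enable_auth == ranger_enabled:
        expected = "No" if ranger_enabled else "Yes"
        problems.append(f"Enable Data Source Authentication should be {expected}")
    low, high = MAX_CONNECTIONS_RANGE
    # The connection limit only matters when the pool is enabled.
    if enable_pool and not (low <= max_connections <= high):
        problems.append(f"Maximum Connections must be between {low} and {high}")
    return problems


# Ranger enabled, authentication set to No, pool enabled with 50 connections.
print(check_data_source_settings(True, False, True, 50))
```

An empty list indicates that the chosen values agree with the table; any other result names the setting to revisit.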

Data Type Mapping

Currently, Hive data sources support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, REAL, DOUBLE, DECIMAL, NUMERIC, DEC, VARCHAR, VARCHAR(X), CHAR, CHAR(X), STRING, DATE, TIMESTAMP, TIME WITH TIME ZONE, TIMESTAMP WITH TIME ZONE, TIME, ARRAY, MAP, UNIONTYPE, STRUCT, and ROW.

Performance Optimization

  • Metadata caching

    Hive connectors support metadata caching, which lets metadata requests for various operations be served faster. For details, see Adjusting Metadata Cache.

  • Cost-based Optimization (CBO)

    Periodically running the ANALYZE command to collect table statistics gives the optimizer the data it needs to perform CBO for Hive connectors.

  • Dynamic filtering

    Enabling dynamic filtering helps optimize the calculation of the Join operator of Hive connectors. For details, see Enabling Dynamic Filtering.

  • Query with partition conditions

    Creating a partitioned table and querying it with partition filter criteria lets the engine skip partitions that do not match, which improves performance.

  • INSERT statement optimization

    You can improve insert performance by setting task.writer-count to 1 and choosing a larger value for hive.max-partitions-per-writers. For details, see Optimizing INSERT Statements.
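
The statements and settings behind the bullets above might look like the following sketch. The schema, table, and partition column names are hypothetical, the ANALYZE form assumes a Presto-style syntax that should be verified against the HetuEngine SQL reference, and the value chosen for hive.max-partitions-per-writers is purely illustrative.

```python
# Hypothetical names: schema "sales", table "orders", partition column "dt".
def analyze_statement(schema, table, catalog="hive"):
    """Statement that collects table statistics for the cost-based optimizer."""
    return f"ANALYZE {catalog}.{schema}.{table}"


def partition_filtered_query(schema, table, partition_column, value, catalog="hive"):
    """Query restricted to one partition so other partitions can be skipped."""
    return (
        f"SELECT * FROM {catalog}.{schema}.{table} "
        f"WHERE {partition_column} = '{value}'"
    )


# INSERT tuning properties named in the text; the concrete number for
# hive.max-partitions-per-writers is an illustrative assumption.
insert_tuning = {
    "task.writer-count": 1,
    "hive.max-partitions-per-writers": 500,
}

print(analyze_statement("sales", "orders"))
print(partition_filtered_query("sales", "orders", "dt", "2023-04-28"))
for key, value in insert_tuning.items():
    print(f"{key}={value}")
```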

Constraints

  • The DELETE syntax can be used to delete data from an entire table or a specified partition in a partitioned table. For a transaction table (the attribute transactional is set to true), if the WHERE condition is specified, the rows that match the condition are deleted.
  • The Hive metastore does not support schema renaming, that is, the ALTER SCHEMA RENAME syntax is not supported.
  • ACID transactions are supported only for tables in Hive 3.x or later.
  • Only transaction tables in ORC format are supported.
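
One possible reading of the DELETE constraints above is encoded in the sketch below. It is a hypothetical model for reasoning about which DELETE forms apply to a table, not HetuEngine code; the function name, arguments, and returned labels are all assumptions.

```python
# Hypothetical model of the DELETE constraints listed above; it does not
# call HetuEngine and only encodes the rules as stated in the list.
def delete_scope(transactional, file_format, has_where, where_is_partition_only=False):
    """Return what a DELETE would remove, or 'not supported'."""
    # Transactional tables are supported only in ORC format.
    if transactional and file_format.upper() != "ORC":
        return "not supported"
    if not has_where:
        return "entire table"
    if where_is_partition_only:
        return "specified partition"
    # Row-level deletes with a WHERE condition require a transactional table.
    if transactional:
        return "matching rows"
    return "not supported"


print(delete_scope(transactional=True, file_format="orc", has_where=True))
print(delete_scope(transactional=False, file_format="parquet", has_where=True))
```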