Configuring a Co-deployed Hive Data Source
Scenario
On HSConsole, add a Hive data source that resides in the same Hadoop cluster as HetuEngine.
- Currently, HetuEngine supports data sources of the following data formats: AVRO, TEXT, RCTEXT, ORC, Parquet, and SequenceFile.
- When HetuEngine interconnects with Hive, multiple delimiters cannot be specified during table creation. However, if a multi-delimiter table in text format is created on the Hive data source with the MultiDelimitSerDe class specified as the serialization class, the table can be queried through HetuEngine.
- The Hive data source interconnected with HetuEngine supports Hudi table redirection. Hudi table access requests are redirected to the Hudi connector, so the advanced functions of the Hudi connector are available. To use this function, you need to configure the target Hudi data source, ensure that the Metastore URL of the Hudi data source is the same as that of the current Hive data source, and enable Hudi redirection for the Hive data source.
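As a sketch of the multi-delimiter case described above, a text-format Hive table using the MultiDelimitSerDe serialization class could be created on the Hive side as follows. The table and column names and the delimiter string are hypothetical, and the fully qualified class path of MultiDelimitSerDe may differ between Hive versions (it moved out of the `contrib` package in newer releases):

```sql
-- Hypothetical example: a text-format table with a multi-character field
-- delimiter, using MultiDelimitSerDe as the serialization class.
CREATE TABLE test_multi_delim (
  id   INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim" = "|#|")
STORED AS TEXTFILE;

-- Once created in Hive, the table can be queried through HetuEngine:
SELECT id, name FROM test_multi_delim;
```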
During HetuEngine installation, the co-deployed Hive data source is interconnected by default. The data source is named hive and cannot be deleted. Some default configurations, such as the data source name, data source type, server principal, and client principal, cannot be modified. If the environment configuration changes (for example, the local domain name of the cluster is changed), restart the HetuEngine service to automatically synchronize the configurations of the co-deployed Hive data source, such as the server principal and client principal.
Prerequisites
- A HetuEngine compute instance has been created.
- To use the isolation function of Hive Metastore, you need to configure HIVE_METASTORE_URI_HETU on Hive and restart the HSBroker instances of the HetuEngine service to update the Hive Metastore URI.
Procedure
- Log in to FusionInsight Manager as a HetuEngine administrator and choose Cluster > Services > HetuEngine.
- In the Basic Information area on the Dashboard page, click the link next to HSConsole WebUI.
- On HSConsole, choose Data Source. Locate the row that contains the target Hive data source, click Edit in the Operation column, and modify the configurations. The following table describes data source configurations that can be modified.
Parameter: Enable Data Source Authentication
Description: Whether to use the permission policy of the Hive data source for authentication. After this function is enabled, HetuEngine uses SQL standard-based Hive authorization.
- Clusters with Kerberos authentication disabled (normal mode): HetuEngine uses the default Hive authorization. This parameter is unavailable.
- Clusters with Kerberos authentication enabled (security mode): When Ranger is enabled, Ranger authentication is added on the basis of SQL standard-based Hive authorization. When Ranger is disabled, HetuEngine uses only SQL standard-based Hive authorization.
Example Value: No

Parameter: Hudi Redirection
Description: Available only when the Metastore URL of the target Hudi data source is the same as that of the current Hive data source. This function redirects Hudi table access requests to the Hudi connector, so the advanced functions of the Hudi connector can be used.
Example Value: No

Parameter: Hudi Data Source
Description: Required when Hudi redirection is enabled. All configured Hudi data sources are displayed in the drop-down list. Select only a Hudi data source whose Metastore URL is the same as that of the current Hive data source.
Example Value: -

Parameter: Enable Connection Pool
Description: Whether to enable the connection pool for access to Hive MetaStore. The default value is Yes.
Example Value: Yes

Parameter: Maximum Connections
Description: Maximum number of connections in the connection pool for access to Hive MetaStore. Value range: 20–200.
Example Value: 50
- (Optional) If you need to add Custom Configuration, complete the configurations by referring to 6.g and click OK to save the configurations.
Data Type Mapping
Currently, Hive data sources support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, REAL, DOUBLE, DECIMAL, NUMERIC, DEC, VARCHAR, VARCHAR(X), CHAR, CHAR(X), STRING, DATE, TIMESTAMP, TIME WITH TIME ZONE, TIMESTAMP WITH TIME ZONE, TIME, ARRAY, MAP, STRUCT, and ROW.
Performance Optimization
- Metadata caching
Hive connectors support metadata caching, which serves metadata requests for various operations faster. For details, see Adjusting Metadata Cache.
- Dynamic filtering
Enabling dynamic filtering helps optimize the calculation of the Join operator of Hive connectors. For details, see Enabling Dynamic Filtering.
- Query with partition conditions
Creating a partitioned table and querying data with partition filter criteria help filter out some partition data, improving performance.
- INSERT statement optimization
You can improve insert performance by setting task.writer-count to 1 and choosing a larger value for hive.max-partitions-per-writers. For details, see Optimizing INSERT Statements.
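The last two optimizations above can be sketched as follows. This is an illustrative sketch only: the table and column names are hypothetical, and the property names are the ones given in this section (task.writer-count and hive.max-partitions-per-writers); where exactly they are set may vary by deployment.

```sql
-- Hypothetical partitioned table: a partition filter in the WHERE clause
-- prunes non-matching partitions, so less data is scanned.
SELECT order_id, amount
FROM orders_partitioned
WHERE dt = '2023-01-01';   -- dt is the partition column
```

For the INSERT optimization, the corresponding configuration properties (a sketch; values for illustration only):

```
# Reduce writer concurrency for INSERT, as described above.
task.writer-count=1
# Allow each writer to handle more partitions (Hive connector property).
hive.max-partitions-per-writers=200
```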
Constraints
- The DELETE syntax can delete data only from an entire table or from specified partitions of a partitioned table.
- The Hive metabase does not support schema renaming, that is, the ALTER SCHEMA RENAME syntax is not supported.
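The DELETE constraint above can be illustrated as follows; the table and partition-column names are hypothetical. Per the constraint, the predicate must select whole partitions rather than arbitrary rows:

```sql
-- Hypothetical example: delete a whole partition from a partitioned table
-- (dt is the partition column, so this removes the entire partition).
DELETE FROM orders_partitioned WHERE dt = '2023-01-01';

-- Delete all data from an entire (non-partitioned) table:
DELETE FROM orders_all;
```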