Configuring an Independently Deployed Hive Data Source

Scenario

This section describes how to add a Hive data source deployed outside the cluster on HSConsole.

  • Currently, HetuEngine supports data sources of the following traditional data formats: AVRO, TEXT, RCTEXT, ORC, PARQUET, and SEQUENCEFILE.
  • When HetuEngine interconnects with Hive, you cannot specify multiple delimiters during table creation. However, if a multi-delimiter table in text format is created on the Hive side with the MultiDelimitSerDe class specified as its serialization class, you can query the table using HetuEngine (see the example after this list).
  • The Hive data source interconnected with HetuEngine supports Hudi table redirection. Hudi table access requests are redirected to the Hudi connector, so the advanced functions of the Hudi connector are available. To use this function, you need to configure the target Hudi data source, ensure that the Metastore URL of the Hudi data source is the same as that of the current Hive data source, and enable Hudi redirection for the Hive data source.
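
    For reference, a minimal Hive-side table definition of this kind might look as follows. The table name, columns, and delimiter are illustrative, and the fully qualified package of the MultiDelimitSerDe class varies with the Hive version, so check the class available in your cluster:

    -- Executed in Hive; HetuEngine can then query this multi-delimiter text table.
    CREATE TABLE test_multi_delim (
      id INT,
      name STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
    WITH SERDEPROPERTIES ("field.delim" = "|@|")
    STORED AS TEXTFILE;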

Prerequisites

  • The domain name of the cluster where the data source is located must be different from the HetuEngine cluster domain name.
  • The cluster where the data source is located and the HetuEngine cluster nodes can communicate with each other.
  • In the /etc/hosts file of all nodes in the cluster where HetuEngine is located, add mappings between the host names and IP addresses of the cluster where the data source to be connected is located, and add an entry in the format IP address hadoop.System domain name, for example, 10.10.10.10 hadoop.hadoop.com (sample entries follow this list). Otherwise, HetuEngine cannot connect to the nodes outside its cluster by host name.
  • A HetuEngine compute instance has been created.
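
    As noted in the /etc/hosts prerequisite above, sample entries on a HetuEngine node might look as follows (the IP addresses and host names are illustrative):

    # Nodes of the cluster where the Hive data source is located
    10.10.10.11 node-master1
    10.10.10.12 node-core1
    # System domain name mapping of the data source cluster
    10.10.10.10 hadoop.hadoop.com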

Procedure

  1. Obtain the hdfs-site.xml and core-site.xml configuration files of the Hive data source cluster.

    1. Log in to FusionInsight Manager of the cluster where the Hive data source is located.
    2. In the upper right corner of the homepage, click Download Client to download the complete client to the local PC as prompted.
    3. Decompress the downloaded client file package and obtain the core-site.xml and hdfs-site.xml files in the FusionInsight_Cluster_1_Services_ClientConfig/HDFS/config directory.
    4. Check whether the core-site.xml file contains the fs.trash.interval configuration item. If not, add the following configuration item:
      <property>
          <name>fs.trash.interval</name>
          <value>2880</value>
      </property>
    5. In the hdfs-site.xml file, change the value of dfs.client.failover.proxy.provider.NameService name (where NameService name is the name of the HDFS NameService) to org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.
      For example, if the NameService name is hacluster, the configuration is as follows:
      <property>
          <name>dfs.client.failover.proxy.provider.hacluster</name>
          <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>

      If the Hive data source to be interconnected is in the same Hadoop cluster as HetuEngine, you can log in to the HDFS client and run the following commands to obtain the hdfs-site.xml and core-site.xml configuration files. For details, see Using the HDFS Client.

      hdfs dfs -get /user/hetuserver/fiber/restcatalog/hive/core-site.xml

      hdfs dfs -get /user/hetuserver/fiber/restcatalog/hive/hdfs-site.xml

  2. Obtain the user.keytab and krb5.conf files of the proxy user of the Hive data source.

    1. Log in to FusionInsight Manager of the cluster where the Hive data source is located.
    2. Choose System > Permission > User.
    3. Locate the row that contains the target data source user, click More in the Operation column, and select Download Authentication Credential.
    4. Decompress the downloaded package to obtain the user.keytab and krb5.conf files.

      The proxy user of the Hive data source must be associated with at least the hive user group.

  3. Obtain the MetaStore URL and the Principal of the server.

    1. Decompress the client package of the cluster where the Hive data source is located and obtain the hive-site.xml file from the FusionInsight_Cluster_1_Services_ClientConfig/Hive/config directory.
    2. Open the hive-site.xml file and search for hive.metastore.uris; its value is the MetaStore URL. Then search for hive.server2.authentication.kerberos.principal; its value is the server principal.
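
      The relevant entries in the hive-site.xml file look similar to the following (the values shown here reuse the example values from this section):

      <property>
          <name>hive.metastore.uris</name>
          <value>thrift://10.92.8.42:21088,thrift://10.92.8.43:21088,thrift://10.92.8.44:21088</value>
      </property>
      <property>
          <name>hive.server2.authentication.kerberos.principal</name>
          <value>hive/hadoop.hadoop.com@HADOOP.COM</value>
      </property>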

  4. Log in to FusionInsight Manager as a HetuEngine administrator and choose Cluster > Services > HetuEngine. The HetuEngine service page is displayed.
  5. In the Basic Information area on the Dashboard page, click the link next to HSConsole WebUI. The HSConsole page is displayed.
  6. Choose Data Source and click Add Data Source. Configure parameters on the Add Data Source page.

    1. In the Basic Configuration area, configure Name and choose Hive for Data Source Type.
    2. Configure parameters in the Hive Configuration area. For details, see Table 1.
      Table 1 Hive Configuration

      • Driver
        Description: The default value is fi-hive-hadoop.
        Example value: fi-hive-hadoop

      • hdfs-site File
        Description: Select the hdfs-site.xml configuration file obtained in 1. The file name is fixed.
        Example value: -

      • core-site File
        Description: Select the core-site.xml configuration file obtained in 1. The file name is fixed.
        Example value: -

      • yarn-site File
        Description: Obtain the file from the Yarn/config directory on the data source client. Upload this file only when the Hudi data source is connected.
        Example value: -

      • krb5 File
        Description: Configure this parameter when the security mode is enabled. It is the configuration file used for Kerberos authentication. Select the krb5.conf file obtained in 2.
        Example value: krb5.conf

      • Enable Data Source Authentication
        Description: Whether to use the permission policy of the Hive data source for authentication. If Ranger is disabled for the HetuEngine service, select Yes. If Ranger is enabled, select No.
        Example value: No

    3. In the Metastore Configuration area, configure the parameters according to Table 2.
      Table 2 MetaStore Configuration

      • Metastore URL
        Description: URL of the MetaStore of the data source. For details, see 3.
        Example value: thrift://10.92.8.42:21088,thrift://10.92.8.43:21088,thrift://10.92.8.44:21088

      • Hudi Redirection
        Description: Available only when the Metastore URL of the target Hudi data source is the same as that of the current Hive data source. This function redirects Hudi table access requests to the Hudi connector so that the advanced functions of the Hudi connector can be used.
        Example value: No

      • Hudi Data Source
        Description: Required when Hudi redirection is enabled. All configured Hudi data sources are displayed in the drop-down list box; select only a Hudi data source whose Metastore URL is the same as that of the current Hive data source.
        Example value: -

      • Security Authentication Mechanism
        Description: After the security mode is enabled, the default value is KERBEROS.
        Example value: KERBEROS

      • Server Principal
        Description: Configure this parameter when the security mode is enabled. Set it to the value of hive.server2.authentication.kerberos.principal in the hive-site.xml file on the data source client. It specifies the username with domain name used by meta to access MetaStore. For details, see 3.
        Example value: hive/hadoop.hadoop.com@HADOOP.COM

      • Client Principal
        Description: Configure this parameter when the security mode is enabled. The format is Username for accessing MetaStore@System domain name (uppercase), where the username for accessing MetaStore is the user to which the user.keytab file obtained in 2 belongs.
        NOTE: You can log in to FusionInsight Manager, choose System > Permission > Domain and Mutual Trust, and view the value of Local Domain, which is the current system domain name, for example, HADOOP.COM.
        Example value: admintest@HADOOP.COM

      • Keytab File
        Description: Configure this parameter when the security mode is enabled. It specifies the keytab credential file of the MetaStore username. The file name is fixed. Select the user.keytab file obtained in 2.
        Example value: user.keytab

    4. Configure parameters in the Connection Pool Configuration area. For details, see Table 3.
      Table 3 Connection Pool Configuration

      • Enable Connection Pool
        Description: Whether to enable the connection pool when Hive MetaStore is accessed.
        Example value: Yes

      • Maximum Connections
        Description: Maximum number of connections between a single Coordinator and Hive MetaStore. Value range: 20 to 200; default value: 50.
        Example value: 50

    5. Configure parameters in the Hive User Information Configuration area. For details, see Table 4.
      Hive User Information Configuration and HetuEngine-Hive User Mapping Configuration must be used together. When HetuEngine connects to the Hive data source, user mapping enables HetuEngine users to have the same permissions as the mapped Hive data source user. Multiple HetuEngine users can be mapped to one Hive user.
      Table 4 Hive User Information Configuration

      • Data Source User
        Description: Data source user information. If the data source user is set to hiveuser1, a HetuEngine user mapped to hiveuser1 must exist. For example, create hetuuser1 and map it to hiveuser1.

      • Keytab File
        Description: Obtain the authentication credential of the user corresponding to the data source.

    6. (Optional) Configure parameters in the HetuEngine-Hive User Mapping Configuration area. For details, see Table 5.
      Table 5 HetuEngine-Hive User Mapping Configuration

      • HetuEngine User
        Description: HetuEngine user information.

      • Data Source User
        Description: Data source user information, for example, hiveuser1 (the data source user configured in Table 4).

    7. (Optional) Modify custom configurations.
      • You can click Add to add custom configuration parameters by referring to Table 6.
        Table 6 Custom parameters

        • hive.metastore.connection.pool.maxTotal
          Description: Maximum number of connections in the connection pool.
          Example value: 50 (value range: 20 to 200)

        • hive.metastore.connection.pool.maxIdle
          Description: Maximum number of idle threads in the connection pool. When the number of idle threads reaches this value, additional idle threads are not retained. Default value: 8.
          Example value: 8 (value range: 0 to 200; cannot exceed the maximum number of connections)

        • hive.metastore.connection.pool.minIdle
          Description: Minimum number of idle threads in the connection pool. When the number of idle threads reaches this value, the thread pool does not create new threads. Default value: 0.
          Example value: 0 (value range: 0 to 200; cannot exceed the value of hive.metastore.connection.pool.maxIdle)

        • hive.rcfile.time-zone
          Description: Adjusts binary-encoded timestamp values to a specific time zone. When the table storage format is RCBINARY or RCFILE, timestamp data inserted by HetuEngine is queried in Hive 3.1.0 or later as 8 hours earlier than in HetuEngine; in this case, set this parameter to UTC. Default value: JVM default (the local time zone obtained from the JVM).
          Example value: UTC

        • hive.orc.use-column-names
          Description: Whether to access ORC storage files by column name. The options are as follows:
          • true: yes
          • false (default value): no
          Example value: false

        • hive.parquet.use-column-names
          Description: Whether to access Parquet storage files by column name. The options are as follows:
          • true: yes
          • false (default value): no
          Example value: false

        • hive.hdfs.wire-encryption.enabled
          Description: Add this parameter and set it to false if the hadoop.rpc.protection parameter of HDFS is set to authentication or integrity.
          Example value: false

        • hive.strict-mode-restrictions
          Description: Constraints that restrict user queries. The options are as follows:
          • NONE: no constraints
          • DISALLOW_EXCEEDED_SCAN_ON_PARTITION (default value): the number of partitions scanned in a single Hive partitioned table cannot exceed the value of hive.max-partitions-per-scan
          Example value: DISALLOW_EXCEEDED_SCAN_ON_PARTITION

        • hive.ignore-absent-partitions
          Description: Whether to allow queries when files are missing from a partition. The options are as follows:
          • true: queries are allowed even if files are missing from a partition
          • false: an error is reported if files are missing from a partition
          If this parameter is left blank when data sources are manually connected, false is used by default.
          Example value: true

      • You can click Delete to delete custom configuration parameters.
        • You can prefix coordinator. or worker. to a custom parameter so that it applies only to coordinator or worker nodes. For example, if you prefix worker. to hive.metastore.connection.pool.maxTotal, the parameter becomes worker.hive.metastore.connection.pool.maxTotal; setting it to 50 allows worker nodes a maximum of 50 connections to Hive MetaStore. A custom parameter without a prefix applies to both coordinator and worker nodes.
        • By default, the maximum number of connections for coordinator nodes to access Hive MetaStore is 50, and the maximum and minimum numbers of idle data source connections are 8 and 0, respectively. The maximum number of connections for worker nodes to access Hive MetaStore is 20, and the maximum and minimum numbers of idle data source connections are both 0.
        • hive.max-partitions-per-scan: The maximum number of partitions scanned in a single Hive partitioned table. The default value is 100000.
        • The default value of hive.ignore-absent-partitions of the Hive data source co-deployed during HetuEngine installation is true.
    8. Click OK.

  7. Log in to the node where the cluster client is located and run the following commands to switch to the client installation directory and authenticate the user:

    cd /opt/client

    source bigdata_env

    kinit User performing HetuEngine operations (If the cluster is in normal mode, skip this step.)

  8. Run the following command to log in to the catalog of the data source:

    hetu-cli --catalog Data source name --schema Database name

    For example, run the following command:

    hetu-cli --catalog hive_1 --schema default

  9. Run the following command. If the database table information can be viewed or no error is reported, the connection is successful.

    show tables;

Data Type Mapping

Currently, Hive data sources support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, REAL, DOUBLE, DECIMAL, NUMERIC, DEC, VARCHAR, VARCHAR (X), CHAR, CHAR (X), STRING, DATE, TIMESTAMP, TIME WITH TIME ZONE, TIMESTAMP WITH TIME ZONE, TIME, ARRAY, MAP, STRUCT, and ROW.
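
For illustration, a Hive-side table that uses several of these types might be defined as follows (the table and column names are illustrative). When such a table is queried through HetuEngine, Hive STRING typically surfaces as VARCHAR and STRUCT as ROW:

CREATE TABLE type_demo (
  flag BOOLEAN,
  amount DECIMAL(10, 2),
  label STRING,
  created DATE,
  updated TIMESTAMP,
  tags ARRAY<STRING>,
  props MAP<STRING, INT>,
  info STRUCT<city:STRING, zip:INT>
)
STORED AS ORC;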

Performance Optimization

  • Metadata caching

    Hive connectors support metadata caching to respond to metadata requests for various operations faster. For details, see Adjusting Metadata Cache.

  • Dynamic filtering

    Enabling dynamic filtering helps optimize the calculation of the Join operator of Hive connectors. For details, see Enabling Dynamic Filtering.

  • Query with partition conditions

    Creating a partitioned table and querying data with partition filter criteria help filter out unneeded partition data, improving performance (see the example after this list).

  • INSERT statement optimization

    You can improve insert performance by setting task.writer-count to 1 and choosing a larger value for hive.max-partitions-per-writers. For details, see Optimizing INSERT Statements.
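
    The following sketch illustrates the partition-condition point above; the table and partition names are illustrative. The table is created with a partition column on the Hive side, and the query run from HetuEngine filters on that column:

    -- Hive-side DDL for a partitioned table
    CREATE TABLE sales (
      order_id BIGINT,
      amount DOUBLE
    )
    PARTITIONED BY (dt STRING)
    STORED AS ORC;

    -- Run in HetuEngine: the filter on dt lets the engine prune all other partitions
    SELECT count(*) FROM sales WHERE dt = '2024-11-01';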

Constraints

  • The DELETE syntax can be used to delete data from an entire table or from a specified partition in a partitioned table (see the example after this list).
  • The Hive metastore does not support schema renaming, that is, the ALTER SCHEMA RENAME syntax is not supported.
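
    For example, with the illustrative partitioned table sales (partitioned by column dt) from the previous section, the following DELETE statements are supported, whereas deleting individual rows by a non-partition column is not:

    -- Delete a specified partition
    DELETE FROM sales WHERE dt = '2024-11-01';
    -- Delete the data of the entire table
    DELETE FROM sales;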