Updated on 2025-07-16 GMT+08:00

Adding a Hive Data Source

Introduction to Hive Data Sources

During HetuEngine installation, the co-deployed Hive data source (in the same cluster as HetuEngine) is interconnected by default. The data source is named hive and cannot be deleted. Some default configurations, such as the data source name, data source type, server principal, and client principal, cannot be modified. If the environment configuration changes (for example, the local domain name of the cluster is changed), restarting the HetuEngine service automatically synchronizes the affected configurations of the co-deployed Hive data source, such as the server principal and client principal.

  • Currently, HetuEngine supports Hive data sources in the following data formats: Avro, Text, RCText, ORC, Parquet, and SequenceFile.
  • When HetuEngine interconnects with a Hive data source, you cannot specify multiple delimiters when creating a table. However, if a multi-delimiter table in text format is created for the Hive data source with the MultiDelimitSerDe class specified as its serialization class, you can query the table through HetuEngine (a sketch follows this list).
  • The Hive data source interconnected with HetuEngine supports Hudi table redirection, which is available in MRS 3.3.0 or later. Hudi table access requests are redirected to the Hudi connector, so the advanced functions of the Hudi connector become available. To use this function, configure the target Hudi data source, ensure that the metastore URL of the Hudi data source is the same as that of the current Hive data source, and enable Hudi redirection for the Hive data source.
  • To use the isolation function of Hive Metastore, you need to configure HIVE_METASTORE_URI_HETU on Hive and restart the HSBroker instance of the HetuEngine service to update the Hive Metastore URI.
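  For example, such a multi-delimiter table can be created on the Hive side as in the following minimal sketch. The table name, columns, and delimiter are illustrative, an authenticated Hive client session is assumed, and the MultiDelimitSerDe class path may vary with the Hive version (some versions ship it as org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe):

    # Run on the Hive side: a text-format table whose fields are separated by "||"
    beeline -e "CREATE TABLE demo_multi_delim (id INT, name STRING)
      ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.MultiDelimitSerDe'
      WITH SERDEPROPERTIES ('field.delim'='||')
      STORED AS TEXTFILE;"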

This section describes how to add, on HSConsole, a Hive data source located in a cluster other than the HetuEngine cluster.

Prerequisites for Adding a Hive Data Source

  • The domain name of the cluster where the data source is located must be different from that of the HetuEngine cluster.
  • The nodes of the cluster where the data source is located must be able to communicate with the nodes of the HetuEngine cluster.
  • In the /etc/hosts file of all nodes in the cluster where HetuEngine is located, add mappings between the host names and IP addresses of the cluster where the data source to be connected is located, and add a mapping for hadoop.<system domain name> (for example, 10.10.10.10 hadoop.hadoop.com). Otherwise, HetuEngine cannot connect by host name to nodes outside its own cluster. A sketch of such entries follows this list.
  • A HetuEngine compute instance has been created.
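  The added host entries can look like the following minimal sketch, in which all IP addresses and host names are illustrative:

    # /etc/hosts on every node of the HetuEngine cluster
    10.10.10.11 node-master1.datasource.com
    10.10.10.12 node-core1.datasource.com
    10.10.10.10 hadoop.hadoop.com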

Procedure for Adding a Hive Data Source

  1. Obtain the hdfs-site.xml and core-site.xml configuration files of the Hive data source cluster.

    1. Log in to FusionInsight Manager of the cluster where the Hive data source is located.
    2. In the upper right corner of the homepage, click Download Client to download the complete client to the local PC as prompted.
    3. Decompress the downloaded client file package and obtain the core-site.xml and hdfs-site.xml files in the FusionInsight_Cluster_1_Services_ClientConfig/HDFS/config directory.
    4. Check whether the core-site.xml file contains the fs.trash.interval configuration item. If it does not, add the following configuration item (a quick check command is shown at the end of this step):
      <property>
        <name>fs.trash.interval</name>
        <value>2880</value>
      </property>
    5. In the hdfs-site.xml file, change the value of dfs.client.failover.proxy.provider.<NameService name> to org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.
      For example, if the NameService name is hacluster, the configuration is as follows:
      <property>
        <name>dfs.client.failover.proxy.provider.hacluster</name>
        <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>

      If the Hive data source to be interconnected is in the same Hadoop cluster as HetuEngine, you can log in to the HDFS client and run the following commands to obtain the hdfs-site.xml and core-site.xml configuration files. For details, see Using the HDFS Client.

      hdfs dfs -get /user/hetuserver/fiber/restcatalog/hive/core-site.xml

      hdfs dfs -get /user/hetuserver/fiber/restcatalog/hive/hdfs-site.xml
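      A quick way to perform the check in substep 4 is to grep for the configuration item and print the line after it; run this in the directory containing the extracted core-site.xml:

      grep -A 1 'fs.trash.interval' core-site.xml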

  2. Obtain the user.keytab and krb5.conf files of the Hive data source's proxy user.

    1. Log in to FusionInsight Manager of the cluster where the Hive data source is located.
    2. Choose System > Permission > User.
    3. Locate the row that contains the target data source user, click More in the Operation column, and select Download Authentication Credential.
    4. Decompress the downloaded package to obtain the user.keytab and krb5.conf files.

      The proxy user of the Hive data source must be associated with at least the hive user group.
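      Optionally, you can inspect the downloaded keytab with the standard Kerberos klist tool to confirm which principal it contains; that principal is also the username needed later for Client Principal:

      klist -kt user.keytab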

  3. Obtain the metastore URL and server principal.

    1. Decompress the client package of the cluster where the Hive data source is located and obtain the hive-site.xml file from the FusionInsight_Cluster_1_Services_ClientConfig/Hive/config directory.
    2. Open the hive-site.xml file. The value of hive.metastore.uris is the metastore URL, and the value of hive.server2.authentication.kerberos.principal is the server principal (a quick way to extract both values follows this step).
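      Instead of searching the file manually, you can print both values with grep (each value appears on the line after the matched name):

      grep -A 1 'hive.metastore.uris' hive-site.xml

      grep -A 1 'hive.server2.authentication.kerberos.principal' hive-site.xml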

  4. Log in to FusionInsight Manager as a HetuEngine administrator and choose Cluster > Services > HetuEngine.
  5. In the Basic Information area on the Dashboard page, click the link next to HSConsole WebUI to access the HSConsole page.
  6. Choose Data Source and click Add Data Source. Configure parameters on the Add Data Source page.

    1. In the Basic Configuration area, set Name and select Hive for Data Source Type.
    2. Configure parameters in the Hive Configuration area. For details, see Table 1.
      Table 1 Hive configuration

      | Parameter | Description | Example Value |
      | --- | --- | --- |
      | Driver | The default value is fi-hive-hadoop. | fi-hive-hadoop |
      | hdfs-site File | Select the hdfs-site.xml configuration file obtained in 1. The file name is fixed. | - |
      | core-site File | Select the core-site.xml configuration file obtained in 1. The file name is fixed. | - |
      | yarn-site File | Obtain the file from the Yarn/config directory on the data source client. Upload this file only when a Hudi data source is connected. | - |
      | krb5 File | Configure this parameter when the security mode is enabled. It is the configuration file used for Kerberos authentication. Select the krb5.conf file obtained in 2. | krb5.conf |
      | Enable Data Source Authentication | Whether to use the permission policy of the Hive data source for authentication. If Ranger is disabled for the HetuEngine service, select Yes. If Ranger is enabled, select No. | No |

    3. In the Metastore Configuration area, configure the parameters according to Table 2.
      Table 2 Metastore configuration

      | Parameter | Description | Example Value |
      | --- | --- | --- |
      | Metastore URL | URL of the metastore of the data source. For details about how to obtain it, see 3. | thrift://10.92.8.42:21088,thrift://10.92.8.43:21088,thrift://10.92.8.44:21088 |
      | Hudi Redirection (available for MRS 3.3.0 or later) | Available only when the metastore URL of the target Hudi data source is the same as that of the current Hive data source. Redirects Hudi table access requests from the Hive connector to the Hudi connector so that the advanced functions of the Hudi connector can be used. | No |
      | Hudi Data Source (available for MRS 3.3.0 or later) | Required when Hudi redirection is enabled. All configured Hudi data sources are displayed in the drop-down list. Select only a Hudi data source whose metastore URL is the same as that of the current Hive data source. | - |
      | Security Authentication Mechanism | After the security mode is enabled, the default value is KERBEROS. | KERBEROS |
      | Server Principal | Configure this parameter when the security mode is enabled. Username (with domain name) used to access the metastore. The value is that of hive.server2.authentication.kerberos.principal in the hive-site.xml file on the target data source client; for details about how to obtain it, see 3. | hive/hadoop.hadoop.com@HADOOP.COM |
      | Client Principal | Configure this parameter when the security mode is enabled. The format is: username for accessing the metastore@system domain name in uppercase. The username is the user to which the user.keytab file obtained in 2 belongs. NOTE: You can log in to FusionInsight Manager, choose System > Permission > Domain and Mutual Trust, and view the value of Local Domain, which is the current system domain name, for example, HADOOP.COM. | admintest@HADOOP.COM |
      | Keytab File | Configure this parameter when the security mode is enabled. Keytab credential file of the username for accessing the metastore. The file name is fixed. Select the user.keytab file obtained in 2. | user.keytab |

    4. Configure parameters in the Connection Pool Configuration area. For details, see Table 3.
      Table 3 Connection pool configuration

      | Parameter | Description | Example Value |
      | --- | --- | --- |
      | Enable Connection Pool | Whether to enable the connection pool when Hive Metastore is accessed. | Yes |
      | Maximum Connections | Maximum number of connections between a single Coordinator and Hive Metastore. Value range: 20 to 200; default value: 50. | 50 |

    5. Configure parameters in Hive User Information Configuration. For details, see Table 4.
      Hive User Information Configuration and HetuEngine-Hive User Mapping Configuration must be used together. When HetuEngine connects to the Hive data source, user mapping gives a HetuEngine user the same permissions as the mapped Hive data source user. Multiple HetuEngine users can map to one Hive user.
      Table 4 Hive user information configuration

      | Parameter | Description |
      | --- | --- |
      | Data Source User | Data source user information. If the data source user is set to hiveuser1, a HetuEngine user mapped to hiveuser1 must exist. For example, create hetuuser1 and map it to hiveuser1. |
      | Keytab File | Authentication credential file of the corresponding data source user. |

    6. (Optional) Configure parameters in the HetuEngine-Hive User Mapping Configuration area. For details, see Table 5.
      Table 5 HetuEngine-Hive user mapping configuration

      | Parameter | Description |
      | --- | --- |
      | HetuEngine User | HetuEngine user information. |
      | Data Source User | Data source user information, for example, hiveuser1 (the data source user configured in Table 4). |

    7. (Optional) Modify custom configurations.
      • You can click Add to add custom configuration parameters by referring to Table 6.
        Table 6 Custom configuration

        | Parameter | Description | Example Value |
        | --- | --- | --- |
        | hive.metastore.connection.pool.maxTotal | Maximum number of connections in the connection pool. | 50 (value range: 20 to 200) |
        | hive.metastore.connection.pool.maxIdle | Maximum number of idle threads in the connection pool. When the number of idle threads reaches this value, no new threads are released. Default value: 8. | 8 (value range: 0 to 200; cannot exceed the maximum number of connections) |
        | hive.metastore.connection.pool.minIdle | Minimum number of idle threads in the connection pool. When the number of idle threads reaches this value, the thread pool does not create new threads. Default value: 0. | 0 (value range: 0 to 200; cannot exceed the value of hive.metastore.connection.pool.maxIdle) |
        | hive.rcfile.time-zone | Adjusts binary-encoded timestamp values to a specific time zone. When the table storage format is RCBINARY or RCFILE, timestamp data inserted by HetuEngine is queried in Hive 3.1.0 or later as 8 hours earlier than in HetuEngine; in this case, set this parameter to UTC. Default value: JVM default (the local time zone obtained from the JVM). | UTC |
        | hive.orc.use-column-names | Whether to access ORC storage files by column name. true: yes; false (default value): no. | false |
        | hive.parquet.use-column-names | Whether to access Parquet storage files by column name. true: yes; false (default value): no. | false |
        | hive.hdfs.wire-encryption.enabled | Add this parameter and set it to false if the hadoop.rpc.protection parameter of the data source HDFS is set to authentication or integrity. | false |
        | hive.strict-mode-restrictions | Constraints that restrict user queries. NONE: no constraints. DISALLOW_EXCEEDED_SCAN_ON_PARTITION (default value): the number of partitions scanned in a single Hive partitioned table cannot exceed the value of hive.max-partitions-per-scan. | DISALLOW_EXCEEDED_SCAN_ON_PARTITION |
        | hive.ignore-absent-partitions | Whether to tolerate partitions with missing files. true: the query succeeds even if some partition files are absent. false: an error is reported if partition files are absent. If this parameter is left blank when a data source is manually connected, false is used by default. | true |

      • You can click Delete to delete custom configuration parameters.

      NOTE:

      • You can prefix coordinator. or worker. to a custom parameter so that it applies only to coordinator or worker nodes. For example, prefixing worker. to hive.metastore.connection.pool.maxTotal yields worker.hive.metastore.connection.pool.maxTotal; setting this parameter to 50 allows worker nodes a maximum of 50 connections to Hive Metastore. A custom parameter without a prefix applies to both coordinator and worker nodes.
      • By default, the maximum number of connections for coordinator nodes to access Hive Metastore is 50, and the maximum and minimum numbers of idle data source connections are 8 and 0, respectively. For worker nodes, the maximum number of connections is 20, and the maximum and minimum numbers of idle data source connections are both 0.
      • hive.max-partitions-per-scan: maximum number of partitions scanned in a single Hive partitioned table. The default value is 100000.
      • The default value of hive.ignore-absent-partitions for the Hive data source co-deployed during HetuEngine installation is true.
    8. Click OK.

  7. Log in to the node where the cluster client is deployed and run the following commands to go to the client installation directory and authenticate the user:

    cd /opt/client

    source bigdata_env

    kinit <user performing HetuEngine operations> (If the cluster is in normal mode, skip this command.)

  8. Log in to the catalog of the data source.

    hetu-cli --catalog <data source name> --schema <database name>

    For example, run the following command:

    hetu-cli --catalog hive_1 --schema default

  9. Run the following command. If the database table information can be viewed or no error is reported, the connection is successful.

    show tables;

Hive Data Type Mapping

Currently, Hive data sources support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, REAL, DOUBLE, DECIMAL, NUMERIC, DEC, VARCHAR, VARCHAR(X), CHAR, CHAR(X), STRING, DATE, TIMESTAMP, TIME WITH TIME ZONE, TIMESTAMP WITH TIME ZONE, TIME, ARRAY, MAP, STRUCT, and ROW.
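As a minimal sketch, the following statement, typed in the hetu-cli session opened in 8, creates a table that exercises several of these types; the table and column names are illustrative:

  CREATE TABLE type_demo (
    c_bool BOOLEAN,
    c_int INT,
    c_bigint BIGINT,
    c_dec DECIMAL(10, 2),
    c_vc VARCHAR(64),
    c_date DATE,
    c_ts TIMESTAMP,
    c_arr ARRAY(INTEGER),
    c_map MAP(VARCHAR, INTEGER)
  );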

Performance Optimization

  • Metadata caching

    Hive connectors support metadata caching, which serves metadata requests for various operations faster. For details, see Adjusting HetuEngine Metadata Caching.

  • Dynamic filtering

    Enabling dynamic filtering helps optimize the calculation of the Join operator of Hive connectors. For details, see Enabling Dynamic Filtering in HetuEngine.

  • Query with partition conditions

    Creating a partitioned table and querying it with partition filter criteria allow non-matching partitions to be skipped, improving performance (an example follows this list).

  • INSERT statement optimization

    You can improve insert performance by setting task.writer-count to 1 and choosing a larger value for hive.max-partitions-per-writers. For details, see Optimizing HetuEngine INSERT Statements. A session-level sketch also follows this list.
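    To illustrate querying with partition conditions, the following sketch, run in a hetu-cli session, assumes an illustrative table sales_data partitioned by dt; the partitioned_by table property follows the openLooKeng/Presto Hive connector convention:

      CREATE TABLE sales_data (id BIGINT, amount DOUBLE, dt VARCHAR)
      WITH (partitioned_by = ARRAY['dt']);

      SELECT count(*) FROM sales_data WHERE dt = '2025-01-01';

    Only the partition dt = '2025-01-01' is scanned; data in all other partitions is filtered out before it is read.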
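    For the INSERT optimization, the writer count can also be lowered per session before running the statement. This is a hedged sketch: the session property name task_writer_count follows openLooKeng/Presto conventions and may differ in your version, and the table names are illustrative:

      SET SESSION task_writer_count = 1;

      INSERT INTO sales_data SELECT * FROM sales_data_staging;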

Constraints on Hive Data Sources

  • The DELETE syntax can be used to delete data from an entire table or from a specified partition in a partitioned table (a sketch follows this list).
  • The Hive metastore does not support renaming schemas; that is, the ALTER SCHEMA RENAME syntax is not supported.
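  As a minimal sketch, again assuming the illustrative partitioned table sales_data with partition column dt, the two supported forms of DELETE are:

    -- Delete one partition of a partitioned table
    DELETE FROM sales_data WHERE dt = '2025-01-01';

    -- Delete the data of the entire table
    DELETE FROM sales_data;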