Updated on 2024-10-25 GMT+08:00

Adding a Hive Data Source

Introduction to Hive Data Sources

During HetuEngine installation, the co-deployed Hive data source (which is in the same cluster as HetuEngine) is interconnected by default. The data source name is hive and cannot be deleted. Some default configurations, such as the data source name, data source type, server principal, and client principal, cannot be modified. When the environment configuration changes, for example, the local domain name of the cluster is changed, restarting the HetuEngine service can automatically synchronize the configurations of the co-deployed Hive data source, such as server principal and client principal.

  • Currently, HetuEngine supports data sources of the following data formats: AVRO, TEXT, RCTEXT, ORC, Parquet, and SequenceFile.
  • When HetuEngine interconnects with Hive, you cannot specify multiple delimiters during table creation. However, if the MultiDelimitSerDe class is specified as the serialization class for a Hive data source to create a multi-delimiter table in text format, you can query the table using HetuEngine.
  • The Hive data source interconnected with HetuEngine supports Hudi table redirection. This function is available to MRS 3.3.0 or later. Hudi table access requests are redirected to the Hudi connector, so the advanced functions of the Hudi connector are available. To use this function, you need to configure the target Hudi data source, ensure that the Metastore URL of the Hudi data source is the same as that of the current Hive data source, and enable Hudi redirection for the Hive data source.
  • To use the isolation function of Hive Metastore, you need to configure HIVE_METASTORE_URI_HETU on Hive and restart the Hsbroke instance of the HetuEngine service to update the Hive Metastore URI.

Add a Hive data source outside a cluster on HSConsole.

Prerequisites for Adding a Hive Data Source

  • The domain name of the cluster where the data source is located must be different from the HetuEngine cluster domain name.
  • The cluster where the data source is located and the HetuEngine cluster nodes can communicate with each other.
  • In the /etc/hosts file of all nodes in the cluster where HetuEngine is located, add the mapping between the host names and IP addresses of the cluster where the data source to be connected is located, and add 10.10.10.10 hadoop.System domain name in the /etc/hosts file (for example, 10.10.10.10 hadoop.hadoop.com). Otherwise, HetuEngine cannot connect to the nodes that are not in the cluster based on the host name.
  • A HetuEngine compute instance has been created.

Procedure for Adding a Hive Data Source

  1. Obtain the hdfs-site.xml and core-site.xml configuration files of the Hive data source cluster.

    1. Log in to FusionInsight Manager of the cluster where the Hive data source is located.
    2. In the upper right corner of the homepage, click Download Client to download the complete client to the local PC as prompted.
    3. Decompress the downloaded client file package and obtain the core-site.xml and hdfs-site.xml files in the FusionInsight_Cluster_1_Services_ClientConfig/HDFS/config directory.
    4. Check whether the core-site.xml file contains the fs.trash.interval configuration item. If not, add the following configuration items:
      <property>
      <name>fs.trash.interval</name>
      <value>2880</value>
      </property>
    5. Change the value of dfs.client.failover.proxy.provider.NameService name in the hdfs-site.xml file to org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.
      For example, if the NameService name is hacluster, the configuration is as follows:
      <property>
      <name>dfs.client.failover.proxy.provider.hacluster</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
      </property>

      If the Hive data source to be interconnected is in the same Hadoop cluster with HetuEngine, you can log in to the HDFS client and run the following commands to obtain the hdfs-site.xml and core-site.xml configuration files. For details, see Using the HDFS Client.

      hdfs dfs -get /user/hetuserver/fiber/restcatalog/hive/core-site.xml

      hdfs dfs -get /user/hetuserver/fiber/restcatalog/hive/hdfs-site.xml

  2. Obtain the user.keytab and krb5.conf files of the proxy user of the Hive data source.

    1. Log in to FusionInsight Manager of the cluster where the Hive data source is located.
    2. Choose System > Permission > User.
    3. Locate the row that contains the target data source user, click More in the Operation column, and select Download Authentication Credential.
    4. Decompress the downloaded package to obtain the user.keytab and krb5.conf files.

      The proxy user of the Hive data source must be associated with at least the hive user group.

  3. Obtain the MetaStore URL and the Principal of the server.

    1. Decompress the client package of the cluster where the Hive data source is located and obtain the hive-site.xml file from the FusionInsight_Cluster_1_Services_ClientConfig/Hive/config directory.
    2. Open the hive-site.xml file and search for hive.metastore.uris. The value of hive.metastore.uris is the value of MetaStore URL. Search for hive.server2.authentication.kerberos.principal. The value of hive.server2.authentication.kerberos.principal is the value of Principal on the server.

  4. Log in to FusionInsight Manager as a HetuEngine administrator and choose Cluster > Services > HetuEngine. The HetuEngine service page is displayed.
  5. In the Basic Information area on the Dashboard page, click the link next to HSConsole WebUI. The HSConsole page is displayed.
  6. Choose Data Source and click Add Data Source. Configure parameters on the Add Data Source page.

    1. In the Basic Configuration area, configure Name and choose Hive for Data Source Type.
    2. Configure parameters in the Hive Configuration area. For details, see Table 1.
      Table 1 Hive Configuration

      Parameter

      Description

      Example Value

      Driver

      The default value is fi-hive-hadoop.

      fi-hive-hadoop

      hdfs-site File

      Select the hdfs-site.xml configuration file obtained in 1. The file name is fixed.

      -

      core-site File

      Select the core-site.xml configuration file obtained in 1. The file name is fixed.

      -

      yarn-site File

      Obtain the file from the Yarn/config directory on the data source client. Upload this file only when the Hudi data source is connected.

      -

      krb5 File

      Configure this parameter when the security mode is enabled.

      It is the configuration file used for Kerberos authentication. Select the krb5.conf file obtained in 2.

      krb5.conf

      Enable Data Source Authentication

      Whether to use the permission policy of the Hive data source for authentication.

      If Ranger is disabled for the HetuEngine service, select Yes. If Ranger is enabled, select No.

      No

    3. In the Metastore Configuration area, configure the parameters according to Table 2.
      Table 2 MetaStore Configuration

      Parameter

      Description

      Example Value

      Metastore URL

      URL of the MetaStore of the data source. For details, see 3.

      thrift://10.92.8.42:21088,thrift://10.92.8.43:21088,thrift://10.92.8.44:21088

      Hudi Redirection

      (available for MRS 3.3.0 or later)

      This parameter is available only when the Metastore URL of the target Hudi data source is the same as that of the current Hive data source.

      This function redirects Hudi table access request to the Hudi connector, so the advanced functions of the Hudi connector can be used.

      No

      Hudi Data Source

      (available for MRS 3.3.0 or later)

      This parameter is required for Hudi redirection.

      All configured Hudi data sources are displayed in the drop-down list box. Select only the Hudi data source that has the same Metastore URL.

      -

      Security Authentication Mechanism

      After the security mode is enabled, the default value is KERBEROS.

      KERBEROS

      Server Principal

      Configure this parameter when the security mode is enabled. Value of hive-site.xml in hive.server2.authentication.kerberos.principal on the data source client.

      It specifies the username with domain name used by meta to access MetaStore. For details, see 3.

      hive/hadoop.hadoop.com@HADOOP.COM

      Client Principal

      Configure this parameter when the security mode is enabled.

      The parameter format is as follows: username for accessing MetaStore@domain name (uppercase).

      Username for accessing MetaStore is the user to which the user.keytab file obtained in 2 belongs.

      NOTE:

      You can log in to FusionInsight Manager, choose System > Permission > Domain and Mutual Trust, and view the value of Local Domain, which is the current system domain name, for example, HADOOP.COM.

      admintest@HADOOP.COM

      Keytab File

      Configure this parameter when the security mode is enabled.

      It specifies the keytab credential file of the MetaStore user name. The file name is fixed. Select the user.keytab file obtained in 2.

      user.keytab

    4. Configure parameters in the Connection Pool Configuration area. For details, see Table 3.
      Table 3 Connection Pool Configuration

      Parameter

      Description

      Example Value

      Enable Connection Pool

      Whether the connection pool is enabled when Hive MetaStore is accessed.

      Yes

      Maximum Connections

      Maximum number of connections between a single Coordinator and Hive Metastore. Value range: 20 to 200; default value: 50

      50

    5. Configure parameters in Hive User Information Configuration. For details, see Table 4.
      Hive User Information Configuration and HetuEngine-Hive User Mapping Configuration must be used together. When HetuEngine is connected to the Hive data source, user mapping enables HetuEngine users to have the same permissions of the mapped Hive data source user. Multiple HetuEngine users can correspond to one Hive user.
      Table 4 Hive User Information Configuration

      Parameter

      Description

      Data Source User

      Data source user information

      If the data source user is set to hiveuser1, a HetuEngine user mapped to hiveuser1 must exist. For example, create hetuuser1 and map it to hiveuser1.

      Keytab File

      Obtain the authentication credential of the user corresponding to the data source.

    6. (Optional) Configure parameters in the HetuEngine-Hive User Mapping Configuration area. For details, see Table 5.
      Table 5 HetuEngine-Hive User Mapping Configuration

      Parameter

      Description

      HetuEngine User

      HetuEngine user information.

      Data Source User

      Data source user information, for example, hiveuser1 (data source user configured in Table 4)

    7. (Optional) Modify custom configurations.
      • You can click Add to add custom configuration parameters by referring to Table 6.
        Table 6 Custom parameters

        Parameter

        Description

        Example Value

        hive.metastore.connection.pool.maxTotal

        Maximum number of connections in the connection pool.

        50 (The value ranges from 20 to 200.)

        hive.metastore.connection.pool.maxIdle

        Maximum number of idle threads in the connection pool. When the number of idle threads reaches the maximum number, new threads are not released.

        Default value: 8

        8 (The value ranges from 0 to 200 and cannot exceed the maximum number of connections.)

        hive.metastore.connection.pool.minIdle

        Minimum number of idle threads in the connection pool. When the number of idle threads reaches the minimum number, the thread pool does not create new threads.

        Default value: 0

        0 (The value ranges from 0 to 200 and cannot exceed the value of hive.metastore.connection.pool.maxIdle.)

        hive.rcfile.time-zone

        Adjusts the binary-encoded timestamp value to a specific time zone.

        When the table storage format is RCBINARY or RCFILE, the query result of timestamp data inserted by HetuEngine in Hive 3.1.0 or later is 8 hours earlier than that in HetuEngine. In this case, set this parameter to UTC.

        Default value: JVM default (obtaining the local time zone from JVM)

        UTC

        hive.orc.use-column-names

        Whether to access ORC storage files by column name. The options are as follows:

        • true: yes
        • false (default value): no

        false

        hive.parquet.use-column-names

        Whether to access Parquet storage files by column name. The options are as follows:

        • true: yes
        • false (default value): no

        false

        hive.hdfs.wire-encryption.enabled

        This parameter needs to be added and set to false if the hadoop.rpc.protection parameter of the HDFS is set to authentication or integrity.

        false

        hive.strict-mode-restrictions

        You can configure the following constraints to restrict user query:

        • NONE: no constraints
        • DISALLOW_EXCEEDED_SCAN_ON_PARTITION (default value): The maximum number of partitions scanned in a single Hive partitioned table cannot be greater than the value of hive.max-partitions-per-scan.

        DISALLOW_EXCEEDED_SCAN_ON_PARTITION

        hive.ignore-absent-partitions

        Query whether any file is missing in a partition. Value options are as follows:

        • true: Queries whether files are missing in the partition.
        • false: Do not query whether files are missing in the partition. In this case, an error is reported. If this parameter is left blank when data sources are manually connected, false is used by default.

        true

      • You can click Delete to delete custom configuration parameters.
        • You can prefix coordinator. or worker. to the custom parameters so that the parameters apply only to coordinator or worker nodes. For example, if you prefix worker. to hive.metastore.connection.pool.maxTotal, the custom parameter becomes worker.hive.metastore.connection.pool.maxTotal. If you set this new parameter to 50, it indicates that a maximum number of 50 connections are allowed for worker nodes to access Hive MetaStore. If a custom parameter is not prefixed, the custom parameter is available for both coordinator and worker nodes.
        • By default, the maximum number of connections for coordinator nodes to access Hive MetaStore is 50, and the maximum and minimum numbers of idle data source connections are 8 and 0, respectively. The maximum number of connections for worker nodes to access Hive MetaStore is 20, and the maximum and minimum numbers of idle data source connections are both 0.
        • hive.max-partitions-per-scan: The maximum number of partitions scanned in a single Hive partitioned table. The default value is 100000.
        • The default value of hive.ignore-absent-partitions of the Hive data source co-deployed during HetuEngine installation is true.
    8. Click OK.

  7. Log in to the node where the cluster client is located and run the following commands to switch to the client installation directory and authenticate the user:

    cd /opt/client

    source bigdata_env

    kinit User performing HetuEngine operations (If the cluster is in normal mode, skip this step.)

  8. Run the following command to log in to the catalog of the data source:

    hetu-cli --catalog Data source name --schema Database name

    For example, run the following command:

    hetu-cli --catalog hive_1 --schema default

  9. Run the following command. If the database table information can be viewed or no error is reported, the connection is successful.

    show tables;

Hive Data Type Mapping

Currently, Hive data sources support the following data types: BOOLEAN, TINYINT, SMALLINT, INT, BIGINT, REAL, DOUBLE, DECIMAL, NUMERIC, DEC, VARCHAR, VARCHAR (X), CHAR, CHAR (X), STRING, DATE, TIMESTAMP, TIME WITH TIMEZONE, TIMESTAMP WITH TIME ZONE, TIME, ARRAY, MAP, STRUCT, and ROW.

Performance Optimization

  • Metadata caching

    Hive connectors support metadata caching to provide metadata requests for various operations faster. For details, see Adjusting HetuEngine Metadata Caching.

  • Dynamic filtering

    Enabling dynamic filtering helps optimize the calculation of the Join operator of Hive connectors. For details, see Enabling Dynamic Filtering in HetuEngine.

  • Query with partitioning conditions

    Creating a partitioned table and querying data with partitioning filter criteria help filter out some partition data, improving performance.

  • INSERT statement optimization

    You can improve insert performance by setting task.writer-count to 1 and choosing a larger value for hive.max-partitions-per-writers. For details, see Optimizing HetuEngine INSERT Statements.

Constraints on Hive Data Sources

  • The DELETE syntax can be used to delete data from an entire table or a specified partition in a partitioned table.
  • The Hive metabase does not support schema renaming, that is, the ALTER SCHEMA RENAME syntax is not supported.