Updated on 2024-11-29 GMT+08:00

Hive

By connecting to Hive Metastore, or a metadata service compatible with Hive Metastore, Doris can automatically obtain Hive database and table information and query the data.

In addition to Hive, many other systems also use Hive Metastore to store metadata. Through a Hive catalog, Doris can access not only Hive but also systems, such as Iceberg and Hudi, that use Hive Metastore as their metadata storage.

  • Managed Table is supported.
  • Hive and Hudi metadata stored in Hive Metastore can be identified.
  • If you want to access a catalog that was not created by the current user, grant the user the permission to operate the OBS path where the catalog data is stored.
  • The Hive table format can only be Parquet, ORC, or TextFile.

Prerequisites

  • A cluster containing the Doris service has been created, and all services in the cluster are running properly.
  • The nodes to be connected to the Doris database can communicate with the MRS cluster.
  • A user with Doris management permission has been created.
    • Kerberos authentication is enabled for the cluster (the cluster is in security mode)

      Log in to FusionInsight Manager, create a human-machine user, for example, dorisuser, create a role with Doris administrator permissions, and bind the role to the user.

      Log in to FusionInsight Manager as the new user dorisuser and change the initial password.

    • Kerberos authentication is disabled for the cluster (the cluster is in normal mode)

      After connecting to Doris as user admin, create a role with administrator permissions, and bind the role to the user.

  • The MySQL client has been installed. For details, see Installing a MySQL Client.

Hive Table Operations

  1. Perform the following operations to read Hive data stored in OBS with Doris:

    1. Log in to the MRS management console. Move the cursor to the username in the upper right corner and select My Credentials from the drop-down list.
    2. Click Access Keys, click Create Access Key, and enter the verification code or password. Click OK to generate an access key, and download it.

      Obtain the values of obs.access_key and obs.secret_key required for creating a catalog from the .csv file. The mapping is as follows:

      • The value of obs.access_key is the value in the Access Key Id column of the .csv file.
      • The value of obs.secret_key is the value in the Secret Access Key column of the .csv file.
      • Keep the CSV file secure. You can download the file only immediately after the access key is created. If you cannot find the file, create a new access key.
      • Keep your access keys secure and change them periodically for security purposes.
    3. You can obtain the value of obs.region from .
    4. Log in to the OBS management console, click Parallel File System, click the name of the OBS parallel file system where the Hive table is stored, and view the value of Endpoint on the overview page. The value is the same as that of obs.endpoint set during catalog creation.

  2. Log in to the node where the MySQL client is installed and connect to the Doris database.

    If Kerberos authentication is enabled for the cluster (the cluster is in security mode), run the following command to connect to the Doris database:

    export LIBMYSQL_ENABLE_CLEARTEXT_PLUGIN=1

    mysql -u<Database login username> -p<Database login password> -P<Connection port for FE queries> -h<IP address of the Doris FE instance>

    • To obtain the query connection port of the Doris FE instance, you can log in to FusionInsight Manager, choose Cluster > Services > Doris > Configurations, and query the value of query_port of the Doris service.
    • To obtain the IP address of the Doris FE instance, log in to FusionInsight Manager of the MRS cluster and choose Cluster > Services > Doris > Instances to view the IP address of any FE instance.
    • You can also use MySQL connection software or the Doris web UI to connect to the database.
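
    The connection command can be sketched as follows. The FE IP address, query port, and username below are hypothetical examples; substitute the values obtained from FusionInsight Manager.

    ```shell
    # Required in security mode so the MySQL client sends the password
    # with the cleartext plugin over the encrypted channel.
    export LIBMYSQL_ENABLE_CLEARTEXT_PLUGIN=1

    FE_HOST="192.168.67.78"   # service IP of any Doris FE instance (hypothetical)
    FE_PORT="29982"           # query_port value of the Doris service (hypothetical)
    DORIS_USER="dorisuser"    # user with Doris administrator permissions

    # Print the resulting command; run it interactively (-p prompts for the password).
    echo "mysql -u${DORIS_USER} -p -P${FE_PORT} -h${FE_HOST}"
    ```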

  3. Create a catalog.

    • Hive table data is stored in HDFS. Run the following command to create a catalog:
      • Kerberos authentication is enabled for the cluster (the cluster is in security mode):

        CREATE CATALOG hive_catalog PROPERTIES (

        'type'='hms',

        'hive.metastore.uris' = 'thrift://192.168.67.161:21088',

        'hive.metastore.sasl.enabled' = 'true',

        'hive.server2.thrift.sasl.qop' = 'auth-conf',

        'hive.server2.authentication' = 'KERBEROS',

        'dfs.nameservices'='hacluster',

        'dfs.ha.namenodes.hacluster'='24,25',

        'dfs.namenode.rpc-address.hacluster.24'='IP address of the active NameNode:RPC communication port',

        'dfs.namenode.rpc-address.hacluster.25'='IP address of the standby NameNode:RPC communication port',

        'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',

        'hive.version' = '3.1.0',

        'yarn.resourcemanager.address' = '192.168.67.78:26004',

        'yarn.resourcemanager.principal' = 'mapred/hadoop.hadoop.com@HADOOP.COM',

        'hive.metastore.kerberos.principal' = 'hive/hadoop.hadoop.com@HADOOP.COM',

        'hadoop.security.authentication' = 'kerberos',

        'hadoop.kerberos.keytab' = '${BIGDATA_HOME}/FusionInsight_Doris_8.3.1/install/FusionInsight-Doris-2.0.3/doris-be/bin/doris.keytab',

        'hadoop.kerberos.principal' = 'doris/hadoop.hadoop.com@HADOOP.COM',

        'java.security.krb5.conf' = '${BIGDATA_HOME}/FusionInsight_BASE_*/1_16_KerberosClient/etc/krb5.conf',

        'hadoop.rpc.protection' = 'privacy'

        );

      • Kerberos authentication is disabled for the cluster (the cluster is in normal mode):

        CREATE CATALOG hive_catalog PROPERTIES (

        'type'='hms',

        'hive.metastore.uris' = 'thrift://192.168.67.161:21088',

        'hive.version' = '3.1.0',

        'hadoop.username' = 'hive',

        'yarn.resourcemanager.address' = '192.168.67.78:26004',

        'dfs.nameservices'='hacluster',

        'dfs.ha.namenodes.hacluster'='24,25',

        'dfs.namenode.rpc-address.hacluster.24'='192.168.67.172:25000',

        'dfs.namenode.rpc-address.hacluster.25'='192.168.67.78:25000',

        'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'

        );

      • hive.metastore.uris: URI of the Hive Metastore. The format is thrift://<IP address of Hive Metastore>:<Port number>. Multiple values are supported and must be separated by commas (,).
      • dfs.nameservices: NameService name of the cluster. The value can be found in hdfs-site.xml, which is in the ${BIGDATA_HOME}/FusionInsight_HD_*/1_*_NameNode/etc directory on the node where NameNode is deployed.
      • dfs.ha.namenodes.hacluster: prefix of NameService node in a cluster, which contains two values. The value can be found in hdfs-site.xml, which is in the ${BIGDATA_HOME}/FusionInsight_HD_*/1_*_NameNode/etc directory on the node where NameNode is deployed.
      • dfs.namenode.rpc-address.hacluster.xx1: RPC communication address of the active NameNode. You can search for the value of this configuration item in hdfs-site.xml in the ${BIGDATA_HOME}/FusionInsight_HD_*/1_*_NameNode/etc directory on the node where NameNode is deployed. xx1 is the first value of dfs.ha.namenodes.hacluster.
      • dfs.namenode.rpc-address.hacluster.xx2: RPC communication address of the standby NameNode. You can search for the value of this configuration item in hdfs-site.xml in the ${BIGDATA_HOME}/FusionInsight_HD_*/1_*_NameNode/etc directory on the node where NameNode is deployed. xx2 is the second value of dfs.ha.namenodes.hacluster.
      • dfs.client.failover.proxy.provider.hacluster: Java class used by the HDFS client to connect to the active NameNode in the cluster. The value is org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.
      • hive.version: Hive version. To obtain the version, log in to FusionInsight Manager, choose Cluster > Services > Hive, and view the version on the Dashboard page.
      • yarn.resourcemanager.address: IP address of the active ResourceManager instance. On FusionInsight Manager, choose Cluster > Services > Yarn > Instances to view the service IP address of the active ResourceManager instance.
      • hadoop.rpc.protection: whether to encrypt the RPC stream of each Hadoop module. The default value is privacy. To obtain the value, log in to FusionInsight Manager, choose Cluster > Services > HDFS > Configurations, and search for hadoop.rpc.protection.
      • Kerberos authentication is enabled for the cluster (the cluster is in security mode):
        • hive.metastore.sasl.enabled: whether to enable MetaStore management permission. The value is true.
        • hive.server2.thrift.sasl.qop: whether to encrypt the interaction between HiveServer2 and the client. The value is auth-conf.
        • hive.server2.authentication: security authentication for accessing HiveServer. The value is KERBEROS.
        • yarn.resourcemanager.principal: Principal for accessing the Yarn cluster. The value is mapred/hadoop.hadoop.com@HADOOP.COM.
        • hive.metastore.kerberos.principal: Principal for accessing the Hive cluster. The value is hive/hadoop.hadoop.com@HADOOP.COM.
        • hadoop.security.authentication: security authentication for accessing Hadoop. The value is KERBEROS.
        • hadoop.kerberos.keytab: keytab for accessing the Hadoop cluster. The value is the path of the ${BIGDATA_HOME}/FusionInsight_Doris_*/install/FusionInsight-Doris-*/doris-be/bin/doris.keytab file.
        • hadoop.kerberos.principal: Principal for accessing the Hadoop cluster. The value is doris/hadoop.hadoop.com@HADOOP.COM.
        • java.security.krb5.conf: krb5 file. The value is the path of the ${BIGDATA_HOME}/FusionInsight_BASE_*/1_*_KerberosClient/etc/krb5.conf file.
      • Kerberos authentication is disabled for the cluster (the cluster is in normal mode):

        hadoop.username: username for accessing the Hadoop cluster. The value is hdfs.

    • Hive table data is stored in OBS. Run the following command to create a catalog. For details about related parameter values, see 1.

      CREATE CATALOG hive_obs_catalog PROPERTIES (

      'type'='hms',

      'hive.version' = '3.1.0',

      'hive.metastore.uris' = 'thrift://192.168.67.161:21088',

      'obs.access_key' = 'AK',

      'obs.secret_key' = 'SK',

      'obs.endpoint' = 'Endpoint address of the OBS parallel file system',

      'obs.region' = 'sa-fb-1'

      );

  4. Query the Hive table:

    • Query catalogs:

      show catalogs;

    • Query the databases in the catalog:

      show databases from hive_catalog;

    • Switch the catalog and access the database:

      switch hive_catalog;

      use default;

    • Query all tables in a database in the catalog:

      show tables from `hive_catalog`.`default`;

      Query a specified table:

      select * from `hive_catalog`.`default`.`test_table`;

      View the schema of the table:

      DESC test_table;

  5. After a Hive table is created or modified, refresh the catalog in Doris so that the metadata change is visible:

    refresh catalog hive_catalog;

  6. Perform a join query with tables in other catalogs:

    SELECT h.h_shipdate FROM hive_catalog.default.htable h WHERE h.h_partkey IN (SELECT p_partkey FROM internal.db1.part) LIMIT 10;

    • Identify a table by its fully qualified name in the catalog.database.table format, for example, internal.db1.part.
    • catalog and database can be omitted. If they are omitted, the catalog and database selected by the SWITCH and USE commands are used.
    • You can run the INSERT INTO command to insert table data from the Hive catalog into an internal table in the internal catalog.
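
    As a sketch of the INSERT INTO usage mentioned above: assuming a hypothetical internal table internal.db1.htable_copy already exists with a schema matching the Hive table, data can be copied as follows.

    ```sql
    -- internal.db1.htable_copy is a hypothetical target table; it must be
    -- created in the internal catalog beforehand with a compatible schema.
    INSERT INTO internal.db1.htable_copy
    SELECT * FROM hive_catalog.default.htable;
    ```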