Updated on 2024-10-09 GMT+08:00

Interconnecting Doris with the Hive Data Source

By connecting to Hive Metastore, or a metadata service compatible with Hive Metatore, Doris can automatically obtain Hive database table information and perform data queries.

In addition to Hive, many other systems also use Hive Metastore to store metadata. Through Hive Catalog, you can access not only Hive but also systems that use Hive Metastore as metadata storage, such as Iceberg and Hudi.

  • Managed tables are supported.
  • Hive and Hudi metadata stored in Hive Metastore can be identified.
  • If you want to access the catalog that is not created by the current user, you need to grant the user the permission to operate the OBS path where the catalog resides.
  • The Hive table format can only be Parquet, ORC, or TextFile.

Prerequisite

  • A cluster containing the Doris service has been created, and all services in the cluster are running properly.
  • The node to be connected to the Doris database can communicate with the MRS cluster.
  • A user with Doris management permission has been created.
    • Kerberos authentication is enabled for the cluster (the cluster is in security mode)

      Log in to FusionInsight Manager, create a human-machine user, for example, dorisuser, create a role with Doris administrator permissions, and bind the role to the user.

      Log in to FusionInsight Manager as the created dorisuser user, and change the initial password.

    • Kerberos authentication is disabled for the cluster (the cluster is in normal mode)

      After connecting to Doris as user admin, create a role with administrator permissions, and bind the role to the user.

  • The MySQL client has been installed. For details, see Using the MySQL Client to Connect to Doris.
  • If Doris imports data across clusters through Broker Load, you need to configure cross-cluster mutual trust. For details, see Configuring Cross-Manager Mutual Trust Between Clusters.

Hive Table Operations

  1. To use Doris to read Hive data stored in OBS, perform the following operations:

    1. Log in to the Huawei Cloud management console. On the console page, move the cursor to the username in the upper right corner and select My Credentials from the drop-down list.
    2. Click Access Keys, click Create Access Key, and enter the verification code or password. Click OK to generate an access key, and download it.

      Obtain the values of AWS_ACCESS_KEY and AWS_SECRET_KEY required for creating a catalog from the .csv file. The mapping is as follows:

      • The value of AWS_ACCESS_KEY is the value in the Access Key Id column in the .csv file.
      • The value of AWS_SECRET_KEY is the value in the Secret Access Key column of the .csv file.
      • Keep the CSV file properly. You can only download the file right after the access key is created. If you cannot find the file, you can create an access key again.
      • Keep your access keys secure and change them periodically for security purposes.
    3. AWS_REGION required for creating a catalog can be obtained from Regions and Endpoints.
    4. Log in to the OBS management console, click Parallel File System, click the name of the OBS parallel file system where the Hive table resides, and view the value of Endpoint on the overview page. The value is the same as that of AWS_ENDPOINT set during catalog creation.

  2. Log in to the node where MySQL is installed and run the following command to connect to the Doris database:

    If Kerberos authentication is enabled for the cluster (the cluster is in security mode), run the following command to connect to the Doris database:

    export LIBMYSQL_ENABLE_CLEARTEXT_PLUGIN=1

    mysql -uDatabase login username -pDatabase login password -PDatabase connection port -hIP address of the Doris FE instance

    • The database connection port is the query connection port of the Doris FE. You can log in to FusionInsight Manager, choose Cluster > Services > Doris > Configurations, and query the value of query_port of the Doris service.
    • To obtain the IP address of the Doris FE instance, log in to FusionInsight Manager of the MRS cluster and choose Cluster > Services > Doris > Instances to view the IP address of any FE instance.
    • You can also use the MySQL connection software or Doris WebUI to connect to the database.

  3. Create a catalog.

    • Hive table data is stored in HDFS. Run the following command to create a catalog:
      • Kerberos authentication is enabled for the cluster (the cluster is in security mode):

        CREATE CATALOG hive_catalog PROPERTIES (

        'type'='hms',

        'hive.metastore.uris' = 'thrift://192.168.67.161:21088',

        'hive.metastore.sasl.enabled' = 'true',

        'hive.server2.thrift.sasl.qop' = 'auth-conf',

        'hive.server2.authentication' = 'KERBEROS',

        'dfs.nameservices'='hacluster',

        'dfs.ha.namenodes.hacluster'='24,25',

        'dfs.namenode.rpc-address.hacluster.24'=' IP address of the active NameNode:RPC communication port ',

        'dfs.namenode.rpc-address.hacluster.25'=' IP address of the standby NameNode:RPC communication port ',

        'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',

        'hive.version' = '3.1.0',

        'yarn.resourcemanager.address' = '192.168.67.78:26004',

        'yarn.resourcemanager.principal' = 'mapred/hadoop.hadoop.com@HADOOP.COM',

        'hive.metastore.kerberos.principal' = 'hive/hadoop.hadoop.com@HADOOP.COM',

        'hadoop.security.authentication' = 'kerberos',

        'hadoop.kerberos.keytab' = '${BIGDATA_HOME}/FusionInsight_Doris_8.3.0/install/FusionInsight-Doris-1.2.3/doris-be/bin/doris.keytab',

        'hadoop.kerberos.principal' = 'doris/hadoop.hadoop.com@HADOOP.COM',

        'java.security.krb5.conf' = '${BIGDATA_HOME}/FusionInsight_BASE_*/1_16_KerberosClient/etc/krb5.conf',

        'hadoop.rpc.protection' = 'privacy'

        );

      • Kerberos authentication is disabled for the cluster (the cluster is in normal mode):

        CREATE CATALOG hive_catalog PROPERTIES (

        'type'='hms',

        'hive.metastore.uris' = 'thrift://192.168.67.161:21088',

        'hive.version' = '3.1.0',

        'hadoop.username' = 'hive',

        'yarn.resourcemanager.address' = '192.168.67.78:26004',

        'dfs.nameservices'='hacluster',

        'dfs.ha.namenodes.hacluster'='24,25',

        'dfs.namenode.rpc-address.hacluster.24'='192-168-67-172:25000',

        'dfs.namenode.rpc-address.hacluster.25'='192-168-67-78:25000',

        'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'

        );

      • hive.metastore.uris: URL of Hive MetaStore. The format is thrift://<Hive MetaStore IP address> : <Port number>. Multiple values are supported and separated by commas (,).
      • dfs.nameservices: NameService name of the cluster. The value can be found in hdfs-site.xml, which is in the ${BIGDATA_HOME}/FusionInsight_HD_*/1_*_NameNode/etc directory where NameNode is deployed.
      • dfs.ha.namenodes.hacluster: cluster NameService prefix, which contains two values. The value can be found in hdfs-site.xml, which is in the ${BIGDATA_HOME}/FusionInsight_HD_*/1_*_NameNode/etc directory where NameNode is deployed.
      • dfs.namenode.rpc-address.hacluster. xx1: specifies the RPC communication address of the active NameNode. You can search for the value of this configuration item in hdfs-site.xml in the ${BIGDATA_HOME}/FusionInsight_HD_*/1_*_NameNode/etc directory on the node where NameNode resides. xx is the value of dfs.ha.namenodes.hacluster.
      • dfs.namenode.rpc-address.hacluster. xx2: RPC communication address of the standby NameNode. You can search for the value of this configuration item in hdfs-site.xml in the ${BIGDATA_HOME}/FusionInsight_HD_*/1_*_NameNode/etc directory on the node where NameNode resides. xx is the value of dfs.ha.namenodes.hacluster.
      • To query the IP address of the active/standby instance, choose Cluster > Services > HDFS > Instances on FusionInsight Manager.
      • The default RPC port number is 25000. To query it, log in to FusionInsight Manager, choose Cluster > Services > HDFS > Configurations.
      • dfs.client.failover.proxy.provider.hacluster: specifies the Java class for the HDFS client to connect to the active node in the cluster. The value is org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.
      • hive.version: Hive version. To obtain the version, log in to FusionInsight Manager, choose Cluster > Services > Hive, and view the version on the Dashboard page.
      • yarn.resourcemanager.address: indicates the IP address of the active ResourceManager instance. On FusionInsight Manager, choose Cluster > Services > Yarn > Instances to view the service IP address of the active ResourceManager instance. To obtain the port number, click Configurations and search for yarn.resourcemanager.port in the search box.
      • hadoop.rpc.protection: specifies whether to encrypt the RPC stream of each Hadoop module. The default value is privacy. To obtain the value, log in to FusionInsight Manager, choose Cluster > Services > HDFS > Configurations, and search for hadoop.rpc.protection.
      • Kerberos authentication is enabled for the cluster (the cluster is in security mode):
        • hive.metastore.sasl.enabled: indicates whether to enable the MetaStore management permission. The value is true.
        • hive.server2.thrift.sasl.qop: indicates whether to encrypt the interaction between HiveServer2 and the client. The value is auth-conf.
        • hive.server2.authentication: security authentication mode for accessing HiveServer. The value is KERBEROS.
        • yarn.resourcemanager.principal: Principal for accessing the Yarn cluster. The value is mapred/hadoop.hadoop.com@HADOOP.COM.
        • hive.metastore.kerberos.principal: Principal for accessing the Hive cluster. The value is hive/hadoop.hadoop.com@HADOOP.COM.
        • hadoop.security.authentication: security authentication mode for accessing Hadoop. The value is KERBEROS.
        • hadoop.kerberos.keytab: keytab for accessing the Hadoop cluster. The value is the path of the ${BIGDATA_HOME}/FusionInsight_Doris_*/install/FusionInsight-Doris-*/doris-be/bin/doris.keytab file.
        • hadoop.kerberos.principal: Principal for accessing the Hadoop cluster. The value is doris/hadoop.hadoop.com@HADOOP.COM.
        • java.security.krb5.conf: krb5 file. The value is the path of the ${BIGDATA_HOME}/FusionInsight_BASE_*/1_*_KerberosClient/etc/krb5.conf file.
      • Kerberos authentication is disabled for the cluster (the cluster is in normal mode):

        hadoop.username: user name for accessing the Hadoop cluster. The value is hdfs.

    • Hive table data is stored in OBS. Run the following command to create a catalog. For details about related parameter values, see 1.

      CREATE CATALOG hive_obs_catalog PROPERTIES (

      'type'='hms',

      'hive.version' = '3.1.0',

      'hive.metastore.uris' = 'thrift://192.168.67.161:21088',

      'hive.metastore.sasl.enabled' = 'true',

      'hive.server2.thrift.sasl.qop' = 'auth-conf',

      'hive.server2.authentication' = 'KERBEROS',

      'dfs.nameservices'='hacluster',

      'dfs.ha.namenodes.hacluster'='24,25',

      'dfs.namenode.rpc-address.hacluster.24'=' IP address of the active NameNode:RPC communication port ',

      'dfs.namenode.rpc-address.hacluster.25'=' IP address of the standby NameNode:RPC communication port ',

      'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',

      'yarn.resourcemanager.address' = '192.168.67.78:26004',

      'yarn.resourcemanager.principal' = 'mapred/hadoop.hadoop.com@HADOOP.COM',

      'hive.metastore.kerberos.principal' = 'hive/hadoop.hadoop.com@HADOOP.COM',

      'hadoop.security.authentication' = 'kerberos',

      'hadoop.kerberos.keytab' = '${BIGDATA_HOME}/FusionInsight_Doris_8.3.0/install/FusionInsight-Doris-1.2.3/doris-be/bin/doris.keytab',

      'hadoop.kerberos.principal' = 'doris/hadoop.hadoop.com@HADOOP.COM',

      'java.security.krb5.conf' = '${BIGDATA_HOME}/FusionInsight_BASE_*/1_16_KerberosClient/etc/krb5.conf',

      'AWS_ACCESS_KEY' = 'AK',

      'AWS_SECRET_KEY' = 'SK',

      'AWS_ENDPOINT' = ' Endpoint address of the OBS parallel file system ',

      'AWS_REGION' = 'sa-fb-1',

      'hadoop.rpc.protection' = 'privacy'

      );

  4. Run the following command to query the Hive table:

    • Run the following command to query Catalog:

      show catalogs;

    • Run the following command to query the database in the Catalog:

      show databases from hive_catalog;

    • Run the following command to switch to the Catalog and access the database:

      switch hive_catalog;

      use default;

    • To query all tables in a database in Catalog, run the following command:

      show tables from `hive_catalog`.`default`;

      To query a specified table, run the following command:

      select * from `hive_catalog`.`default`.`test_table`;

      Run the following command to view the schema of the table:

      DESC test_table;

  5. After creating or operating a Hive table, you need to refresh the table in the Doris.

    refresh catalog hive_catalog;

  6. Run the following command to perform associated query with tables in other data directories:

    SELECT h.h_shipdate FROM hive_catalog.default.htable h WHERE h.h_partkey IN (SELECT p_partkey FROM internal.db1.part) LIMIT 10;

    • A table is identified in catalog.database.table full restriction mode, for example, internal.db1.part.
    • catalog and database can be omitted. By default, the Catalog and Database after the switchover of SWITCH and USE are used.
    • You can run the INSERT INTO command to insert table data in Hive Catalog to an internal table in Interal Catalog to import external data catalog data.