Updated on 2025-11-17 GMT+08:00

Interconnecting Doris with the Hive Data Source

By connecting to Hive Metastore, or a metadata service compatible with Hive Metastore, Doris can automatically obtain Hive database and table information and query the data.

In addition to Hive, many other systems also use Hive Metastore to store metadata. Through Hive Catalog, you can access not only Hive but also systems that use Hive Metastore as metadata storage, such as Iceberg and Hudi.

Constraints

  • Managed tables are supported.
  • Hive and Hudi metadata stored in Hive Metastore can be identified.
  • The Hive table storage format can only be Parquet, ORC, or TextFile.
  • To ensure query performance on Hive tables, keep the data volume to be read at 1 TB or less and the number of partitions at 5,000 or fewer, and avoid full table scans.
  • The delete and update operations cannot be performed on Hive Catalog tables.
  • Doris accesses Array-type fields starting from subscript 1 by default, whereas Hive array subscripts start from 0 by default.
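
  The subscript difference in the last constraint can be sketched as follows, assuming a hypothetical Hive table array_tbl with an array column col_arr (both names are illustrative):

  ```sql
  -- In Hive, array subscripts start at 0: this reads the first element.
  SELECT col_arr[0] FROM array_tbl;

  -- In Doris (through the Hive Catalog), subscripts start at 1:
  -- this reads the same first element.
  SELECT col_arr[1] FROM hive_catalog.default.array_tbl;
  ```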

Prerequisites

  • A cluster containing the Doris service has been created, and all services in the cluster are running properly.
  • The node to be connected to the Doris database can communicate with the MRS cluster.
  • A user with Doris management permission has been created.
    • Kerberos authentication is enabled for the cluster (the cluster is in security mode)

      Log in to FusionInsight Manager, create a human-machine user, for example, dorisuser, create a role with Doris administrator permissions, and bind the role to the user.

      Log in to FusionInsight Manager as the created dorisuser user, and change the initial password.

    • Kerberos authentication is disabled for the cluster (the cluster is in normal mode)

      After connecting to Doris as user admin, create a role with administrator permissions, and bind the role to the user.

  • The MySQL client has been installed. For details, see Using the MySQL Client to Connect to Doris.
  • If Doris accesses Hive data across clusters through Catalog, you need to configure cross-cluster mutual trust. For details, see Configuring Cross-Manager Mutual Trust Between Clusters.

Hive Table Operations

  1. To use Doris to read Hive data stored in OBS, perform the following operations:

    1. Log in to the Huawei Cloud management console. On the console page, move the cursor to the username in the upper right corner and select My Credentials from the drop-down list.
    2. Click Access Keys, click Create Access Key, and enter the verification code or password. Click OK to generate an access key, and download it.

      Obtain the values of AWS_ACCESS_KEY and AWS_SECRET_KEY required for creating a catalog from the .csv file. The mapping is as follows:

      • The value of AWS_ACCESS_KEY is the value in the Access Key Id column in the .csv file.
      • The value of AWS_SECRET_KEY is the value in the Secret Access Key column of the .csv file.
      • Keep the .csv file secure. It can be downloaded only immediately after the access key is created. If you lose the file, create a new access key.
      • Keep your access keys secure and change them periodically for security purposes.
      • In MRS 3.3.1 and later versions, AWS_ACCESS_KEY and AWS_SECRET_KEY are replaced with obs.access_key and obs.secret_key.
    3. You can obtain the value of AWS_REGION (obs.region in MRS 3.3.1 and later versions) from Regions and Endpoints.
    4. Log in to the OBS management console, click Parallel File System, click the name of the OBS parallel file system where the Hive table resides, and view the value of Endpoint on the overview page. The value is the same as that of AWS_ENDPOINT (obs.endpoint in MRS 3.3.1 and later versions) set during catalog creation.

  2. Log in to the node where MySQL is installed and run the following command to connect to the Doris database:

    If Kerberos authentication is enabled for the cluster (the cluster is in security mode), run the following command to connect to the Doris database:

    export LIBMYSQL_ENABLE_CLEARTEXT_PLUGIN=1

    Connect to the Doris database:

    mysql -u<Database login username> -p<Database login password> -P<Database connection port> -h<IP address of the Doris FE instance>
    • The database connection port is the query connection port of the Doris FE. You can log in to FusionInsight Manager, choose Cluster > Services > Doris > Configurations, and query the value of query_port of the Doris service.
    • To obtain the IP address of the Doris FE instance, log in to FusionInsight Manager of the MRS cluster and choose Cluster > Services > Doris > Instances to view the IP address of any FE instance.
    • You can also use the MySQL connection software or Doris WebUI to connect to the database.
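
    For example, assuming the FE instance IP address is 192.168.67.78 and query_port is 29982 (both values are illustrative), the connection commands for a security-mode cluster would look like this:

    ```shell
    # Allow the cleartext authentication plugin (security-mode clusters only).
    export LIBMYSQL_ENABLE_CLEARTEXT_PLUGIN=1

    # Connect as the Doris administrator user created in the prerequisites;
    # the password is prompted interactively.
    mysql -udorisuser -p -P29982 -h192.168.67.78
    ```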

  3. Create a catalog.

    • Hive table data is stored in HDFS. Run the following command to create a catalog:
      • Kerberos authentication is enabled for the cluster (the cluster is in security mode):
        CREATE CATALOG hive_catalog PROPERTIES (
        'type'='hms',
        'hive.metastore.uris' = 'thrift://192.168.67.161:21088',
        'hive.metastore.sasl.enabled' = 'true',
        'hive.server2.thrift.sasl.qop' = 'auth-conf',
        'hive.server2.authentication' = 'KERBEROS',
        'dfs.nameservices'='hacluster',
        'dfs.ha.namenodes.hacluster'='24,25',
        'dfs.namenode.rpc-address.hacluster.24'='IP address of the active NameNode:RPC communication port',
        'dfs.namenode.rpc-address.hacluster.25'='IP address of the standby NameNode:RPC communication port',
        'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
        'hive.version' = '3.1.0',
        'yarn.resourcemanager.address' = '192.168.67.78:26004',
        'yarn.resourcemanager.principal' = 'mapred/hadoop.hadoop.com@HADOOP.COM',
        'hive.metastore.kerberos.principal' = 'hive/hadoop.hadoop.com@HADOOP.COM',
        'hadoop.security.authentication' = 'kerberos',
        'hadoop.kerberos.keytab' = '${BIGDATA_HOME}/FusionInsight_Doris_8.3.0/install/FusionInsight-Doris-1.2.3/doris-be/bin/doris.keytab',
        'hadoop.kerberos.principal' = 'doris/hadoop.hadoop.com@HADOOP.COM',
        'java.security.krb5.conf' = '${BIGDATA_HOME}/FusionInsight_BASE_*/1_16_KerberosClient/etc/krb5.conf',
        'hadoop.rpc.protection' = 'privacy'
        );
      • Kerberos authentication is disabled for the cluster (the cluster is in normal mode):
        CREATE CATALOG hive_catalog PROPERTIES (
        'type'='hms',
        'hive.metastore.uris' = 'thrift://192.168.67.161:21088',
        'hive.version' = '3.1.0',
        'hadoop.username' = 'hive',
        'yarn.resourcemanager.address' = '192.168.67.78:26004',
        'dfs.nameservices'='hacluster',
        'dfs.ha.namenodes.hacluster'='24,25',
        'dfs.namenode.rpc-address.hacluster.24'='192-168-67-172:25000',
        'dfs.namenode.rpc-address.hacluster.25'='192-168-67-78:25000',
        'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
        );
      Table 1 Catalog-related parameters

      • hive.metastore.uris: URI of Hive MetaStore, in the format thrift://<Hive MetaStore IP address>:<port number>. Multiple values are supported, separated by commas (,).

      • dfs.nameservices: NameService name of the cluster. The value can be found in hdfs-site.xml in the ${BIGDATA_HOME}/FusionInsight_HD_*/1_*_NameNode/etc directory on the node where NameNode is deployed.

      • dfs.ha.namenodes.hacluster: NameNode IDs of the cluster NameService, which contains two values. The value can be found in hdfs-site.xml in the ${BIGDATA_HOME}/FusionInsight_HD_*/1_*_NameNode/etc directory on the node where NameNode is deployed.

      • dfs.namenode.rpc-address.hacluster.xx1: RPC communication address of the active NameNode. Search for this configuration item in hdfs-site.xml in the ${BIGDATA_HOME}/FusionInsight_HD_*/1_*_NameNode/etc directory on the node where NameNode resides. xx is a value of dfs.ha.namenodes.hacluster.

      • dfs.namenode.rpc-address.hacluster.xx2: RPC communication address of the standby NameNode, obtained in the same way as the active NameNode address. xx is a value of dfs.ha.namenodes.hacluster.

      • dfs.client.failover.proxy.provider.hacluster: Java class used by the HDFS client to connect to the active NameNode in the cluster. The value is org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider.

      • hive.version: Hive version. To obtain it, log in to FusionInsight Manager, choose Cluster > Services > Hive, and view the version on the Dashboard page.

      • hadoop.rpc.protection: Whether the RPC streams of the Hadoop modules are encrypted. The default value is privacy. To obtain the value, log in to FusionInsight Manager, choose Cluster > Services > HDFS > Configurations, and search for hadoop.rpc.protection.

      • hive.metastore.sasl.enabled: Whether the MetaStore management permission is enabled. The value is true. This parameter applies only to clusters with Kerberos authentication enabled.

      • hive.server2.thrift.sasl.qop: Whether the interaction between HiveServer2 and the client is encrypted. The value is auth-conf. This parameter applies only to clusters with Kerberos authentication enabled.

      • hive.server2.authentication: Security authentication mode for accessing HiveServer. The value is KERBEROS. This parameter applies only to clusters with Kerberos authentication enabled.

      • hive.metastore.kerberos.principal: Principal of the keytab file for accessing the Hive cluster. This parameter applies only to clusters with Kerberos authentication enabled.
        • For versions earlier than MRS 3.5.0, the value is hive/hadoop.hadoop.com@HADOOP.COM.
        • For MRS 3.5.0 and later versions, the format is hive/hadoop.${System domain name in lowercase}@${System domain name}. To obtain the system domain name, log in to Manager, choose System > Permission > Domain and Mutual Trust, and view the value of Local Domain. For example, if the system domain name is A39A7DF8_953D_4772_B909_035A594FFA55.COM, the value of this parameter is hive/hadoop.a39a7df8_953d_4772_b909_035a594ffa55.com@A39A7DF8_953D_4772_B909_035A594FFA55.COM.

      • hadoop.security.authentication: Security authentication mode for accessing Hadoop. The value is KERBEROS. This parameter applies only to clusters with Kerberos authentication enabled.

      • hadoop.kerberos.keytab: Keytab file for accessing the Hadoop cluster, located at ${BIGDATA_HOME}/FusionInsight_Doris_*/install/FusionInsight-Doris-*/doris-be/bin/doris.keytab on the FE node. Copy this keytab file to the same directory on all FE and BE nodes, for example /home/omm/doris_keytab, and then run the following command to change its ownership:

        chown omm:wheel /home/omm/doris_keytab -R

        This parameter applies only to clusters with Kerberos authentication enabled.

      • hadoop.kerberos.principal: Principal of the keytab file for accessing the Hadoop cluster. This parameter applies only to clusters with Kerberos authentication enabled.
        • For versions earlier than MRS 3.5.0, the value is doris/hadoop.hadoop.com@HADOOP.COM.
        • For MRS 3.5.0 and later versions, the format is doris/hadoop.${System domain name in lowercase}@${System domain name}. To obtain the system domain name, log in to Manager, choose System > Permission > Domain and Mutual Trust, and view the value of Local Domain. For example, if the system domain name is A39A7DF8_953D_4772_B909_035A594FFA55.COM, the value of this parameter is doris/hadoop.a39a7df8_953d_4772_b909_035a594ffa55.com@A39A7DF8_953D_4772_B909_035A594FFA55.COM.

      • java.security.krb5.conf: Absolute path of the krb5 file, ${BIGDATA_HOME}/FusionInsight_BASE_*/1_*_KerberosClient/etc/krb5.conf. This parameter applies only to clusters with Kerberos authentication enabled.

      • hadoop.username: User name for accessing the Hadoop cluster. The value is hdfs. This parameter applies only to clusters with Kerberos authentication disabled.

    • Hive table data is stored in OBS. Run the following command to create a catalog. For details about the OBS-related parameter values, see step 1.

      Version prior to MRS 3.3.1:

      CREATE CATALOG hive_obs_catalog PROPERTIES (
      'type'='hms',
      'hive.version' = '3.1.0',
      'hive.metastore.uris' = 'thrift://192.168.67.161:21088',
      'hive.metastore.sasl.enabled' = 'true',
      'hive.server2.thrift.sasl.qop' = 'auth-conf',
      'hive.server2.authentication' = 'KERBEROS',
      'dfs.nameservices'='hacluster',
      'dfs.ha.namenodes.hacluster'='24,25',
      'dfs.namenode.rpc-address.hacluster.24'='Active NameNode IP Address:RPC Communication Port',
      'dfs.namenode.rpc-address.hacluster.25'='Standby NameNode IP Address:RPC Communication Port',
      'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
      'hive.metastore.kerberos.principal' = '${hive_keytab_principal}',
      'hadoop.security.authentication' = 'kerberos',
      'hadoop.kerberos.keytab' = '/home/omm/doris_keytab/doris.keytab',
      'hadoop.kerberos.principal' = '${doris_keytab_principal}',
      'java.security.krb5.conf' = '${BIGDATA_HOME}/FusionInsight_BASE_*/1_16_KerberosClient/etc/krb5.conf',
      'AWS_ACCESS_KEY' = 'AK',
      'AWS_SECRET_KEY' = 'SK',
      'AWS_ENDPOINT' = 'Endpoint address of OBS parallel file system',
      'AWS_REGION' = 'sa-fb-1',
      'hadoop.rpc.protection' = 'privacy'
      );

      MRS 3.3.1 and later versions:

      CREATE CATALOG hive_catalog PROPERTIES (
      'type'='hms',
      'hive.version' = '3.1.0',
      'hive.metastore.uris' = 'thrift://192.168.67.161:21088',
      'hive.metastore.sasl.enabled' = 'true',
      'hive.server2.thrift.sasl.qop' = 'auth-conf',
      'hive.server2.authentication' = 'KERBEROS',
      'dfs.nameservices'='hacluster',
      'dfs.ha.namenodes.hacluster'='24,25',
      'dfs.namenode.rpc-address.hacluster.24'='Active NameNode IP Address:RPC Communication Port',
      'dfs.namenode.rpc-address.hacluster.25'='Standby NameNode IP Address:RPC Communication Port',
      'dfs.client.failover.proxy.provider.hacluster'='org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider',
      'hive.metastore.kerberos.principal' = '${hive_keytab_principal}',
      'hadoop.security.authentication' = 'kerberos',
      'hadoop.kerberos.keytab' = '/home/omm/doris_keytab/doris.keytab',
      'hadoop.kerberos.principal' = '${doris_keytab_principal}',
      'java.security.krb5.conf' = '/opt/Bigdata/FusionInsight_BASE_8.5.0/1_16_KerberosClient/etc/krb5.conf',
      'obs.access_key' = 'AK',
      'obs.secret_key' = 'SK',
      'obs.endpoint' = 'Endpoint address of OBS parallel file system',
      'obs.region' = 'sa-fb-1',
      'hadoop.rpc.protection' = 'privacy'
      );

  4. Run the following commands to query Hive tables:

    • Run the following command to query Catalog:
      show catalogs;
    • Run the following command to query the database in the Catalog:
      show databases from hive_catalog;
    • Run the following command to switch to the Catalog and access the database:
      switch hive_catalog;

      Access the specified database:

      use default;
    • To query all tables in a database in Catalog, run the following command:
      show tables from `hive_catalog`.`default`;

      To query a specified table, run the following command:

      select * from `hive_catalog`.`default`.`test_table`;

      Run the following command to view the schema of the table:

      DESC test_table;

  5. After creating or operating on a Hive table, refresh its metadata in Doris:

    refresh catalog hive_catalog;
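
    Depending on the Doris version, metadata can also be refreshed at a finer granularity than the whole catalog. A sketch, with illustrative database and table names:

    ```sql
    -- Refresh a single database in the catalog.
    REFRESH DATABASE hive_catalog.`default`;

    -- Refresh a single table.
    REFRESH TABLE hive_catalog.`default`.test_table;
    ```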

  6. Run the following command to run a joint query with tables in other catalogs:

    SELECT h.h_shipdate FROM hive_catalog.default.htable h WHERE h.h_partkey IN (SELECT p_partkey FROM internal.db1.part) LIMIT 10;

    In the preceding command:

    • A table is fully qualified as catalog.database.table, for example, internal.db1.part.
    • catalog and database can be omitted; the current catalog and database set by the SWITCH and USE statements are used by default.
    • You can run the INSERT INTO command to insert data from a table in the Hive Catalog into an internal table in the Internal Catalog, thereby importing external data.
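
    The last point can be sketched as follows; the internal target table internal.db1.part_copy is illustrative and must already exist with a schema compatible with the Hive source table:

    ```sql
    -- Import data from a Hive Catalog table into an internal Doris table.
    INSERT INTO internal.db1.part_copy
    SELECT * FROM hive_catalog.default.htable;
    ```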