Updated on 2023-05-30 GMT+08:00

Configuring an HBase Data Source

Scenario

This section describes how to add an HBase data source on HSConsole.

Prerequisites

  • The domain name of the cluster where the data source is located must be different from the HetuEngine cluster domain name.
  • The cluster where the data source is located and the HetuEngine cluster nodes can communicate with each other.
  • A HetuEngine compute instance has been created.
  • The SSL communication encryption configuration of ZooKeeper in the cluster where the data source is located must be the same as that of ZooKeeper in the cluster where HetuEngine is located.

    To check whether SSL communication encryption is enabled, log in to FusionInsight Manager, choose Cluster > Services > ZooKeeper > Configurations > All Configurations, and enter ssl.enabled in the search box. If the value of ssl.enabled is true, SSL communication encryption is enabled. If the value is false, SSL communication encryption is disabled.

Procedure

  1. Obtain the hbase-site.xml, hdfs-site.xml, and core-site.xml configuration files of the HBase data source.

    1. Log in to FusionInsight Manager of the cluster where the HBase data source is located.
    2. Choose Cluster > Dashboard.
    3. Choose More > Download Client and download the client file as prompted.
    4. Decompress the downloaded client file package and obtain the hbase-site.xml, core-site.xml, and hdfs-site.xml files in the FusionInsight_Cluster_1_Services_ClientConfig/HBase/config directory.
    5. If hbase.rpc.client.impl exists in the hbase-site.xml file, change its value to org.apache.hadoop.hbase.ipc.RpcClientImpl (the client invokes remote RPCs through RpcClientImpl):
      <property>
        <name>hbase.rpc.client.impl</name>
        <value>org.apache.hadoop.hbase.ipc.RpcClientImpl</value>
      </property>

      In addition, if the hdfs-site.xml and hbase-site.xml files reference the host name of a non-HetuEngine cluster node, you need to add the mapping between the referenced host name and the corresponding IP address to the /etc/hosts file of each node in the HetuEngine cluster. Otherwise, HetuEngine cannot connect to the node that is not in this cluster based on the host name.
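    The hbase.rpc.client.impl change above can be scripted when many client files need patching. The following standard-library sketch is illustrative only (the helper name and file path are assumptions, not product tooling); it sets or adds a Hadoop-style property in a *-site.xml file:

```python
import xml.etree.ElementTree as ET

def set_property(path, name, value):
    """Set (or add) a <property> entry in a Hadoop-style *-site.xml file."""
    tree = ET.parse(path)
    root = tree.getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # property not present yet: append a new <property> block
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    tree.write(path)

# Example (hypothetical path):
# set_property("hbase-site.xml", "hbase.rpc.client.impl",
#              "org.apache.hadoop.hbase.ipc.RpcClientImpl")
```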

  2. Obtain the user.keytab and krb5.conf files of the proxy user of the HBase data source.

    1. Log in to FusionInsight Manager of the cluster where the HBase data source is located.
    2. Choose System > Permission > User.
    3. Locate the row that contains the target data source user, click More in the Operation column, and select Download Authentication Credential.
    4. Decompress the downloaded package to obtain the user.keytab and krb5.conf files.

    The proxy user of the data source must have the permission to perform HBase operations.

  3. Log in to FusionInsight Manager as a HetuEngine administrator and choose Cluster > Services > HetuEngine. The HetuEngine service page is displayed.
  4. In the Basic Information area on the Dashboard page, click the link next to HSConsole WebUI. The HSConsole page is displayed.
  5. Choose Data Source and click Add Data Source. Configure parameters on the Add Data Source page.

    1. In the Basic Configuration area, configure Name and choose HBase for Data Source Type.
    2. Configure parameters in the HBase Configuration area. For details, see Table 1.
      Table 1 HBase Configuration

      | Parameter | Description | Example Value |
      | --- | --- | --- |
      | Driver | The default value is hbase-connector. | hbase-connector |
      | ZooKeeper Quorum Address | Service IP addresses of all quorumpeer instances of the ZooKeeper service for the data source. If the ZooKeeper service of the data source uses IPv6, you need to specify the client port number in the ZooKeeper quorum address. To view the IP addresses, log in to FusionInsight Manager, choose Cluster > Services > ZooKeeper > Instance, and check the hosts running the quorumpeer instances. | IPv4: 10.0.136.132,10.0.136.133,10.0.136.134; IPv6: [0:0:0:0:0:0:0:0]:24002 |
      | ZooKeeper Client Port Number | Port number of the ZooKeeper client. Log in to FusionInsight Manager, choose Cluster > Services > ZooKeeper, and check the value of clientPort on the Configurations tab page. | 2181 |
      | HBase RPC Communication Protection | Set this parameter based on the value of hbase.rpc.protection in the hbase-site.xml file obtained in 1: if the value is authentication, set this parameter to No; if the value is privacy, set it to Yes. | No |
      | Security Authentication Mechanism | After the security mode is enabled, the default value is KERBEROS. | KERBEROS |
      | Principal | Configure this parameter when the security authentication mechanism is enabled. Set it to the user to whom the user.keytab file obtained in 2 belongs. | user_hbase@HADOOP2.COM |
      | Keytab File | Configure this parameter when the security mode is enabled. It specifies the security authentication key. Select the user.keytab file obtained in 2. | user.keytab |
      | krb5 File | Configure this parameter when the security mode is enabled. It is the configuration file used for Kerberos authentication. Select the krb5.conf file obtained in 2. | krb5.conf |
      | hbase-site File | Configure this parameter when the security mode is enabled. It is the configuration file required for connecting to HBase. Select the hbase-site.xml file obtained in 1. | hbase-site.xml |
      | core-site File | Configure this parameter when the security mode is enabled. This file is required for connecting to HDFS. Select the core-site.xml file obtained in 1. | core-site.xml |
      | hdfs-site File | Configure this parameter when the security mode is enabled. This file is required for connecting to HDFS. Select the hdfs-site.xml file obtained in 1. | hdfs-site.xml |

    3. (Optional) Customize the configuration.
    4. Click OK.

  6. Log in to the node where the cluster client is located and run the following commands to switch to the client installation directory and authenticate the user:

    cd /opt/client

    source bigdata_env

    kinit User performing HetuEngine operations (If the cluster is in normal mode, skip this step.)

  7. Run the following command to log in to the catalog of the data source:

    hetu-cli --catalog Data source name --schema Database name

    For example, run the following command:

    hetu-cli --catalog hbase_1 --schema default

  8. Run the following command. If the table information is displayed or no error is reported, the connection is successful.

    show tables;

  9. Create a structured mapping table.

    The format of the statement for creating a mapping table is as follows:
    CREATE TABLE schemaName.tableName (
      rowId VARCHAR,
      qualifier1 TINYINT,
      qualifier2 SMALLINT,
      qualifier3 INTEGER,
      qualifier4 BIGINT,
      qualifier5 DOUBLE,
      qualifier6 BOOLEAN,
      qualifier7 TIME,
      qualifier8 DATE,
      qualifier9 TIMESTAMP
    )
    WITH (
      column_mapping = 'qualifier1:f1:q1,qualifier2:f1:q2,qualifier3:f2:q3,qualifier4:f2:q4,qualifier5:f2:q5,qualifier6:f3:q1,qualifier7:f3:q2,qualifier8:f3:q3,qualifier9:f3:q4',
      row_id = 'rowId',
      hbase_table_name = 'hbaseNamespace:hbaseTable',
      external = true
    );

    The value of schemaName must be the same as that of hbaseNamespace in hbase_table_name and can contain only lowercase letters.

    • Supported mapping tables: Mapping tables can be directly associated with tables in the HBase data source or created and associated with new tables that do not exist in the HBase data source.
    • Supported data types in a mapping table: VARCHAR, TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, BOOLEAN, TIME, DATE, and TIMESTAMP
    • The following table describes the keywords in the statements for creating mapping tables.
      Table 2 Keywords in the statements for creating mapping tables

      | Keyword | Type | Mandatory | Default Value | Remarks |
      | --- | --- | --- | --- | --- |
      | column_mapping | String | No | All columns belong to the same column family. | Specifies the mapping between columns in the mapping table and column families in the HBase data source table. To associate an existing table in the HBase data source, the value of column_mapping must be the same as that in the HBase data source. When creating a table that does not exist in the HBase data source, you need to specify column_mapping. |
      | row_id | String | No | First column in the mapping table | Column in the mapping table corresponding to the rowkey of the HBase data source table |
      | hbase_table_name | String | No | N/A | Tablespace and table name of the HBase data source table to be associated, separated by a colon (:). The default tablespace is default. When creating a new table that does not exist in the HBase data source, hbase_table_name does not need to be specified. |
      | external | Boolean | No | true | If external is set to true, the table is a mapping table of an existing table in the HBase data source, and deleting it does not delete the original HBase table. If external is set to false, deleting the Hetu-HBase table also deletes the table in the HBase data source. |
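To make the column_mapping format concrete: each comma-separated entry has the form hetuColumn:family:qualifier, as in the statement template above. The hypothetical parser below (illustrative only, not product code) splits such a string into per-column (family, qualifier) pairs:

```python
def parse_column_mapping(mapping):
    """Parse a column_mapping string into {hetu_column: (family, qualifier)}."""
    result = {}
    for entry in mapping.split(","):
        # each entry: <HetuEngine column>:<HBase column family>:<HBase qualifier>
        column, family, qualifier = entry.strip().split(":")
        result[column] = (family, qualifier)
    return result

# parse_column_mapping("qualifier1:f1:q1,qualifier2:f1:q2")
# -> {"qualifier1": ("f1", "q1"), "qualifier2": ("f1", "q2")}
```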

Data Type Mapping

HBase is a byte-based distributed storage system that stores all data as byte arrays. To represent HBase data in HetuEngine, create a mapping table in HetuEngine and, for each HetuEngine column, choose a data type that matches how the value of the corresponding HBase column qualifier is encoded.

Currently, HetuEngine column qualifiers support the following data types: VARCHAR, TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, BOOLEAN, TIME, DATE, and TIMESTAMP.
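A hedged illustration of why the column type must match the encoding: HBase's Bytes.toBytes(long) convention writes a long as 8 big-endian bytes, so the same stored bytes are meaningful only when read back with the matching type. This standard-library sketch is not product code:

```python
import struct

# 42 stored the way HBase Bytes.toBytes(42L) writes it: 8 big-endian bytes
raw = struct.pack(">q", 42)              # b'\x00\x00\x00\x00\x00\x00\x00*'
as_bigint = struct.unpack(">q", raw)[0]  # decoding with the matching type: 42
as_text = raw.decode("latin-1")          # decoding as text: NUL padding plus '*', not "42"
```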

Performance Optimization

  • Predicate pushdown

    Most query predicates can be pushed down, for example, point queries and range queries based on rowkeys.

    The following predicate conditions are supported: =, >=, >, <, <=, !=, IN, NOT IN, IS NULL, IS NOT NULL, and BETWEEN AND.

  • Batch GET query

    Multiple row keys to be queried are encapsulated into one List<Get> through the HBase API, and the whole list is sent as a single request, so that a separate request does not have to be initiated for each row key.
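    A minimal sketch of the batching idea (the helper and batch size are hypothetical; HetuEngine performs the equivalent internally through the HBase List<Get> API):

```python
def batched(row_keys, batch_size=100):
    """Group row keys so each batch becomes one multi-Get request
    instead of one request per row key."""
    for i in range(0, len(row_keys), batch_size):
        yield row_keys[i:i + batch_size]

# 250 row keys with batch_size=100 -> 3 requests instead of 250
```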

  • HBase single-table query range scanning optimization

    HBase single-table query range scanning optimization automatically infers the start and stop rowkeys from the predicate conditions on HBase columns and sets them as the scan range during tableScan, which improves access performance.

    For example, assume that the rowkey of an HBase data table consists of four columns, building_code:house_code:floor:uuid. For the search criteria building_code = '123' AND house_code = '456', the optimization scans only the rowkey range with prefix 123:456, improving performance.

    To enable the single-table query range scanning optimization, add the custom parameter hbase.rowkey.adaptive.optimization.enabled in 5.c and set it to true.

    In addition, you need to specify the rowkey columns and their separator in the table properties of the CREATE TABLE statement.

    Table 3 Columns and separators of HBase rowkeys

    | Table Property | Description | Example Value |
    | --- | --- | --- |
    | row_id_construct_columns | Columns that make up the rowkey of an HBase data table | building_code:house_code:floor:uuid |
    | row_id_construct_columns_terminal | Separator between the rowkey columns of an HBase data table | : |

    For example, a table creation statement for a rowkey consisting of the four columns building_code:house_code:floor:uuid is as follows:

    CREATE TABLE test.table_hbase_test (
      row_id string,
      col1 string,
      col2 string,
      col3 string,
      building_code string,
      house_code string,
      floor string,
      uuid string)
    WITH (
      column_mapping = 'col1:attr:col1,col2:attr:col2,col3:attr:col3,building_code:attr:building_code,house_code:attr:house_code,floor:attr:floor,uuid:attr:uuid',
      row_id = 'row_id',
      row_id_construct_columns = 'building_code:house_code:floor:uuid',
      row_id_construct_columns_terminal = ':',
      hbase_table_name = 'test:table_hbase_test',
      external = true)
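    The range inference described above can be sketched as follows. The function name and predicate representation are assumptions for illustration, not product code; it derives a scan range from equality predicates on the leading rowkey columns:

```python
def infer_scan_range(rowkey_columns, separator, predicates):
    """Build an HBase (start_row, stop_row) pair from equality predicates
    on a contiguous leading prefix of the rowkey columns."""
    parts = []
    for col in rowkey_columns:
        if col not in predicates:
            break  # only a contiguous leading prefix can narrow the scan
        parts.append(predicates[col])
    if not parts:
        return None  # no usable prefix: full table scan
    prefix = separator.join(parts)
    # Exclusive stop row: prefix with its last byte incremented.
    # (Simplified: ignores the 0xFF carry-over edge case.)
    stop = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, stop

# infer_scan_range(["building_code", "house_code", "floor", "uuid"], ":",
#                  {"building_code": "123", "house_code": "456"})
# -> ("123:456", "123:457")
```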
  • Dynamic filtering optimization for HBase multi-table join query

    HBase supports optimization of six operators: like, >, >=, <, <=, and =.

    To enable dynamic filtering, first enable the HBase single-table query range scanning optimization, then add the custom parameter dynamic_filtering_pushdown_callexpression to the coordinator.config.properties file of the compute instance and set it to true. For details, see 3.e.

Constraints

The ALTER and VIEW syntaxes are not supported.