Configuring an HBase Data Source

Scenario

This section describes how to add an HBase data source on HSConsole.

Prerequisites

The domain name of the cluster where the data source is located must be different from the HetuEngine cluster domain name.
The cluster where the data source is located and the HetuEngine cluster nodes can communicate with each other.
In the /etc/hosts file on every node of the HetuEngine cluster, add the mappings between the host names and IP addresses of the data source cluster. Also add an entry in the format 10.10.10.10 hadoop.System domain name (for example, 10.10.10.10 hadoop.hadoop.com). Otherwise, HetuEngine cannot use host names to connect to nodes outside its own cluster.
A HetuEngine compute instance has been created.

The SSL communication encryption configuration of ZooKeeper in the cluster where the data source is located must be the same as that of ZooKeeper in the cluster where HetuEngine is located.

To check whether SSL communication encryption is enabled, log in to FusionInsight Manager, choose Cluster > Services > ZooKeeper > Configurations > All Configurations, and enter ssl.enabled in the search box. If the value of ssl.enabled is true, SSL communication encryption is enabled. If the value is false, SSL communication encryption is disabled.
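
The /etc/hosts prerequisite can be checked before the data source is added. The following Python sketch is one possible way to do this and is not part of HetuEngine. The function name missing_host_mappings and the sample host names are hypothetical, and the function only inspects hosts-file text that is passed to it.

```python
from typing import Iterable, List


def missing_host_mappings(hosts_text: str, required_hosts: Iterable[str]) -> List[str]:
    """Return the required host names that have no entry in the hosts-file text."""
    mapped = set()
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        entry = line.split("#", 1)[0].strip()
        if not entry:
            continue
        fields = entry.split()
        # fields[0] is the IP address; the remaining fields are host names and aliases.
        mapped.update(name.lower() for name in fields[1:])
    return [host for host in required_hosts if host.lower() not in mapped]


if __name__ == "__main__":
    # Hypothetical sample content; on a real node, read /etc/hosts instead.
    sample = """
    127.0.0.1   localhost
    10.10.10.10 hadoop.hadoop.com
    10.10.10.11 datasource-node1   # data source cluster node
    """
    print(missing_host_mappings(sample, ["hadoop.hadoop.com", "datasource-node2"]))
```

An empty result means every required host name already has a mapping in the text that was checked.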

Procedure

Obtain the hbase-site.xml, hdfs-site.xml, and core-site.xml configuration files of the HBase data source.
1. Log in to FusionInsight Manager of the cluster where the HBase data source is located.
2. In the upper right corner of the homepage, click Download Client to download the complete client as prompted.
3. Decompress the downloaded client file package and obtain the hbase-site.xml, core-site.xml, and hdfs-site.xml files in the FusionInsight_Cluster_1_Services_ClientConfig/HBase/config directory.
Obtain the user.keytab and krb5.conf files of the proxy user of the HBase data source.
1. Log in to FusionInsight Manager of the cluster where the HBase data source is located.
2. Choose System > Permission > User.
3. Locate the row that contains the target data source user, click More in the Operation column, and select Download Authentication Credential.
4. Decompress the downloaded package to obtain the user.keytab and krb5.conf files.
The proxy user of the data source must have the permission to perform HBase operations.
Log in to FusionInsight Manager as a HetuEngine administrator and choose Cluster > Services > HetuEngine. The HetuEngine service page is displayed.
In the Basic Information area on the Dashboard page, click the link next to HSConsole WebUI. The HSConsole page is displayed.

Choose Data Source and click Add Data Source. Configure parameters on the Add Data Source page.

In the Basic Configuration area, configure Name and choose HBase for Data Source Type.

Configure parameters in the HBase Configuration area. For details, see Table 1.

**Table 1** HBase Configuration
Parameter	Description	Example Value
Driver	The default value is hbase-connector.	hbase-connector
ZooKeeper Quorum Address	Service IP addresses of all quorumpeer instances of the ZooKeeper service for the data source. If the ZooKeeper service of the data source uses IPv6, you need to specify the client port number in the ZooKeeper Quorum address. Log in to FusionInsight Manager, choose Cluster > Services > ZooKeeper > Instance, and view the IP addresses of all the hosts housing the quorumpeer instances.	IPv4: 10.10.10.10,10.10.10.11,10.10.10.12 IPv6: [10:10::10:11]:24002
ZooKeeper Client Port Number	Port number of the ZooKeeper client. Log in to FusionInsight Manager, choose Cluster > Services > ZooKeeper, and check the value of clientPort on the Configurations tab page.	2181
HBase RPC Communication Protection	Set this parameter based on the value of hbase.rpc.protection in the hbase-site.xml file obtained from the client package of the data source cluster. If the value is authentication, set this parameter to No. If the value is privacy, set this parameter to Yes.	No
Security Authentication Mechanism	After the security mode is enabled, the default value is KERBEROS.	KERBEROS
Principal	Configure this parameter when the security authentication mechanism is enabled. Set it to the user who owns the user.keytab file downloaded as the authentication credential of the proxy user.	user_hbase@HADOOP2.COM
Keytab File	Configure this parameter when the security mode is enabled. It specifies the security authentication key. Select the user.keytab file downloaded as the authentication credential of the proxy user.	user.keytab
krb5 File	Configure this parameter when the security mode is enabled. It is the configuration file used for Kerberos authentication. Select the krb5.conf file downloaded as the authentication credential of the proxy user.	krb5.conf
hbase-site File	Configure this parameter when the security mode is enabled. This file is required for connecting to HBase. Select the hbase-site.xml file obtained from the client package of the data source cluster.	hbase-site.xml
core-site File	Configure this parameter when the security mode is enabled. This file is required for connecting to HDFS. Select the core-site.xml file obtained from the client package of the data source cluster.	core-site.xml
hdfs-site File	Configure this parameter when the security mode is enabled. This file is required for connecting to HDFS. Select the hdfs-site.xml file obtained from the client package of the data source cluster.	hdfs-site.xml
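
The value for HBase RPC Communication Protection can be read from the downloaded hbase-site.xml file instead of being looked up by hand. The following Python sketch assumes the standard Hadoop configuration layout (property elements with name and value children). The function names are hypothetical, and only the two values described in Table 1 are handled.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def read_hadoop_property(xml_text: str, name: str) -> Optional[str]:
    """Return the value of one property from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def rpc_protection_setting(xml_text: str) -> str:
    """Map hbase.rpc.protection to the Yes/No choice described in Table 1."""
    value = read_hadoop_property(xml_text, "hbase.rpc.protection")
    if value == "authentication":
        return "No"
    if value == "privacy":
        return "Yes"
    # Table 1 only describes the two values above, so anything else is rejected.
    raise ValueError(f"unhandled hbase.rpc.protection value: {value!r}")


if __name__ == "__main__":
    # Hypothetical minimal file content; a real hbase-site.xml has many more properties.
    sample = """
    <configuration>
      <property>
        <name>hbase.rpc.protection</name>
        <value>privacy</value>
      </property>
    </configuration>
    """
    print(rpc_protection_setting(sample))
```

Raising an error for any other value is a deliberate choice in this sketch: it avoids guessing a setting that the table does not describe.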

(Optional) Customize the configuration.
Click OK.

Log in to the node where the cluster client is located and run the following commands to switch to the client installation directory and authenticate the user:

cd /opt/client

source bigdata_env

kinit User performing HetuEngine operations

If the cluster is in normal mode, skip the kinit command.
Run the following command to log in to the catalog of the data source:

hetu-cli --catalog Data source name --schema Database name

For example, run the following command:

hetu-cli --catalog hbase_1 --schema default
Run the following command. If the database table information can be viewed or no error is reported, the connection is successful.

show tables;

Create a structured mapping table.

The format of the statement for creating a mapping table is as follows:

CREATE TABLE schemaName.tableName (
  rowId VARCHAR,
  qualifier1 TINYINT,
  qualifier2 SMALLINT,
  qualifier3 INTEGER,
  qualifier4 BIGINT,
  qualifier5 DOUBLE,
  qualifier6 BOOLEAN,
  qualifier7 TIME,
  qualifier8 DATE,
  qualifier9 TIMESTAMP
)
WITH (
column_mapping = 'qualifier1:f1:q1,qualifier2:f1:q2,qualifier3:f2:q3,qualifier4:f2:q4,qualifier5:f2:q5,qualifier6:f3:q1,qualifier7:f3:q2,qualifier8:f3:q3,qualifier9:f3:q4',
row_id = 'rowId',
hbase_table_name = 'hbaseNamespace:hbaseTable',
external = true
);

The value of schemaName must be the same as that of hbaseNamespace in hbase_table_name.

Supported mapping tables: A mapping table can either be associated with an existing table in the HBase data source, or be created together with a new table that does not yet exist in the HBase data source.
Supported data types in a mapping table: VARCHAR, TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, BOOLEAN, TIME, DATE, and TIMESTAMP

The following table describes the keywords in the statements for creating mapping tables.

**Table 2** Keywords in the statements for creating mapping tables
Keyword	Type	Mandatory	Default Value	Remarks
column_mapping	String	No	All columns belong to one column family named Family.	Mapping between the columns in the mapping table and the column families and columns in the HBase data source table. To associate the mapping table with an existing table in the HBase data source, the mapping must match the column families and column names of that table. To create a table that does not exist in the HBase data source, define the mapping yourself. Value format: Mapping table column name:HBase column family:HBase column name. Mapping table column names must be in lowercase. HBase column names must be the same as those in HBase.
row_id	String	No	First column in the mapping table	Name of the mapping table column that corresponds to the rowkey of the table in the HBase data source
hbase_table_name	String	No	N/A	Namespace and table name of the HBase data source table to be associated, separated by a colon (:). The default namespace is default. If you create a new table that does not exist in the HBase data source, you do not need to specify hbase_table_name.
external	Boolean	No	true	If external is set to true, the table is a mapping table in the HBase data source and the original table in the HBase data source cannot be deleted. If external is set to false, the table in the HBase data source is deleted when the Hetu-HBase table is deleted.
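
Long column_mapping values are easy to mistype, so it can help to generate them. The following Python sketch builds the value in the format described in Table 2 and checks two of the stated rules: mapping table column names are lowercase, and schemaName equals the namespace in hbase_table_name. Both function names are hypothetical helpers and do not belong to any HetuEngine API.

```python
from typing import Iterable, Tuple


def build_column_mapping(columns: Iterable[Tuple[str, str, str]]) -> str:
    """Join (mapping column, HBase family, HBase column) triples into a column_mapping value."""
    parts = []
    for column, family, qualifier in columns:
        if column != column.lower():
            raise ValueError(f"mapping table column name must be lowercase: {column}")
        parts.append(f"{column}:{family}:{qualifier}")
    return ",".join(parts)


def check_schema_matches_namespace(schema_name: str, hbase_table_name: str) -> None:
    """Raise ValueError if schemaName differs from the namespace in hbase_table_name."""
    namespace, separator, _table = hbase_table_name.partition(":")
    if not separator:
        # No colon: the table is in the default namespace.
        namespace = "default"
    if schema_name != namespace:
        raise ValueError(
            f"schema {schema_name!r} does not match HBase namespace {namespace!r}"
        )


if __name__ == "__main__":
    mapping = build_column_mapping(
        [("qualifier1", "f1", "q1"), ("qualifier2", "f1", "q2"), ("qualifier3", "f2", "q3")]
    )
    check_schema_matches_namespace("hbaseNamespace", "hbaseNamespace:hbaseTable")
    print(f"column_mapping = '{mapping}'")
```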

Data Type Mapping

HBase is a byte-based distributed storage system that stores all data types as byte arrays. To represent HBase data in HetuEngine, create a mapping table in HetuEngine and give each HetuEngine column a data type that matches the values stored under the corresponding HBase column qualifier.

Currently, HetuEngine column qualifiers support the following data types: VARCHAR, TINYINT, SMALLINT, INTEGER, BIGINT, DOUBLE, BOOLEAN, TIME, DATE, and TIMESTAMP.
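
Because HBase stores only byte arrays, the declared column type decides how the bytes are interpreted. The following Python sketch illustrates this in general terms. It does not reproduce how the hbase-connector actually serializes values, which this section does not describe. The 8-byte big-endian integer layout is an assumption chosen for illustration (it is the layout that the HBase Bytes.toBytes(long) utility produces), and the function names are hypothetical.

```python
import struct


def decode_as_varchar(raw: bytes) -> str:
    """Interpret the stored bytes as UTF-8 text."""
    return raw.decode("utf-8")


def decode_as_bigint(raw: bytes) -> int:
    """Interpret the stored bytes as an 8-byte big-endian signed integer (assumed layout)."""
    (value,) = struct.unpack(">q", raw)
    return value


if __name__ == "__main__":
    stored = "12345678".encode("utf-8")  # eight ASCII digits, so exactly 8 bytes
    # The same bytes give unrelated results depending on the declared type.
    print(decode_as_varchar(stored))
    print(decode_as_bigint(stored))
```

The two printed values differ even though the underlying bytes are identical, which is why the data type in the mapping table must match how the value was written.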

Performance Optimization

Predicate pushdown
Queries support pushdown of most operators. The following predicate conditions are supported: =, >=, >, <, <=, !=, IN, NOT IN, IS NULL, IS NOT NULL, and BETWEEN AND.
Batch GET query
The row keys to be queried are wrapped into a single List<Get> and submitted to the HBase API in one request, so that no separate request is needed for each row key.
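
The effect of batching GET requests can be pictured with a small model. The following Python sketch uses a hypothetical in-memory class, FakeHBaseTable, which only counts requests. It is not the HBase client API and only shows why one batched call replaces many single calls.

```python
from typing import Dict, List, Optional


class FakeHBaseTable:
    """In-memory stand-in for an HBase table that counts client requests."""

    def __init__(self, rows: Dict[str, dict]) -> None:
        self._rows = rows
        self.request_count = 0

    def get(self, row_key: str) -> Optional[dict]:
        """Fetch one row; every call counts as one request."""
        self.request_count += 1
        return self._rows.get(row_key)

    def get_many(self, row_keys: List[str]) -> List[Optional[dict]]:
        """Fetch several rows; the whole list counts as one request."""
        self.request_count += 1
        return [self._rows.get(key) for key in row_keys]


if __name__ == "__main__":
    data = {"r1": {"q1": 1}, "r2": {"q1": 2}, "r3": {"q1": 3}}
    keys = ["r1", "r2", "r3"]

    one_by_one = FakeHBaseTable(data)
    single_results = [one_by_one.get(key) for key in keys]

    batched = FakeHBaseTable(data)
    batch_results = batched.get_many(keys)

    # Same rows either way, but a different number of requests.
    print(single_results == batch_results)
    print(one_by_one.request_count, batched.request_count)
```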

HBase single-table query range scanning optimization

This optimization automatically infers the start and end rowkeys from the predicate conditions on HBase columns and sets them as the start and end positions of the HBase scan during tableScan, which improves access performance.

For example, assume that the rowkey of the HBase data table consists of four columns: building_code:house_code:floor:uuid. For the search criteria where building_code = '123' and house_code = '456', the HetuEngine single-table query optimization scans only the rows whose rowkeys start with the prefix 123:456, improving performance.

To enable the HBase single-table query range scanning optimization, add the custom parameter hbase.rowkey.adaptive.optimization.enabled when customizing the data source configuration (the optional Customize the configuration step in the procedure) and set it to true.

In addition, specify the columns that make up the rowkey and their separator in the table properties of the table creation statement.

**Table 3** Columns and separators of HBase rowkeys
Table Property	Description	Example Value
row_id_construct_columns	Columns of rowkeys in an HBase data table	building_code:house_code:floor:uuid
row_id_construct_columns_terminal	Separator of columns of rowkeys in an HBase data table	:

For example, a table creation statement containing a rowkey consisting of four columns building_code:house_code:floor:uuid is as follows:

CREATE TABLE test.table_hbase_test (
row_id string,
col1 string,
col2 string,
col3 string,
building_code string,
house_code string,
floor string,
uuid string)
WITH (column_mapping = '
col1:attr:col1,
col2:attr:col2,
col3:attr:col3,
building_code:attr:building_code,
house_code:attr:house_code,
floor:attr:floor,
uuid:attr:uuid',
row_id = 'row_id',
row_id_construct_columns = 'building_code:house_code:floor:uuid',
row_id_construct_columns_terminal = ':',
hbase_table_name='test:table_hbase_test',
external = true)
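
How a scan range might be derived from such a rowkey layout can be sketched as follows. This Python example illustrates the general prefix-scan idea under stated assumptions and is not HetuEngine's actual implementation. The function names are hypothetical, only equality predicates on the leading rowkey columns are considered, and the separator is appended to the prefix so that 123:456 does not also match 123:4567.

```python
from typing import Mapping, Optional, Sequence, Tuple


def next_prefix(prefix: bytes) -> bytes:
    """Return the smallest byte string that sorts after every key starting with prefix."""
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""  # no upper bound exists; an empty stop row means scan to the end
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def infer_rowkey_range(
    columns: Sequence[str], separator: str, equals: Mapping[str, str]
) -> Optional[Tuple[bytes, bytes]]:
    """Derive (start_row, stop_row) from equality predicates on leading rowkey columns."""
    values = []
    for column in columns:
        if column not in equals:
            break  # only an unbroken leading run of columns forms a usable prefix
        values.append(equals[column])
    if not values:
        return None  # no leading predicate, so the scan range cannot be narrowed
    prefix = separator.join(values)
    if len(values) < len(columns):
        prefix += separator  # close the last matched column
    start = prefix.encode("utf-8")
    return start, next_prefix(start)


if __name__ == "__main__":
    rowkey_columns = ["building_code", "house_code", "floor", "uuid"]
    predicates = {"building_code": "123", "house_code": "456"}
    print(infer_rowkey_range(rowkey_columns, ":", predicates))
```

With the predicates from the example above, the inferred range starts at 123:456: and stops just before 123:456; (the stop row is exclusive), so only rows for that building and house are scanned.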

Dynamic filtering optimization for HBase multi-table join query
HBase supports dynamic filtering optimization.

To enable dynamic filtering, first enable the HBase single-table query range scanning optimization. Then add the custom parameter enable-dynamic-filtering to the coordinator.config.properties and worker.config.properties parameter files of the compute instance and set it to true. For details, see 3.e.
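
This section does not describe how dynamic filtering works internally. As general background, the technique collects the join key values from the smaller side of a join at run time and uses them to skip non-matching rows while scanning the larger side. The following Python sketch shows that idea only. All names and sample rows are hypothetical, and it does not represent HetuEngine's implementation.

```python
from typing import Dict, Iterable, List, Set

Row = Dict[str, str]


def collect_join_keys(build_rows: Iterable[Row], key: str) -> Set[str]:
    """Gather the distinct join key values seen on the smaller (build) side."""
    return {row[key] for row in build_rows}


def filtered_scan(probe_rows: Iterable[Row], key: str, allowed: Set[str]) -> List[Row]:
    """Scan the larger (probe) side, keeping only rows whose key can match."""
    return [row for row in probe_rows if row[key] in allowed]


def join_with_dynamic_filter(build_rows: List[Row], probe_rows: List[Row], key: str) -> List[Row]:
    """Join two row lists on key, filtering the probe side before joining."""
    allowed = collect_join_keys(build_rows, key)
    joined = []
    for probe in filtered_scan(probe_rows, key, allowed):
        for build in build_rows:
            if build[key] == probe[key]:
                joined.append({**build, **probe})
    return joined


if __name__ == "__main__":
    owners = [{"house_code": "456", "owner": "a"}]
    houses = [
        {"house_code": "456", "floor": "3"},
        {"house_code": "789", "floor": "1"},
    ]
    print(join_with_dynamic_filter(owners, houses, "house_code"))
```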