Updated on 2024-12-13 GMT+08:00

Configuring Spark to Read HBase Data

Scenario

Spark on HBase allows you to query HBase tables in Spark SQL and to write data to HBase tables by using the Beeline tool. You can also use HBase APIs to create tables, read data from them, and insert data into them.

Spark On HBase

  1. Log in to Manager and choose Cluster > Cluster Properties to check whether the cluster is in security mode.

    • If yes, go to 2.
    • If no, go to 5.

  2. Choose Cluster > Services > Spark2x, click Configurations, click All Configurations, choose JDBCServer2x > Default, and modify the following parameter:

    Table 1 Parameter list 1

    Parameter                                      Default Value   Changed To
    spark.yarn.security.credentials.hbase.enabled  false           true

    To ensure that Spark2x can access HBase for a long time, do not modify the following parameters of the HBase and HDFS services:

    • dfs.namenode.delegation.token.renew-interval
    • dfs.namenode.delegation.token.max-lifetime
    • hbase.auth.key.update.interval
    • hbase.auth.token.max.lifetime (The value is fixed to 604800000 ms, that is, 7 days.)

    If these parameters must be modified based on service requirements, ensure that the value of the HDFS parameter dfs.namenode.delegation.token.renew-interval is not greater than the values of the HBase parameters hbase.auth.key.update.interval and hbase.auth.token.max.lifetime, or of the HDFS parameter dfs.namenode.delegation.token.max-lifetime.
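As a worked check of the constraint above, the sketch below compares a hypothetical set of values in milliseconds. The four parameter names come from the HDFS and HBase configuration discussed above; the values themselves are illustrative assumptions, not recommended settings.

```python
# Hypothetical configuration values in milliseconds (illustrative only).
params = {
    "dfs.namenode.delegation.token.renew-interval": 86400000,   # 1 day (HDFS)
    "dfs.namenode.delegation.token.max-lifetime": 604800000,    # 7 days (HDFS)
    "hbase.auth.key.update.interval": 86400000,                 # 1 day (HBase)
    "hbase.auth.token.max.lifetime": 604800000,                 # fixed 7 days (HBase)
}

renew_interval = params["dfs.namenode.delegation.token.renew-interval"]
upper_bounds = [
    params["hbase.auth.key.update.interval"],
    params["hbase.auth.token.max.lifetime"],
    params["dfs.namenode.delegation.token.max-lifetime"],
]

# The renew interval must not exceed any of the other three values.
ok = renew_interval <= min(upper_bounds)
print(ok)  # True for this set of values
```

With these sample values the check passes; shortening any of the three upper-bound parameters below the renew interval would break long-running Spark2x access to HBase.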

  3. Choose SparkResource2x > Default and modify the following parameter.

    Table 2 Parameter list 2

    Parameter                                      Default Value   Changed To
    spark.yarn.security.credentials.hbase.enabled  false           true

  4. Restart the Spark2x service for the configuration to take effect.

    To use the Spark on HBase function on the Spark2x client, you need to download and install the Spark2x client again.

  5. On the Spark2x client, use spark-sql or spark-beeline to query tables created by Hive on HBase. You can create an HBase table by running SQL commands, or create an external table to associate with an existing HBase table. Before creating the association, ensure that the target table already exists in HBase. The HBase table table1 is used as an example.

    1. Run the following commands in the Beeline tool to create the Spark table associated with the HBase table:

      create table hbaseTable
      (
        id string,
        name string,
        age int
      )
      using org.apache.spark.sql.hbase.HBaseSource
      options(
        hbaseTableName "table1",
        keyCols "id",
        colsMapping "name=cf1.cq1,age=cf1.cq2"
      );

      • hbaseTable: name of the Spark table to create
      • id string, name string, age int: column names and types of the Spark table
      • table1: name of the HBase table
      • id: row key column of the HBase table
      • name=cf1.cq1, age=cf1.cq2: mapping between columns in the Spark table and columns in the HBase table. The name column of the Spark table maps to the cq1 column in the cf1 column family of the HBase table, and the age column maps to the cq2 column in the same column family.
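To illustrate the colsMapping format, the sketch below parses such a string into a dictionary from Spark column name to (column family, qualifier). parse_cols_mapping is a hypothetical helper written for this example, not a Spark or HBase API:

```python
def parse_cols_mapping(mapping: str) -> dict:
    """Parse a colsMapping string such as "name=cf1.cq1,age=cf1.cq2"
    into {spark_column: (column_family, qualifier)}."""
    result = {}
    for pair in mapping.split(","):
        # Each pair looks like "sparkCol=family.qualifier".
        spark_col, hbase_col = pair.strip().split("=")
        family, qualifier = hbase_col.split(".")
        result[spark_col] = (family, qualifier)
    return result

mapping = parse_cols_mapping("name=cf1.cq1,age=cf1.cq2")
print(mapping)  # {'name': ('cf1', 'cq1'), 'age': ('cf1', 'cq2')}
```

This matches the example above: name maps to cf1:cq1 and age maps to cf1:cq2, while the row key column (id) is named separately via keyCols.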
    2. Import data to the HBase table using a CSV file.

      hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,cf1:cq1,cf1:cq2,cf1:cq3,cf1:cq4,cf1:cq5 table1 /hperson

      table1 indicates the name of the HBase table and /hperson indicates the path where the CSV file is stored.
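The CSV file must list fields in the same order as the -Dimporttsv.columns option: row key first, then the mapped qualifiers cf1:cq1 through cf1:cq5. The sketch below generates two illustrative rows in that order with Python's csv module; the field values are made up for this example:

```python
import csv
import io

# Field order matches -Dimporttsv.columns:
# HBASE_ROW_KEY, cf1:cq1 (name), cf1:cq2 (age), cf1:cq3..cq5 (other columns).
rows = [
    ["u001", "Alice", "28", "x", "y", "z"],
    ["u002", "Bob",   "31", "x", "y", "z"],
]

buf = io.StringIO()
# The default delimiter "," matches -Dimporttsv.separator="," above.
writer = csv.writer(buf)
writer.writerows(rows)
print(buf.getvalue())
```

A file in this shape, uploaded to the HDFS path passed to ImportTsv (for example /hperson), loads each row with the first field as the HBase row key.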

    3. Query data in spark-sql or spark-beeline. hbaseTable is the corresponding Spark table name. The command is as follows:

      select * from hbaseTable;

Spark on HBaseV2

  1. Log in to Manager and choose Cluster > Cluster Properties to check whether the cluster is in security mode.

    • If yes, go to 2.
    • If no, go to 5.

  2. Click Cluster and click the name of the desired cluster. Choose Services > Spark2x, click Configurations, click All Configurations, and choose JDBCServer2x > Default. Modify the following parameter.

    Table 3 Parameter list 1

    Parameter                                      Default Value   Changed To
    spark.yarn.security.credentials.hbase.enabled  false           true

    To ensure that Spark2x can access HBase for a long time, do not modify the following parameters of the HBase and HDFS services:

    • dfs.namenode.delegation.token.renew-interval
    • dfs.namenode.delegation.token.max-lifetime
    • hbase.auth.key.update.interval
    • hbase.auth.token.max.lifetime (The value is fixed to 604800000 ms, that is, 7 days.)

    If these parameters must be modified based on service requirements, ensure that the value of the HDFS parameter dfs.namenode.delegation.token.renew-interval is not greater than the values of the HBase parameters hbase.auth.key.update.interval and hbase.auth.token.max.lifetime, or of the HDFS parameter dfs.namenode.delegation.token.max-lifetime.

  3. Choose SparkResource2x > Default and modify the following parameter.

    Table 4 Parameter list 2

    Parameter                                      Default Value   Changed To
    spark.yarn.security.credentials.hbase.enabled  false           true

  4. Restart the Spark2x service for the configuration to take effect.

    If you need to use the Spark on HBase function on the Spark2x client, download and install the Spark2x client again.

  5. On the Spark2x client, use spark-sql or spark-beeline to query tables created by Hive on HBase. You can create an HBase table by running SQL commands, or create an external table to associate with an existing HBase table, as described below. The HBase table table2 is used as an example.

    1. Create a Spark table associated with the HBase table using the spark-beeline tool.

      create table hbaseTable1
      (
        id string,
        name string,
        age int
      )
      using org.apache.spark.sql.hbase.HBaseSourceV2
      options(
        hbaseTableName "table2",
        keyCols "id",
        colsMapping "name=cf1.cq1,age=cf1.cq2"
      );

      • hbaseTable1: name of the Spark table to create
      • id string, name string, age int: column names and types of the Spark table
      • table2: name of the HBase table
      • id: row key column of the HBase table
      • name=cf1.cq1, age=cf1.cq2: mapping between columns in the Spark table and columns in the HBase table. The name column of the Spark table maps to the cq1 column in the cf1 column family of the HBase table, and the age column maps to the cq2 column in the same column family.
    2. Import data to the HBase table using a CSV file.

      hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,cf1:cq1,cf1:cq2,cf1:cq3,cf1:cq4,cf1:cq5 table2 /hperson

      table2 indicates the name of the HBase table and /hperson indicates the path where the CSV file is stored.

    3. Query data in spark-sql or spark-beeline. hbaseTable1 indicates the corresponding Spark table name.

      select * from hbaseTable1;