Configuring Spark to Read HBase Data

Scenario

Spark on HBase lets you query HBase tables from Spark SQL and write data to HBase tables by using the Beeline tool. You can also use the HBase APIs to create tables, read data from them, and insert data into them.

Spark on HBase

Log in to Manager and choose Cluster > Cluster Properties to check whether the cluster is in security mode.
- If yes, go to 2.
- If no, go to 5.

Choose Cluster > Services > Spark2x. Click Configurations, click All Configurations, click JDBCServer2x, select Default, and modify the following parameter:

**Table 1** Parameter list 1
Parameter	Default Value	Changed To
spark.yarn.security.credentials.hbase.enabled	false	true

To ensure that Spark2x can keep accessing HBase over long periods, do not modify the following parameters of the HBase and HDFS services:

dfs.namenode.delegation.token.renew-interval
dfs.namenode.delegation.token.max-lifetime
hbase.auth.key.update.interval
hbase.auth.token.max.lifetime (The value is fixed to 604800000 ms, that is, 7 days.)

If these parameters must be modified to meet service requirements, ensure that the value of the HDFS parameter dfs.namenode.delegation.token.renew-interval is not greater than the values of the HBase parameters hbase.auth.key.update.interval and hbase.auth.token.max.lifetime, or the value of the HDFS parameter dfs.namenode.delegation.token.max-lifetime.
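The ordering rule above can be checked mechanically before a configuration change is applied. The following is a minimal sketch rather than part of any Spark2x or HBase tooling: the function name renew_interval_violations is hypothetical, and the millisecond values are illustrative assumptions, except for hbase.auth.token.max.lifetime, which the text above fixes at 604800000 ms.

```python
# Sketch: verify that the HDFS renew interval does not exceed the other
# three token-related parameters. All values are in milliseconds.

RENEW_INTERVAL = "dfs.namenode.delegation.token.renew-interval"

# Parameters whose values must be greater than or equal to the renew interval.
UPPER_BOUNDS = (
    "hbase.auth.key.update.interval",
    "hbase.auth.token.max.lifetime",
    "dfs.namenode.delegation.token.max-lifetime",
)


def renew_interval_violations(params):
    """Return the names of parameters that are smaller than the renew interval."""
    renew = params[RENEW_INTERVAL]
    return [name for name in UPPER_BOUNDS if params[name] < renew]


example_params = {
    RENEW_INTERVAL: 86400000,                                   # assumed: 1 day
    "dfs.namenode.delegation.token.max-lifetime": 604800000,    # assumed: 7 days
    "hbase.auth.key.update.interval": 86400000,                 # assumed: 1 day
    "hbase.auth.token.max.lifetime": 604800000,                 # from the text: 7 days
}

violations = renew_interval_violations(example_params)
print(violations if violations else "constraint satisfied")
```

An empty result means the renew interval is within all three bounds. Any name that is returned identifies a parameter that would have to be raised, or the renew interval lowered, before the change is safe.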

Choose SparkResource2x > Default and modify the following parameter.

**Table 2** Parameter list 2
Parameter	Default Value	Changed To
spark.yarn.security.credentials.hbase.enabled	false	true

Restart the Spark2x service for the configuration to take effect.

To use the Spark on HBase function on the Spark2x client, you need to download and install the Spark2x client again.
On the Spark2x client, connect with spark-sql or spark-beeline to query tables created by Hive on HBase. You can create an HBase table by running SQL commands, or create an external table that is associated with an existing HBase table. Before creating the table, ensure that the corresponding HBase table already exists in HBase. The HBase table table1 is used as an example.
1. Run the following command in the Beeline tool to create a Spark table that maps to the HBase table:
  create table hbaseTable
  (
  id string,
  name string,
  age int
  )
  using org.apache.spark.sql.hbase.HBaseSource
  options(
  hbaseTableName "table1",
  keyCols "id",
  colsMapping "
  name=cf1.cq1,
  age=cf1.cq2
  ");
  - hbaseTable: name of the created Spark table
  - id string, name string, age int: field names and field types of the Spark table
  - table1: name of the HBase table
  - id: row key column name of the HBase table
  - name=cf1.cq1, age=cf1.cq2: mapping between columns in the Spark table and columns in the HBase table. The name column of the Spark table maps to the cq1 column in the cf1 column family of the HBase table, and the age column of the Spark table maps to the cq2 column in the cf1 column family of the HBase table.
2. Import data to the HBase table using a CSV file.
  hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,cf1:cq1,cf1:cq2,cf1:cq3,cf1:cq4,cf1:cq5 table1 /hperson
  
  table1 indicates the name of the HBase table and /hperson indicates the path where the CSV file is stored.
3. Query data in spark-sql or spark-beeline. hbaseTable is the corresponding Spark table name. The command is as follows:
  select * from hbaseTable;
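To make the relationship between the -Dimporttsv.columns value and a CSV line concrete, the following minimal sketch pairs each field of one line with the row key or with a column family and qualifier, by position. It only illustrates the positional mapping and is not the ImportTsv implementation: the function name map_line_to_cells and the sample line are hypothetical (the contents of /hperson are not shown here), and requiring the field count to match exactly is a simplification of this sketch.

```python
# Sketch: positional mapping between an importtsv.columns spec and one CSV line.


def map_line_to_cells(columns_spec, line, separator=","):
    """Return (row_key, {(family, qualifier): value}) for one input line."""
    names = columns_spec.split(",")
    values = line.rstrip("\n").split(separator)
    if len(values) != len(names):
        raise ValueError(
            "expected %d fields, got %d" % (len(names), len(values))
        )
    row_key = None
    cells = {}
    for name, value in zip(names, values):
        if name == "HBASE_ROW_KEY":
            # This field becomes the HBase row key (the id column in Spark).
            row_key = value
        else:
            family, qualifier = name.split(":", 1)
            cells[(family, qualifier)] = value
    return row_key, cells


# Column spec copied from the ImportTsv command above.
columns = "HBASE_ROW_KEY,cf1:cq1,cf1:cq2,cf1:cq3,cf1:cq4,cf1:cq5"
# Hypothetical sample line; the real file layout may differ.
sample_line = "1,Alice,30,a,b,c"

row_key, cells = map_line_to_cells(columns, sample_line)
print(row_key, cells[("cf1", "cq1")], cells[("cf1", "cq2")])
```

With the colsMapping used for hbaseTable, only cf1:cq1 and cf1:cq2 are exposed to Spark SQL, as name and age. The values loaded into cq3 through cq5 are stored in HBase but are not visible through this Spark table.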

Spark on HBaseV2

Log in to Manager and choose Cluster > Cluster Properties to check whether the cluster is in security mode.
- If yes, go to 2.
- If no, go to 5.

Click Cluster and click the name of the desired cluster. Choose Service > Spark2x, click Configurations, click All Configurations, and choose JDBCServer2x > Default. Modify the following parameter.

**Table 3** Parameter list 1
Parameter	Default Value	Changed To
spark.yarn.security.credentials.hbase.enabled	false	true

To ensure that Spark2x can keep accessing HBase over long periods, do not modify the following parameters of the HBase and HDFS services:

dfs.namenode.delegation.token.renew-interval
dfs.namenode.delegation.token.max-lifetime
hbase.auth.key.update.interval
hbase.auth.token.max.lifetime (The value is fixed to 604800000 ms, that is, 7 days.)

Choose SparkResource2x > Default and modify the following parameter.

**Table 4** Parameter list 2
Parameter	Default Value	Changed To
spark.yarn.security.credentials.hbase.enabled	false	true

Restart the Spark2x service for the configuration to take effect.

If you need to use the Spark on HBase function on the Spark2x client, download and install the Spark2x client again.
On the Spark2x client, connect with spark-sql or spark-beeline to query tables created by Hive on HBase. You can create an HBase table by running SQL commands, or create an external table that is associated with an existing HBase table. The HBase table table2 is used as an example.
1. Create a table using the spark-beeline tool.
  create table hbaseTable1
  (id string, name string, age int)
  using org.apache.spark.sql.hbase.HBaseSourceV2
  options(
  hbaseTableName "table2",
  keyCols "id",
  colsMapping "name=cf1.cq1,age=cf1.cq2");
  - hbaseTable1: name of the created Spark table
  - id string, name string, age int: field names and field types of the Spark table
  - table2: name of the HBase table
  - id: row key column name of the HBase table
  - name=cf1.cq1, age=cf1.cq2: mapping between columns in the Spark table and columns in the HBase table. The name column of the Spark table maps to the cq1 column in the cf1 column family of the HBase table, and the age column of the Spark table maps to the cq2 column in the cf1 column family of the HBase table.
2. Import data to the HBase table using a CSV file.
  hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator="," -Dimporttsv.columns=HBASE_ROW_KEY,cf1:cq1,cf1:cq2,cf1:cq3,cf1:cq4,cf1:cq5 table2 /hperson
  
  table2 indicates the name of the HBase table and /hperson indicates the path where the CSV file is stored.
3. Query data in spark-sql or spark-beeline. hbaseTable1 indicates the corresponding Spark table name.
  select * from hbaseTable1;
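When several Spark tables have to be mapped to HBase tables, the create table statement from step 1 can be generated from a small description of the columns instead of being typed by hand. The following is a minimal sketch under stated assumptions: build_create_table is a hypothetical helper, not part of Spark2x, and it only assembles the options shown above (hbaseTableName, keyCols, and colsMapping) into a string. It does not connect to a cluster.

```python
# Sketch: assemble the create table statement for a Spark table backed by HBase.


def build_create_table(spark_table, columns, hbase_table, key_col, mapping,
                       source="org.apache.spark.sql.hbase.HBaseSourceV2"):
    """Return the DDL text.

    columns: list of (name, type) pairs for the Spark table.
    mapping: dict of Spark column name -> (column family, qualifier).
    """
    column_names = [name for name, _ in columns]
    if key_col not in column_names:
        raise ValueError("key column %r is not a table column" % key_col)
    # Every non-key column needs a family.qualifier target in HBase.
    unmapped = [n for n in column_names if n != key_col and n not in mapping]
    if unmapped:
        raise ValueError("columns without a mapping: %s" % ", ".join(unmapped))

    col_defs = ", ".join("%s %s" % (name, typ) for name, typ in columns)
    cols_mapping = ",".join(
        "%s=%s.%s" % (col, family, qualifier)
        for col, (family, qualifier) in mapping.items()
    )
    return (
        "create table %s (%s) using %s options("
        'hbaseTableName "%s", keyCols "%s", colsMapping "%s");'
        % (spark_table, col_defs, source, hbase_table, key_col, cols_mapping)
    )


ddl = build_create_table(
    spark_table="hbaseTable1",
    columns=[("id", "string"), ("name", "string"), ("age", "int")],
    hbase_table="table2",
    key_col="id",
    mapping={"name": ("cf1", "cq1"), "age": ("cf1", "cq2")},
)
print(ddl)
```

The generated text can be pasted into spark-sql or spark-beeline. Passing source="org.apache.spark.sql.hbase.HBaseSource" yields the statement form used in the Spark on HBase section instead.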