Example
Scenario
Assume that the Hive table person stores a user's consumption amount for the current day, and the HBase table table2 stores the historical consumption data.
In table person, name=1 and account=100 indicate that the consumption amount of user 1 on the current day is 100 CNY.
In table2, key=1 and cf:cid=1000 indicate that the historical consumption amount of user 1 is 1000 CNY.
The Spark application must implement the following function:
Add the current consumption amount (100) to the historical consumption amount (1000).
The running result is that the total consumption amount of user 1 (key=1) in table2 is 1100 CNY (cf:cid=1100).
Data Preparation
Before developing the application, create the Hive table person and insert data into it, and create the HBase table table2.
- Place the source log file in HDFS.
- Create a blank file log1.txt locally and write the following content to it:
1,100
- Create the /tmp/input directory in HDFS and copy the log1.txt file to it.
- On a Linux-based HDFS client, run the hadoop fs -mkdir /tmp/input command (or hdfs dfs -mkdir /tmp/input) to create the /tmp/input directory.
- On a Linux-based HDFS client, run the hadoop fs -put log1.txt /tmp/input command to upload the data file.
- Ensure that JDBCServer is started. Use the Beeline command tool to create the Hive table and insert data into it.
- Run the following commands to create the Hive table person:
CREATE TABLE person
(
name STRING,
account INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' STORED AS TEXTFILE;
- Run the following command to insert the prepared data into the table:
LOAD DATA INPATH '/tmp/input/log1.txt' INTO TABLE person;
- Create an HBase table:
Ensure that JDBCServer is started. Use the Spark-beeline command tool to create an HBase table and insert data into it.
- Run the following commands to create the HBase table table2:
CREATE TABLE table2
(
key string,
cid string
)
using org.apache.spark.sql.hbase.HBaseSource
options(
hbaseTableName "table2",
keyCols "key",
colsMapping "cid=cf.cid");
- Run the following command to insert data into the table:
INSERT INTO table2 VALUES('1','1000');
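The declarations above can be illustrated in plain Python: the Hive clause FIELDS TERMINATED BY ',' splits each line of log1.txt into the (name, account) columns of person, and the colsMapping string "cid=cf.cid" ties the SQL column cid to HBase column family cf, qualifier cid. This is a sketch for illustration only; the real parsing is done by Hive and the HBaseSource connector.

```python
# Illustration only: Hive and the HBaseSource connector do the real parsing.

# FIELDS TERMINATED BY ',' splits each log1.txt line into (name, account).
line = "1,100"                      # the single row written to log1.txt
name, account = line.split(",")
account = int(account)              # column account is declared INT

# colsMapping "cid=cf.cid": SQL column cid is stored under HBase family
# "cf", qualifier "cid"; keyCols "key" makes column key the HBase row key.
sql_col, hbase_loc = "cid=cf.cid".split("=")
family, qualifier = hbase_loc.split(".")

print(name, account)                 # -> 1 100
print(sql_col, family, qualifier)    # -> cid cf cid
```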
Development Idea
- Query the data in the Hive table person.
- Query the data in table2 using the key value from table person.
- Add the queried amounts.
- Write the result of the preceding step back to table2.
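The four steps above can be sketched as a plain-Python simulation. No Spark or HBase is required here; the table contents are hardcoded from the scenario, whereas the real job reads them from Hive and HBase.

```python
# Hardcoded stand-ins for the two tables in the scenario.
person = {"1": 100}    # Hive table person: name -> current-day consumption
table2 = {"1": 1000}   # HBase table2: key -> historical consumption (cf:cid)

for name, amount in person.items():   # 1. query the data in person
    history = table2.get(name, 0)     # 2. query table2 by the same key
    table2[name] = history + amount   # 3. add the amounts
                                      # 4. write the result back to table2
print(table2)  # -> {'1': 1100}
```

The real application performs the same per-key read-add-write cycle, with the lookup and write-back going through the HBase client instead of a dict.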
Packaging the Project
- Use the Maven tool provided by IDEA to package the project and generate a JAR file. For details, see Compiling and Running the Application.
- Upload the JAR file to any directory (for example, /opt/female/) on the server where the Spark client is located.
Running Tasks
Go to the Spark client directory and run the following commands to invoke the bin/spark-submit script to run the code (the class name and file name must match those in the actual code; the following is only an example):
- Run the Java or Scala sample code.
bin/spark-submit --class com.huawei.bigdata.spark.examples.SparkHivetoHbase --master yarn --deploy-mode client /opt/female/SparkHivetoHbase-1.0.jar
- Run the Python sample program.
PySpark does not provide HBase-related APIs, so this sample uses Python to invoke Java code. Use Maven to package the provided Java code into a JAR file and place it in the same directory. When running the Python program, use --jars to load the JAR file to the classpath.
bin/spark-submit --master yarn --deploy-mode client --jars /opt/female/SparkHivetoHbasePythonExample/SparkHivetoHbase-1.0.jar /opt/female/SparkHivetoHbasePythonExample/SparkHivetoHbasePythonExample.py