
Development Plan

Overview

Assume that HBase table table1 stores a user's consumption amount for the current day and table2 stores the user's historical consumption amount.

In table1, "key=1" and "cf:cid=100" indicate that the consumption amount of user 1 for the current day is CNY 100.

In table2, "key=1" and "cf:cid=1000" indicate that the historical consumption amount of user 1 is CNY 1,000.

The Spark application must implement the following function:

Calculate a user's total consumption amount by adding the current day's consumption amount (CNY 100) to the historical consumption amount (CNY 1,000), matching the two tables on the user key.

After the application runs, table2 shows that the total consumption amount of user 1 (key=1) is CNY 1,100 (cf:cid=1100).

Preparing Data

Use the Spark-Beeline tool to create Spark tables table1 and table2, each mapped to an HBase table of the same name, and insert data into them through HBase.

  1. Ensure that JDBCServer is started. Log in to the Spark2x client node.
  2. Use the Spark-Beeline tool to create Spark table1.

    create table table1
    (
      key string,
      cid string
    )
    using org.apache.spark.sql.hbase.HBaseSource
    options(
      hbaseTableName "table1",
      keyCols "key",
      colsMapping "cid=cf.cid"
    );

  3. Run the following command in the HBase shell to insert data into table1.

    put 'table1', '1', 'cf:cid', '100'

  4. Use the Spark-Beeline tool to create Spark table2.

    create table table2
    (
      key string,
      cid string
    )
    using org.apache.spark.sql.hbase.HBaseSource
    options(
      hbaseTableName "table2",
      keyCols "key",
      colsMapping "cid=cf.cid"
    );

  5. Run the following command in the HBase shell to insert data into table2.

    put 'table2', '1', 'cf:cid', '1000'
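
Optionally, verify the inserted data by scanning both tables in the HBase shell. Row 1 of table1 should show cf:cid=100, and row 1 of table2 should show cf:cid=1000:

    scan 'table1'
    scan 'table2'

Because table1 and table2 are mapped to the HBase tables, the same rows can also be queried through Spark-Beeline, for example with select * from table1;.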

Development Guidelines

  1. Query the data in table1.
  2. Query the data in table2 using the key value of table1.
  3. Add the queried data.
  4. Write the result of the previous step to table2.
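
The four steps above map onto a small Spark job that reads table1 with TableInputFormat and updates table2 through the HBase client API. The following Scala sketch illustrates the flow under the data model above; it is not the packaged sample itself, the object name SparkHbasetoHbaseSketch is made up for illustration, and Kerberos login and error handling are omitted:

    import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
    import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put, Result}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.spark.{SparkConf, SparkContext}

    object SparkHbasetoHbaseSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SparkHbasetoHbase"))

        // Step 1: read table1 as an RDD of (row key, Result).
        val conf = HBaseConfiguration.create()
        conf.set(TableInputFormat.INPUT_TABLE, "table1")
        val table1Rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
          classOf[ImmutableBytesWritable], classOf[Result])

        // Steps 2-4: for each row of table1, look up the same key in table2,
        // add the two amounts, and write the sum back to table2.
        table1Rdd.foreachPartition { iter =>
          val cf = Bytes.toBytes("cf")
          val cid = Bytes.toBytes("cid")
          val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
          val table2 = conn.getTable(TableName.valueOf("table2"))
          try {
            iter.foreach { case (_, result) =>
              val key = result.getRow
              val today = Bytes.toString(result.getValue(cf, cid)).toLong
              val histBytes = table2.get(new Get(key)).getValue(cf, cid)
              val history = if (histBytes == null) 0L else Bytes.toString(histBytes).toLong
              val put = new Put(key)
              put.addColumn(cf, cid, Bytes.toBytes((today + history).toString))
              table2.put(put)
            }
          } finally {
            table2.close()
            conn.close()
          }
        }
        sc.stop()
      }
    }

Running this job once on the prepared data changes cf:cid of row 1 in table2 from 1000 to 1100, matching the expected result in the overview.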

Packaging the Project

  1. Use the Maven tool provided by IDEA to package the project and generate a JAR file; an equivalent command-line invocation is shown after this list. For details, see Commissioning a Spark Application in a Linux Environment.
  2. Upload the JAR file to any directory (for example, /opt/female/) on the server where the Spark client is located.
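
If you build from the command line rather than from IDEA, the standard Maven packaging command is run from the project root (the resulting JAR name, such as SparkHbasetoHbase-1.0.jar, is determined by the project's pom.xml):

    mvn clean package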

Running the Task

Go to the Spark client directory and run the following commands to invoke the bin/spark-submit script (the class name and file name must match those in your actual code; the commands below are only examples):

  • Run the Java or Scala sample code.

    bin/spark-submit --class com.huawei.bigdata.spark.examples.SparkHbasetoHbase --master yarn --deploy-mode client --conf spark.yarn.user.classpath.first=true /opt/female/SparkHbasetoHbase-1.0.jar

  • Run the Python sample project.
    • PySpark does not provide HBase APIs, so this sample uses Python to invoke the compiled Java code. Use Maven to package the provided Java code into a JAR file and place it in the same directory as the Python script. When running the Python program, pass the JAR file to --jars so that it is added to the classpath.

    bin/spark-submit --master yarn --deploy-mode client --conf spark.yarn.user.classpath.first=true --jars /opt/female/SparkHbasetoHbasePythonExample/SparkHbasetoHbase-1.0.jar,/opt/female/protobuf-java-2.5.0.jar /opt/female/SparkHbasetoHbasePythonExample/SparkHbasetoHbasePythonExample.py