Example
Scenario
Assume that the Hive table person stores a user's consumption amount for the current day, and the HBase table table2 stores the historical consumption data.
In table person, name=1 and account=100 indicate that the consumption amount of user 1 on the current day is 100 CNY.
In table2, key=1 and cf:cid=1000 indicate that the historical consumption amount of user 1 is 1000 CNY.
The Spark application must implement the following function:
Add the current consumption amount (100) to the historical consumption amount (1000).
The running result is that the total consumption amount of user 1 (key=1) in table2 is 1100 CNY (cf:cid=1100).
Data Preparation
Before developing the application, create the Hive table person and insert data into it, and create the HBase table table2.
- Place the source log file in HDFS.
- Create a blank file log1.txt locally and write the following content to it:
1,100
- Create the /tmp/input directory in HDFS and copy the log1.txt file to it.
- On a Linux-based HDFS client, run the hadoop fs -mkdir /tmp/input command (or hdfs dfs -mkdir /tmp/input) to create the /tmp/input directory.
- On a Linux-based HDFS client, run the hadoop fs -put log1.txt /tmp/input command to upload the data file.
- Ensure that JDBCServer is started. Use the Beeline command tool to create the Hive table and insert data into it.
- Run the following commands to create the Hive table person:
CREATE TABLE person
(
name STRING,
account INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' STORED AS TEXTFILE;
- Run the following command to insert the prepared data into the table:
LOAD DATA INPATH '/tmp/input/log1.txt' INTO TABLE person;
- Create an HBase table:
Ensure that JDBCServer is started. Use the Spark-beeline command tool to create an HBase table and insert data into it.
- Run the following commands to create the HBase table table2:
CREATE TABLE table2
(
key string,
cid string
)
using org.apache.spark.sql.hbase.HBaseSource
options(
hbaseTableName "table2",
keyCols "key",
colsMapping "cid=cf.cid");
- Run the following command to insert data into the table:
INSERT INTO table2 VALUES('1','1000');
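The declarations above can be illustrated in plain Python: the Hive clause FIELDS TERMINATED BY ',' splits each line of log1.txt into the (name, account) columns of person, and the colsMapping string "cid=cf.cid" ties the SQL column cid to HBase column family cf, qualifier cid. This is a sketch for illustration only; the real parsing is done by Hive and the HBaseSource connector.

```python
# Illustration only: Hive and the HBaseSource connector do the real parsing.

# FIELDS TERMINATED BY ',' splits each log1.txt line into (name, account).
line = "1,100"                      # the single row written to log1.txt
name, account = line.split(",")
account = int(account)              # column account is declared INT

# colsMapping "cid=cf.cid": SQL column cid is stored under HBase family
# "cf", qualifier "cid"; keyCols "key" makes column key the HBase row key.
sql_col, hbase_loc = "cid=cf.cid".split("=")
family, qualifier = hbase_loc.split(".")

print(name, account)                 # -> 1 100
print(sql_col, family, qualifier)    # -> cid cf cid
```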
Development Idea
- Query the data in the Hive table person.
- Query the data in table2 using the key value from table person.
- Add the queried amounts.
- Write the result of the preceding step back to table2.
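The four steps above can be sketched as a plain-Python simulation. No Spark or HBase is required here; the table contents are hardcoded from the scenario, whereas the real job reads them from Hive and HBase.

```python
# Hardcoded stand-ins for the two tables in the scenario.
person = {"1": 100}    # Hive table person: name -> current-day consumption
table2 = {"1": 1000}   # HBase table2: key -> historical consumption (cf:cid)

for name, amount in person.items():   # 1. query the data in person
    history = table2.get(name, 0)     # 2. query table2 by the same key
    table2[name] = history + amount   # 3. add the amounts
                                      # 4. write the result back to table2
print(table2)  # -> {'1': 1100}
```

The real application performs the same per-key read-add-write cycle, with the lookup and write-back going through the HBase client instead of a dict.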
Packaging the Project
- Use the Maven tool provided by IDEA to package the project and generate a JAR file. For details, see Compiling and Running the Application.
- Upload the JAR file to any directory (for example, /opt/female/) on the server where the Spark client is located.
Running Tasks
Go to the Spark client directory and run the following commands to invoke the bin/spark-submit script to run the code (the class name and file name must match those in the actual code; the following is only an example):
- Run the Java or Scala sample code.
bin/spark-submit --class com.huawei.bigdata.spark.examples.SparkHivetoHbase --master yarn --deploy-mode client /opt/female/SparkHivetoHbase-1.0.jar
- Run the Python sample program.
PySpark does not provide HBase-related APIs, so this sample uses Python to invoke Java code. Use Maven to package the provided Java code into a JAR file and place it in the same directory. When running the Python program, use --jars to load the JAR file to the classpath.
bin/spark-submit --master yarn --deploy-mode client --jars /opt/female/SparkHivetoHbasePythonExample/SparkHivetoHbase-1.0.jar /opt/female/SparkHivetoHbasePythonExample/SparkHivetoHbasePythonExample.py