Development Plan
Overview
Assume that the Hive table person stores each user's consumption amount for the current day and that the HBase table table2 stores each user's historical consumption data.
In the person table, "name=1" and "account=100" indicate that user 1's consumption amount on the current day is CNY 100.
In table2, "key=1" and "cf:cid=1000" indicate that the historical consumption amount of user 1 is CNY 1,000.
The Spark application must implement the following function:
Based on the username, add the user's consumption amount for the current day (CNY 100) to the user's historical consumption amount (CNY 1,000) to obtain the total consumption amount.
After the application runs, the total consumption amount of user 1 (key=1) in table2 is CNY 1,100 (cf:cid=1100).
Preparing Data
Before developing the application, create the Hive table person and load data into it. Also create the HBase table table2, to which the application writes the analyzed data.
- Place the source log file in HDFS.
- Create a blank file named log1.txt on the local host and write the following content to it:
1,100
- Create a directory /tmp/input in HDFS and copy the log1.txt file to the directory.
- On the Linux HDFS client, run the hadoop fs -mkdir /tmp/input command (or the hdfs dfs command) to create a directory.
- On the Linux HDFS client, run the hadoop fs -put log1.txt /tmp/input command to upload the data file.
- Store the imported data in the Hive table.
Ensure that JDBCServer is started. Use the Beeline tool to create a Hive table and load data into it.
- Create the Hive table person.
create table person
(
name STRING,
account INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' STORED AS TEXTFILE;
- Load data into the person table.
load data inpath '/tmp/input/log1.txt' into table person;
- Create an HBase table.
Ensure that JDBCServer is started. Use the Spark-beeline tool to create an HBase table, and then insert data into it.
- Create the HBase table table2.
create table table2
(
key string,
cid string
)
using org.apache.spark.sql.hbase.HBaseSource
options(
hbaseTableName "table2",
keyCols "key",
colsMapping "cid=cf.cid");
- Insert data into table2. The put command uses HBase shell syntax, so run it in the HBase shell:
put 'table2', '1', 'cf:cid', '1000'
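To make the prepared data concrete, the following minimal sketch shows how the single line in log1.txt maps to the two columns declared for the person table (name STRING, account INT) when fields are separated by commas. It is plain Python for illustration only and does not call Hive; the function name parse_person_line is hypothetical.

```python
# Illustration only: mimics how a comma-delimited line maps to the
# person table columns (name STRING, account INT). Not a Hive API.

def parse_person_line(line):
    """Split one line of log1.txt into (name, account)."""
    name, account = line.rstrip("\n").split(",")
    return name, int(account)  # account is declared as INT in the table


row = parse_person_line("1,100")
print(row)  # ('1', 100)
```

Here the first field becomes name (user 1) and the second becomes account (CNY 100), matching the values described in the overview.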
Development Guidelines
- Query data in the person Hive table.
- Query data in table2 based on the key value in the person table.
- Add the queried data.
- Write the result of the previous step to table2.
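The four steps above can be sketched in plain Python as follows. This is only a model of the logic under stated assumptions: a list of tuples stands in for the person Hive table, a dictionary stands in for table2, and the function name update_totals is hypothetical. It does not use the Spark or HBase APIs that the sample project relies on, and skipping users without a history record is an assumption, not behavior confirmed by the sample.

```python
# In-memory stand-ins (assumptions for illustration, not Spark/HBase APIs):
# person: rows of (name, account); table2: row key -> {"cf:cid": value}.
person_rows = [("1", 100)]
table2 = {"1": {"cf:cid": "1000"}}


def update_totals(person_rows, table2):
    """Add each user's current-day amount to the history amount in table2."""
    for name, account in person_rows:         # step 1: read the person table
        row = table2.get(name)                # step 2: look up table2 by key
        if row is None or "cf:cid" not in row:
            continue                          # assumption: skip users with no history
        total = account + int(row["cf:cid"])  # step 3: add the two amounts
        row["cf:cid"] = str(total)            # step 4: write the sum back
    return table2


update_totals(person_rows, table2)
print(table2)  # {'1': {'cf:cid': '1100'}}
```

With the sample data, the current-day amount of 100 is added to the stored 1000, and the value of cf:cid for key 1 becomes 1100, which is the result described in the overview.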
Packaging the Project
- Use the Maven tool provided by IDEA to package the project and generate a JAR file. For details, see Commissioning a Spark Application in a Linux Environment.
- Upload the JAR file to any directory (for example, /opt/female/) on the server where the Spark client is located.
Running the Task
Go to the Spark client directory and invoke the bin/spark-submit script to run the code. The class name and file name must match those in the actual code; the following commands are only examples.
- Run Java or Scala sample code.
- bin/spark-submit --class com.huawei.bigdata.spark.examples.SparkHivetoHbase --master yarn --deploy-mode client /opt/female/SparkHivetoHbase-1.0.jar
- Run the Python sample project.
- PySpark does not provide HBase APIs, so this sample uses Python to invoke Java code. Use Maven to package the provided Java code into a JAR file and place it in the same directory as the Python script. When running the Python program, use --jars to load the JAR file to the classpath.
- bin/spark-submit --master yarn --deploy-mode client --jars /opt/female/SparkHivetoHbasePythonExample/SparkHivetoHbase-1.0.jar /opt/female/SparkHivetoHbasePythonExample/SparkHivetoHbasePythonExample.py