Development Plan

Scenarios

Assume that the Hive table person stores each user's consumption amount for the current day and the HBase table table2 stores the history consumption data.

In the person table, "name=1" and "account=100" indicate that user 1's consumption amount on the current day is CNY 100.

In table2, "key=1" and "cf:cid=1000" indicate that the history consumption amount of user 1 is CNY 1,000.

The Spark application must implement the following function:

To determine a user's total consumption amount, add their current day's consumption of CNY 100 to their history consumption amount of CNY 1,000, based on their username.

After the application runs, table2 shows a total consumption amount of CNY 1,100 for user 1 (key=1, cf:cid=1100).

Preparing Data

Before developing the application, create the Hive table person and load data into it. Then create the HBase table table2 and insert the history data into it.

Save original log files to HDFS.
1. Create a blank log1.txt file on the local PC and write the following content to the file.
```
1,100
```
2. Create a directory /tmp/input in HDFS and copy the log1.txt file to the directory.
  1. On the HDFS client, run the following commands to complete security authentication:
    cd /opt/hadoopclient
    kinit <Service user for authentication>
  2. On the Linux HDFS client, run the hadoop fs -mkdir /tmp/input command (or the hdfs dfs command) to create a directory.
  3. On the Linux HDFS client, run the hadoop fs -put log1.txt /tmp/input command to upload the data file.
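The local file from step 1 can also be generated and sanity-checked with a short script before it is uploaded. The following is a minimal Python sketch and is not part of the sample project: the helper names write_log and parse_log are hypothetical, and the file is written to a temporary directory rather than to the client's working directory. It checks that every line splits into exactly two comma-separated fields, matching the name and account columns used in this scenario.

```python
import os
import tempfile


def write_log(path, rows):
    """Write (name, account) pairs as comma-separated lines, one per row."""
    with open(path, "w", encoding="utf-8") as f:
        for name, account in rows:
            f.write(f"{name},{account}\n")


def parse_log(path):
    """Read the file back and split each line on ',' into (name, account)."""
    parsed = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            fields = line.split(",")
            if len(fields) != 2:
                raise ValueError(f"expected 2 fields, got {len(fields)}: {line!r}")
            parsed.append((fields[0], int(fields[1])))
    return parsed


# Hypothetical location; the document creates log1.txt on the local PC.
log_path = os.path.join(tempfile.mkdtemp(), "log1.txt")
write_log(log_path, [("1", 100)])
rows = parse_log(log_path)
print(rows)
```

A line that fails the two-field check would raise an error here, which is easier to spot locally than after the file has been loaded into a table.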
Store the imported data in the Hive table.

Ensure that JDBCServer is started. Use the Beeline tool to create a Hive table and load data into it.
1. Run the following commands to create the Hive table person:
  create table person
  (
  name STRING,
  account INT
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' STORED AS TEXTFILE;
2. Run the following command to load the data into the table:
  load data inpath '/tmp/input/log1.txt' into table person;
Create an HBase table.

Ensure that JDBCServer is started. Use the Spark-Beeline tool to create an HBase table and insert data into it.
1. Run the following commands to create the HBase table table2:
  create table table2
  (
  key string,
  cid string
  )
  using org.apache.spark.sql.hbase.HBaseSource
  options(
  hbaseTableName "table2",
  keyCols "key",
  colsMapping "cid=cf.cid"
  );
2. Run the following command in the HBase shell to insert data into the table:
  put 'table2', '1', 'cf:cid', '1000'

Development Idea

Query the data in Hive table person.
Query the data in table2 using the key value of table person.
Add the queried data.
Write the results of the preceding step to table2.
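
The four steps above can be sketched in plain Python with in-memory stand-ins for the two tables. This only illustrates the data flow and is not the sample project's Spark code: the list hive_person and the dict hbase_table2 are hypothetical substitutes for the Hive and HBase tables, the function name update_totals is invented for this sketch, and skipping users that have no history row is an assumption.

```python
# Stand-in for the Hive table person: rows of (name, account).
hive_person = [("1", 100)]

# Stand-in for HBase table2: row key -> {"cf:cid": value}, with values kept as strings.
hbase_table2 = {"1": {"cf:cid": "1000"}}


def update_totals(person_rows, table2):
    """Add each user's current-day amount to the history amount in table2."""
    for name, account in person_rows:         # step 1: rows queried from person
        row = table2.get(name)                # step 2: look up table2 by key
        if row is None or "cf:cid" not in row:
            continue  # assumption: users without a history row are skipped
        total = account + int(row["cf:cid"])  # step 3: add the two amounts
        row["cf:cid"] = str(total)            # step 4: write the result back
    return table2


update_totals(hive_person, hbase_table2)
print(hbase_table2["1"]["cf:cid"])  # prints 1100
```

In a real application the lookup and the write-back go through the HBase client inside the Spark job, so the per-row logic stays the same while the storage calls change.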

Configuration Operations Before Running

In security mode, the Spark Core sample code needs to read two authentication files: user.keytab and krb5.conf. Download the authentication credentials of the user principal from the FusionInsight Manager page. The user in the sample code is sparkuser; change this value to the name of the development user you have prepared.

Packaging the Project

Upload the user.keytab and krb5.conf files to the server where the client is installed.
Use the Maven tool provided by IDEA to package the project and generate a JAR file. For details, see Writing and Running the Spark Program in the Linux Environment.
- Before compilation and packaging, change the paths of the user.keytab and krb5.conf files in the sample code to the actual paths on the client server where the files are located. Example: /opt/female/user.keytab and /opt/female/krb5.conf.
- Before running the sample project, set the spark.yarn.security.credentials.hbase.enabled configuration item to true in the spark-defaults.conf configuration file of the Spark client. (The default value is false. Changing the value to true does not affect existing services.) If you want to uninstall the HBase service, change the value back to false first.
Upload the JAR file to any directory (for example, /opt/female/) on the server where the Spark client is located.
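
Whether the client configuration already carries the spark.yarn.security.credentials.hbase.enabled setting required in this section can be checked with a few lines of Python. This is a sketch under assumptions: the function name hbase_credentials_enabled is hypothetical, the sample text stands in for the real spark-defaults.conf file, and the parser handles only the simple "key value" and "key=value" line forms.

```python
import re

KEY = "spark.yarn.security.credentials.hbase.enabled"


def hbase_credentials_enabled(conf_text):
    """Return True if the last occurrence of KEY in the text is set to true."""
    enabled = False  # the document states that the default value is false
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first '=' (with optional spaces) or run of whitespace.
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        if len(parts) == 2 and parts[0] == KEY:
            enabled = parts[1].strip().lower() == "true"
    return enabled


# Illustrative fragment, not a real client file.
sample = """
# HBase credentials for Spark on YARN
spark.yarn.security.credentials.hbase.enabled true
"""
print(hbase_credentials_enabled(sample))
```

To check a real client, read the contents of its spark-defaults.conf file and pass the text to the function.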

Running Tasks

Go to the Spark client directory and run the following commands to invoke the bin/spark-submit script to run the code. The class name and file name must match those in the actual code; the following commands are only examples.

Run the Java or Scala sample code.
bin/spark-submit --class com.huawei.bigdata.spark.examples.SparkHivetoHbase --master yarn --deploy-mode client /opt/female/SparkHivetoHbase-1.0.jar
Run the Python sample project.
- PySpark does not provide HBase-related APIs. Therefore, Python is used to invoke Java code in this sample. Use Maven to package the provided Java code into a JAR file and place it in the same directory as the Python script. When running the Python program, use --jars to load the JAR file to the classpath.
- The Python sample code does not provide authentication information. Configure --keytab and --principal to specify authentication information.
bin/spark-submit --master yarn --deploy-mode client --keytab /opt/FIclient/user.keytab --principal sparkuser --jars /opt/female/SparkHivetoHbasePythonExample/SparkHivetoHbase-1.0.jar /opt/female/SparkHivetoHbasePythonExample/SparkHivetoHbasePythonExample.py
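
The two invocations above differ only in a few arguments, which the following Python sketch makes explicit by assembling both command lines from their parts. The helper build_submit_command is hypothetical and only builds the argument list; it does not run spark-submit. The paths, class name, and principal are the example values from this section and must be replaced with the actual ones.

```python
import shlex


def build_submit_command(app, main_class=None, keytab=None, principal=None, jars=None):
    """Assemble a spark-submit argument list for YARN client mode."""
    cmd = ["bin/spark-submit"]
    if main_class:
        cmd += ["--class", main_class]  # needed for the Java or Scala JAR
    cmd += ["--master", "yarn", "--deploy-mode", "client"]
    if keytab and principal:
        cmd += ["--keytab", keytab, "--principal", principal]
    if jars:
        cmd += ["--jars", ",".join(jars)]  # --jars takes a comma-separated list
    cmd.append(app)
    return cmd


java_cmd = build_submit_command(
    "/opt/female/SparkHivetoHbase-1.0.jar",
    main_class="com.huawei.bigdata.spark.examples.SparkHivetoHbase",
)
python_cmd = build_submit_command(
    "/opt/female/SparkHivetoHbasePythonExample/SparkHivetoHbasePythonExample.py",
    keytab="/opt/FIclient/user.keytab",
    principal="sparkuser",
    jars=["/opt/female/SparkHivetoHbasePythonExample/SparkHivetoHbase-1.0.jar"],
)
print(shlex.join(java_cmd))
print(shlex.join(python_cmd))
```

The Java or Scala run names the main class with --class, while the Python run instead passes the authentication options and loads the Java JAR through --jars.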