
Scenario Description

Assume that the Hive table person stores a user's consumption amount for the current day, and that HBase table table2 stores the user's historical consumption data.

In the person table, the record name=1,account=100 indicates that user1's consumption amount for the current day is 100 CNY.

In table2, the record key=1,cf:cid=1000 indicates that user1's historical consumption amount is 1000 CNY.

Based on the service requirements, a Spark application must be developed to implement the following function:

Calculate a user's total consumption amount based on the user name, that is, total consumption amount = 100 (consumption amount of the current day) + 1000 (historical consumption amount).

In the preceding example, after the application runs, the record for user1 (key=1) in table2 becomes cf:cid=1100, that is, a total consumption amount of 1100 CNY.

Data Planning

Before developing the application, create a Hive table named person and insert data into it. Also create HBase table table2 so that the data analysis result can be written to it.

  1. Save original log files to HDFS.

    1. Create a blank log1.txt file on the local PC and write the following content to the file.
      1,100
    2. Create the /tmp/input directory in HDFS and upload the log1.txt file to the directory.
      1. On the HDFS client, run the following commands for authentication:

        cd /opt/client

        kinit -kt '/opt/client/Spark/spark/conf/user.keytab' <Service user for authentication>

        Specify the path of the user.keytab file based on the site requirements.

      2. On the HDFS client running the Linux OS, run the hadoop fs -mkdir /tmp/input command (or the hdfs dfs command) to create a directory.
      3. On the HDFS client running the Linux OS, run the hadoop fs -put log1.txt /tmp/input command to upload the data file.

  2. Store the imported data to the Hive table.

    Ensure that ThriftServer is started, and then use the Beeline tool to create the Hive table and insert data into it.

    1. Run the following command to create a Hive table named person:

      create table person
      (
      name STRING,
      account INT
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
      STORED AS TEXTFILE;

    2. Run the following command to load the data into the person table:

      load data inpath '/tmp/input/log1.txt' into table person;

  3. Create an HBase table.

    1. Run the following command in the HBase shell to create a table named table2:

      create 'table2', 'cf'

    2. Run the following command in the HBase shell to insert data into table2 (a client API equivalent is sketched after this list):

      put 'table2', '1', 'cf:cid', '1000'

    If Kerberos authentication is enabled, set spark.yarn.security.credentials.hbase.enabled to true in the client configuration file spark-defaults.conf and on the Spark JDBCServer.
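
The shell operations above can also be performed from application code. The following is a minimal Scala sketch of the equivalent put and get calls through the HBase client API; the object name Table2Check is illustrative (not part of the sample), and the sketch assumes the HBase client dependencies and the cluster's hbase-site.xml are on the classpath.

  import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
  import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
  import org.apache.hadoop.hbase.util.Bytes

  object Table2Check {
    def main(args: Array[String]): Unit = {
      // Reads hbase-site.xml from the classpath to locate the cluster.
      val conf = HBaseConfiguration.create()
      val connection = ConnectionFactory.createConnection(conf)
      val table = connection.getTable(TableName.valueOf("table2"))
      try {
        // Equivalent of: put 'table2', '1', 'cf:cid', '1000'
        val put = new Put(Bytes.toBytes("1"))
        put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("cid"), Bytes.toBytes("1000"))
        table.put(put)

        // Read the value back to verify that the row was written.
        val result = table.get(new Get(Bytes.toBytes("1")))
        println("cf:cid = " + Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("cid"))))
      } finally {
        table.close()
        connection.close()
      }
    }
  }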

Development Guidelines

  1. Query data in the person Hive table.
  2. Query data in table2, using the name value from the person table as the row key.
  3. Sum the records obtained in the preceding two steps.
  4. Write the result of the previous step to table2 (a consolidated sketch of these steps follows).
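
Putting the four steps together, the following is a minimal Scala sketch of one possible implementation, not the packaged sample application. It assumes that the name column doubles as the HBase row key (as in the example data), that hbase-site.xml is available on the executors, and that Kerberos credentials are handled by the spark.yarn.security.credentials.hbase.enabled setting described above. The object name SparkHiveToHbase is illustrative.

  import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
  import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
  import org.apache.hadoop.hbase.util.Bytes
  import org.apache.spark.sql.SparkSession

  object SparkHiveToHbase {
    def main(args: Array[String]): Unit = {
      // enableHiveSupport() lets Spark SQL resolve tables in the Hive metastore.
      val spark = SparkSession.builder()
        .appName("SparkHiveToHbase")
        .enableHiveSupport()
        .getOrCreate()

      // Step 1: query the person table.
      val rows = spark.sql("SELECT name, account FROM person").rdd

      // Steps 2 to 4: for each record, read the historical amount from table2,
      // add the current-day amount, and write the sum back.
      rows.foreachPartition { iter =>
        // One HBase connection per partition, opened on the executor.
        val conf = HBaseConfiguration.create()
        val connection = ConnectionFactory.createConnection(conf)
        val table = connection.getTable(TableName.valueOf("table2"))
        try {
          iter.foreach { row =>
            val key = row.getString(0)      // name doubles as the row key
            val todayAmount = row.getInt(1) // current-day consumption amount
            val result = table.get(new Get(Bytes.toBytes(key)))
            // The shell put stored the amount as a string, so decode it the same way;
            // treat a missing row as a historical amount of 0.
            val historyAmount = Option(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("cid")))
              .map(v => Bytes.toString(v).toInt)
              .getOrElse(0)
            val put = new Put(Bytes.toBytes(key))
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("cid"),
              Bytes.toBytes((historyAmount + todayAmount).toString))
            table.put(put)
          }
        } finally {
          table.close()
          connection.close()
        }
      }

      spark.stop()
    }
  }

foreachPartition is used so that one HBase connection is opened per partition rather than per record, which keeps connection overhead low when the person table grows beyond the single example row.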