Updated on 2022-09-14 GMT+08:00

Overview

Scenario

Develop a Spark application to perform the following operations on logs about dwell durations of netizens for shopping online:

  • Collect statistics on female netizens who dwell on online shopping for more than 2 hours at a weekend.
  • The first column in the log file records names, the second column records gender, and the third column records the dwell duration in the unit of minute. Three attributes are separated by commas (,).

    log1.txt: logs collected on Saturday

    LiuYang,female,20
    YuanJing,male,10
    GuoYijun,male,5
    CaiXuyu,female,50
    Liyuan,male,20
    FangBo,female,50
    LiuYang,female,20
    YuanJing,male,10
    GuoYijun,male,50
    CaiXuyu,female,50
    FangBo,female,60

    log2.txt: logs collected on Sunday

    LiuYang,female,20
    YuanJing,male,10
    CaiXuyu,female,50
    FangBo,female,50
    GuoYijun,male,5
    CaiXuyu,female,50
    Liyuan,male,20
    CaiXuyu,female,50
    FangBo,female,50
    LiuYang,female,20
    YuanJing,male,10
    FangBo,female,50
    GuoYijun,male,50
    CaiXuyu,female,50
    FangBo,female,60

Data Preparation

Save log files in the Hadoop distributed file system (HDFS).

  1. Create text files input_data1.txt and input_data2.txt on a local computer, and copy log1.txt to input_data1.txt and log2.txt to input_data2.txt.
  2. Create /tmp/input on the HDFS client path, and run the following commands to upload input_data1.txt and input_data2.txt to /tmp/input:
    1. On the HDFS client, run the following commands to obtain the safety authentication:

      cd /opt/hadoopclient

      source bigdata_env

      kinit <component service user>

    2. On the Linux HDFS client, run the hadoop fs -mkdir /tmp/input command (a hdfs dfs command provides the same function.) to create a directory.
    3. Go to the /tmp/input directory on the HDFS client, on the Linux HDFS client, run the hadoop fs -put input_data1.txt /tmp/input andhadoop fs -put input_data2.txt /tmp/input commands to upload data files.

Development Idea

Collects the information of female netizens who spend more than 2 hours in online shopping on the weekend from the log files.

The process includes:

  • Create a table and import the log files into the table.
  • Filter the data information of the time that female netizens spend online.
  • Aggregate the total time that each female netizen spends online.
  • Filter the information of female netizens who spend more than 2 hours online.

Configuration Operations Before Running

In security mode, the Spark Core sample code needs to read two files (user.keytab and krb5.conf). The user.keytab and krb5.conf files are authentication files in the security mode. Download the authentication credentials of the user principal on the FusionInsight Manager page. The user in the sample code is sparkuser, change the value to the prepared development user name.

Packaging the Project

  1. Upload the user.keytab and krb5.conf files to the server where the client is installed.
  1. Use the Maven tool provided by IDEA to pack the project and generate a JAR file. For details, see Compiling and Running the Application.
    • Before compilation and packaging, change the paths of the user.keytab and krb5.conf files in the sample code to the actual paths on the client server where the files are located. Example: /opt/female/user.keytab and /opt/female/krb5.conf.
    • To run the Python sample code, you do not need to use Maven for packaging. You only need to upload the user.keytab and krb5.conf files to the server where the client is located.
  2. Upload the JAR file to any directory (for example, /opt/female/) on the server where the Spark client is located.

Running Tasks

Go to the Spark client directory and run the following commands to invoke the bin/spark-submit script to run the code (the class name and file name must be the same as those in the actual code. The following is only an example):

  • Run the Scala and Java sample programs.

    bin/spark-submit --class com.huawei.bigdata.spark.examples.FemaleInfoCollection --master yarn --deploy-mode client /opt/female/SparkSqlScalaExample-1.0.jar <inputPath>

    <inputPath> indicates the input path in HDFS.

  • Run the Python sample program
    • The Python sample code does not provide authentication information. Configure --keytab and --principal to specify authentication information.

    bin/spark-submit --master yarn --deploy-mode client --keytab /opt/FIclient/user.keytab --principal sparkuser /opt/female/SparkPythonExample/SparkSQLPythonExample.py <inputPath>

    <inputPath> indicates the input path in HDFS.