Updated on 2024-08-10 GMT+08:00

Development Plan

Overview

Develop a Spark application that analyzes logs recording how long netizens spend shopping online:

  • Collect statistics on female netizens who spend more than two hours shopping online over the weekend.
  • Each log file has three columns separated by commas (,). The first column contains the name, the second the gender, and the third the dwell duration in minutes.

    log1.txt: logs collected on Saturday

    LiuYang,female,20
    YuanJing,male,10
    GuoYijun,male,5
    CaiXuyu,female,50
    Liyuan,male,20
    FangBo,female,50
    LiuYang,female,20
    YuanJing,male,10
    GuoYijun,male,50
    CaiXuyu,female,50
    FangBo,female,60

    log2.txt: logs collected on Sunday

    LiuYang,female,20
    YuanJing,male,10
    CaiXuyu,female,50
    FangBo,female,50
    GuoYijun,male,5
    CaiXuyu,female,50
    Liyuan,male,20
    CaiXuyu,female,50
    FangBo,female,50
    LiuYang,female,20
    YuanJing,male,10
    FangBo,female,50
    GuoYijun,male,50
    CaiXuyu,female,50
    FangBo,female,60
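
As a rough illustration of the three-column format described above, the following Python sketch parses one log line into its fields. The LogRecord type and the parse_line function are hypothetical names introduced for this example; they are not part of the sample project.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    """One log line: name, gender, dwell duration in minutes (hypothetical type)."""
    name: str
    gender: str
    minutes: int


def parse_line(line: str) -> LogRecord:
    """Split a comma-separated log line into its three columns."""
    name, gender, minutes = line.strip().split(",")
    return LogRecord(name, gender, int(minutes))


if __name__ == "__main__":
    record = parse_line("LiuYang,female,20")
    print(record.name, record.gender, record.minutes)
```

Converting the third column to an integer at parse time keeps the later summation simple.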

Preparing Data

Save the original log files in HDFS.

  1. Create two text files, input_data1.txt and input_data2.txt, on a local computer. Copy the content of log1.txt to input_data1.txt and the content of log2.txt to input_data2.txt.
  2. Create the /tmp/input directory in HDFS and upload input_data1.txt and input_data2.txt to it:
    1. On the Linux HDFS client, run the hadoop fs -mkdir /tmp/input command (or the equivalent hdfs dfs command) to create the directory.
    2. With input_data1.txt and input_data2.txt in the current directory on the Linux HDFS client, run the hadoop fs -put input_data1.txt /tmp/input and hadoop fs -put input_data2.txt /tmp/input commands to upload the data files.
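
The upload steps above could be scripted roughly as follows. This is only a sketch: build_upload_commands and run_upload are hypothetical helper names, and the script assumes that the hadoop command is on the PATH of the HDFS client and that both data files are in the current directory.

```python
import subprocess

HDFS_DIR = "/tmp/input"
LOCAL_FILES = ("input_data1.txt", "input_data2.txt")


def build_upload_commands(hdfs_dir, local_files):
    """Return the hadoop fs commands from the steps above as argument lists."""
    commands = [["hadoop", "fs", "-mkdir", hdfs_dir]]
    for path in local_files:
        commands.append(["hadoop", "fs", "-put", path, hdfs_dir])
    return commands


def run_upload(hdfs_dir=HDFS_DIR, local_files=LOCAL_FILES):
    """Run each command in order (assumes a configured HDFS client).

    check=True raises CalledProcessError at the first failing command.
    """
    for command in build_upload_commands(hdfs_dir, local_files):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch is safe to try.
    for command in build_upload_commands(HDFS_DIR, LOCAL_FILES):
        print(" ".join(command))
```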

Development Guidelines

Collect statistics on female netizens who spend more than two hours shopping online over the weekend.

The process is as follows:

  • Create a table and import the log files into the table.
  • Filter the records to keep only the online time of female netizens.
  • Sum the total time that each female netizen spends online.
  • Filter the results to keep only the female netizens whose total online time exceeds two hours (120 minutes).
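
The four steps above can be sketched with a single SQL query. The example below is not the sample project's code: it uses Python's built-in sqlite3 module as a stand-in for Spark SQL so that it runs without a cluster, and the table name shopping_log, its column names, and the function female_over_threshold are assumptions made for illustration. The data is the sample content of log1.txt and log2.txt.

```python
import sqlite3

# Sample records from log1.txt (Saturday) and log2.txt (Sunday).
LOG1 = """LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
"""

LOG2 = """LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
"""


def female_over_threshold(logs, threshold_minutes=120):
    """Return (name, total_minutes) for women whose total exceeds the threshold."""
    conn = sqlite3.connect(":memory:")
    try:
        # Step 1: create a table and import the log records.
        conn.execute(
            "CREATE TABLE shopping_log (name TEXT, gender TEXT, minutes INTEGER)"
        )
        rows = []
        for log in logs:
            for line in log.splitlines():
                if line.strip():
                    name, gender, minutes = line.strip().split(",")
                    rows.append((name, gender, int(minutes)))
        conn.executemany("INSERT INTO shopping_log VALUES (?, ?, ?)", rows)
        # Steps 2 to 4: keep female records, sum per name, keep totals above the threshold.
        query = """
            SELECT name, SUM(minutes) AS total_minutes
            FROM shopping_log
            WHERE gender = 'female'
            GROUP BY name
            HAVING SUM(minutes) > ?
            ORDER BY name
        """
        return conn.execute(query, (threshold_minutes,)).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for name, total in female_over_threshold([LOG1, LOG2]):
        print(f"{name},{total}")
```

With the sample data, this prints CaiXuyu,300 and FangBo,320. LiuYang totals 80 minutes across both days and is filtered out.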

Packaging the Project

  1. Use the Maven tool provided by IDEA to package the project and generate a JAR file. For details, see Commissioning a Spark Application in a Linux Environment.
  2. Upload the JAR file to any directory (for example, /opt/female/) on the server where the Spark client is located.

Running the Task

Go to the Spark client directory and run the bin/spark-submit script as follows. The class name and file name must match those in the actual code; the following commands are only examples.

  • Run the Scala and Java sample projects.
    • bin/spark-submit --class com.huawei.bigdata.spark.examples.FemaleInfoCollection --master yarn --deploy-mode client /opt/female/SparkSqlScalaExample-1.0.jar <inputPath>
    • <inputPath> indicates the input path in HDFS.
  • Run the Python sample project.
    • bin/spark-submit --master yarn --deploy-mode client /opt/female/SparkSQLPythonExample/SparkSQLPythonExample.py <inputPath>
    • <inputPath> indicates the input path in HDFS.
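
To show how <inputPath> is filled in, the sketch below assembles both commands as argument lists. The helper names jar_submit_command and python_submit_command are hypothetical, and using /tmp/input as the input path is an assumption based on the directory created in Preparing Data.

```python
def jar_submit_command(main_class, jar_path, input_path):
    """Build the spark-submit command for the Scala or Java sample project."""
    return [
        "bin/spark-submit",
        "--class", main_class,
        "--master", "yarn",
        "--deploy-mode", "client",
        jar_path,
        input_path,
    ]


def python_submit_command(script_path, input_path):
    """Build the spark-submit command for the Python sample project."""
    return [
        "bin/spark-submit",
        "--master", "yarn",
        "--deploy-mode", "client",
        script_path,
        input_path,
    ]


if __name__ == "__main__":
    input_path = "/tmp/input"  # assumption: the HDFS directory from Preparing Data
    print(" ".join(jar_submit_command(
        "com.huawei.bigdata.spark.examples.FemaleInfoCollection",
        "/opt/female/SparkSqlScalaExample-1.0.jar",
        input_path,
    )))
    print(" ".join(python_submit_command(
        "/opt/female/SparkSQLPythonExample/SparkSQLPythonExample.py",
        input_path,
    )))
```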