Updated on 2022-09-14 GMT+08:00

Scenario Description

Develop a Spark application that processes logs of netizens' dwell time for online shopping over a weekend. The requirement and the log format are as follows:

  • Collect statistics on female netizens who spend more than 2 hours in total shopping online over the weekend.
  • In each log file, the first column records the name, the second column records the gender, and the third column records the dwell duration in minutes. Columns are separated by commas (,).

    log1.txt: logs collected on Saturday

    LiuYang,female,20
    YuanJing,male,10
    GuoYijun,male,5
    CaiXuyu,female,50
    Liyuan,male,20
    FangBo,female,50
    LiuYang,female,20
    YuanJing,male,10
    GuoYijun,male,50
    CaiXuyu,female,50
    FangBo,female,60

    log2.txt: logs collected on Sunday

    LiuYang,female,20
    YuanJing,male,10
    CaiXuyu,female,50
    FangBo,female,50
    GuoYijun,male,5
    CaiXuyu,female,50
    Liyuan,male,20
    CaiXuyu,female,50
    FangBo,female,50
    LiuYang,female,20
    YuanJing,male,10
    FangBo,female,50
    GuoYijun,male,50
    CaiXuyu,female,50
    FangBo,female,60
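To make the record layout concrete, here is a minimal sketch in plain Python (not Spark code) that splits one log line into its three fields. The names `DwellRecord` and `parse_line` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class DwellRecord(NamedTuple):
    """One log line: name, gender, dwell duration in minutes (hypothetical type)."""
    name: str
    gender: str
    minutes: int


def parse_line(line: str) -> DwellRecord:
    """Split a comma-separated log line into its three columns."""
    name, gender, minutes = (field.strip() for field in line.split(","))
    # The third column is the dwell duration in minutes, so convert it to int.
    return DwellRecord(name, gender, int(minutes))


record = parse_line("FangBo,female,60")
print(record.name, record.gender, record.minutes)
```

Converting the duration to an integer at parse time keeps the later summing step free of string handling.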

Data Planning

Save the original log files in HDFS.

  1. Create two text files, input_data1.txt and input_data2.txt, on a local computer. Copy the content of log1.txt to input_data1.txt and the content of log2.txt to input_data2.txt.
  2. Create the /tmp/input folder in HDFS and upload input_data1.txt and input_data2.txt to it as follows:
    1. On the HDFS client, run the following commands for authentication:

      cd /opt/client

      kinit -kt '/opt/client/Spark/spark/conf/user.keytab' <Service user for authentication>

      Specify the path of the user.keytab file based on the site requirements.

    2. On the Linux HDFS client, run the hadoop fs -mkdir /tmp/input command (or the equivalent hdfs dfs -mkdir /tmp/input command) to create the directory.
    3. On the Linux HDFS client, run the hadoop fs -put input_xxx.txt /tmp/input command to upload each data file.

Development Guidelines

Collect statistics on female netizens who dwell on online shopping for more than 2 hours on the weekend.

The application achieves this in the following steps:

  • Read the original file data.
  • Keep only the dwell-time records of female netizens.
  • Sum the total time that each female netizen spends online.
  • Select the female netizens whose total time exceeds 2 hours.
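The four steps can be sketched in plain Python to check the expected result before writing the Spark job. This is only an illustration of the logic, not the Spark implementation. The function name `female_over_threshold` and the 120-minute constant are assumptions introduced here, and the log strings are copied from the sample data above. In a Spark application, the same steps would typically map onto RDD operations such as `textFile`, `filter`, and `reduceByKey`.

```python
from collections import defaultdict
from typing import Dict, Iterable

THRESHOLD_MINUTES = 120  # 2 hours, expressed in the unit used by the logs

SATURDAY_LOG = """\
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
"""

SUNDAY_LOG = """\
LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
"""


def female_over_threshold(
    lines: Iterable[str], threshold: int = THRESHOLD_MINUTES
) -> Dict[str, int]:
    """Return total minutes per female netizen whose total exceeds the threshold."""
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:                      # step 1: read each record
        line = line.strip()
        if not line:
            continue
        name, gender, minutes = line.split(",")
        if gender == "female":              # step 2: keep female records only
            totals[name] += int(minutes)    # step 3: sum time per netizen
    # Step 4: keep only totals strictly greater than the threshold.
    return {name: total for name, total in totals.items() if total > threshold}


weekend_lines = SATURDAY_LOG.splitlines() + SUNDAY_LOG.splitlines()
print(female_over_threshold(weekend_lines))
```

With the sample logs, the totals for female netizens are LiuYang 80, CaiXuyu 300, and FangBo 320 minutes, so only CaiXuyu and FangBo exceed the 2-hour threshold.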