Updated on 2022-07-11 GMT+08:00

Typical Scenarios

Scenario

Develop a MapReduce application that processes logs recording how long netizens dwell on online shopping. The requirement and the log format are as follows:

  • Collect statistics on female netizens who dwell on online shopping for more than 2 hours over the weekend.
  • In each log file, the first column records the name, the second column records the sex, and the third column records the dwell duration in minutes. The three columns are separated by commas (,).

log1.txt: logs collected on Saturday

LiuYang,female,20 
YuanJing,male,10 
GuoYijun,male,5 
CaiXuyu,female,50 
Liyuan,male,20 
FangBo,female,50 
LiuYang,female,20 
YuanJing,male,10 
GuoYijun,male,50 
CaiXuyu,female,50 
FangBo,female,60

log2.txt: logs collected on Sunday

LiuYang,female,20 
YuanJing,male,10 
CaiXuyu,female,50 
FangBo,female,50 
GuoYijun,male,5 
CaiXuyu,female,50 
Liyuan,male,20 
CaiXuyu,female,50 
FangBo,female,50 
LiuYang,female,20 
YuanJing,male,10 
FangBo,female,50 
GuoYijun,male,50 
CaiXuyu,female,50 
FangBo,female,60
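
As an illustration of the record format used in the logs above, the following is a minimal sketch of parsing one line into its three fields. The class name DwellRecord and its field names are hypothetical and are not part of the sample project. The sketch assumes that lines may carry trailing whitespace, as some of the sample lines do.

```java
// Hypothetical helper for illustration only; not part of the sample project.
class DwellRecord {
    final String name;
    final String sex;
    final int minutes;

    DwellRecord(String name, String sex, int minutes) {
        this.name = name;
        this.sex = sex;
        this.minutes = minutes;
    }

    // Parses one "name,sex,minutes" line.
    // Throws IllegalArgumentException if the line does not have three fields
    // or if the duration is not an integer (NumberFormatException is a subclass).
    static DwellRecord parse(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 comma-separated fields: " + line);
        }
        return new DwellRecord(fields[0].trim(), fields[1].trim(),
                Integer.parseInt(fields[2].trim()));
    }

    public static void main(String[] args) {
        DwellRecord record = parse("FangBo,female,60 ");
        System.out.println(record.name + " " + record.sex + " " + record.minutes);
    }
}
```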

Data Preparation

Save the log files in the Hadoop Distributed File System (HDFS).

  1. Create text files input_data1.txt and input_data2.txt on the Linux operating system, and copy log1.txt to input_data1.txt and log2.txt to input_data2.txt.
  2. Create the /tmp/input directory in HDFS and upload input_data1.txt and input_data2.txt to it:
    1. On the Linux client, run hdfs dfs -mkdir /tmp/input.
    2. On the Linux client, run hdfs dfs -put local_file_path /tmp/input, where local_file_path is the local path of the file to upload.

Development Idea

Collect, from the log files, information about female netizens who spend more than 2 hours shopping online over the weekend.

The process is as follows:

  • Read the source file data.
  • Select the records that describe the time female netizens spend online.
  • Sum the total time that each female netizen spends online.
  • Keep only the female netizens whose total time online exceeds 2 hours.
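
The steps above can be sketched in plain Java without any Hadoop dependency. This is only an illustration of the filter, aggregate, filter logic under stated assumptions, not the sample project's actual Mapper and Reducer classes. The class name FemaleDwellSketch, the method names, and the constant THRESHOLD_MINUTES are hypothetical. The threshold of 120 assumes that "2 hours" is compared against durations recorded in minutes.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free illustration of the processing steps.
class FemaleDwellSketch {
    // Assumption: 2 hours expressed in minutes, the unit used in the logs.
    static final int THRESHOLD_MINUTES = 120;

    // Map-like step: emit (name, minutes) for female records.
    // Returns null for records that are not female or lack three fields.
    static Map.Entry<String, Integer> mapRecord(String line) {
        String[] fields = line.trim().split(",");
        if (fields.length != 3 || !"female".equals(fields[1].trim())) {
            return null;
        }
        return new AbstractMap.SimpleEntry<>(fields[0].trim(),
                Integer.parseInt(fields[2].trim()));
    }

    // Reduce-like step: sum minutes per name, then drop totals
    // that do not exceed the threshold.
    static Map<String, Integer> aggregate(List<String> lines) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> kv = mapRecord(line);
            if (kv != null) {
                totals.merge(kv.getKey(), kv.getValue(), Integer::sum);
            }
        }
        totals.values().removeIf(total -> total <= THRESHOLD_MINUTES);
        return totals;
    }

    public static void main(String[] args) {
        // A few made-up lines in the same format as the log files.
        List<String> sample = Arrays.asList(
                "UserA,female,70", "UserB,male,200", "UserA,female,60");
        System.out.println(aggregate(sample));
    }
}
```

In a real MapReduce job, the first method corresponds roughly to the map phase and the per-name summing and threshold check to the reduce phase, with the framework grouping values by name between the two.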