Scenario Description
Develop a Spark application to perform the following operations on logs about netizens' dwell time for online shopping on a weekend.
- Collect statistics on female netizens who dwell on online shopping for more than 2 hours on the weekend.
- In each log file, the first column records the name, the second column records the gender, and the third column records the dwell duration in minutes. The three columns are separated by commas (,).
log1.txt: logs collected on Saturday
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
log2.txt: logs collected on Sunday
LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
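As an illustration of the record format described above, the following minimal sketch parses one log line into its three fields. It is plain Python rather than Spark code, and the names DwellRecord and parse_record are hypothetical, introduced only for this example.

```python
from typing import NamedTuple


class DwellRecord(NamedTuple):
    """One log line: name, gender, dwell duration in minutes (hypothetical helper type)."""
    name: str
    gender: str
    minutes: int


def parse_record(line: str) -> DwellRecord:
    # Columns are separated by commas; the third column is an integer number of minutes.
    name, gender, minutes = (field.strip() for field in line.split(","))
    return DwellRecord(name, gender, int(minutes))


record = parse_record("LiuYang,female,20")
print(record)  # DwellRecord(name='LiuYang', gender='female', minutes=20)
```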
Data Planning
Save the original log files in the HDFS.
- Create two text files, input_data1.txt and input_data2.txt, on a local computer. Copy the content of log1.txt to input_data1.txt and the content of log2.txt to input_data2.txt.
- Create the /tmp/input folder in the HDFS and upload input_data1.txt and input_data2.txt to it as follows:
- On the HDFS client, run the following command for authentication:
kinit -kt '/opt/client/Spark/spark/conf/user.keytab' <Service user for authentication>
Specify the path of the user.keytab file based on the site requirements.
- On the HDFS client running the Linux OS, run the hadoop fs -mkdir /tmp/input command (or the hdfs dfs command) to create a directory.
- On the HDFS client running the Linux OS, run the hadoop fs -put input_xxx.txt /tmp/input command to upload the data file.
Development Guidelines
Collect statistics on female netizens who dwell on online shopping for more than 2 hours on the weekend.
To achieve the objective, the process is as follows:
- Create a table and import the log files into the table.
- Filter the records to keep only the online time of female netizens.
- Sum the total time that each female netizen spends online.
- Filter the totals to keep only female netizens who spend more than 2 hours (120 minutes) online.
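The steps above can be sketched in plain Python against the sample records from log1.txt and log2.txt. This is only an illustration of the logic, not the Spark application itself: in-memory lists stand in for the table, and the function name female_dwell_over and its threshold_minutes parameter are assumptions made for this example. In a Spark RDD program, the same steps would roughly correspond to a filter, a reduceByKey, and a second filter.

```python
from collections import defaultdict

# Sample records copied from log1.txt (Saturday) and log2.txt (Sunday).
LOG1 = [
    "LiuYang,female,20", "YuanJing,male,10", "GuoYijun,male,5",
    "CaiXuyu,female,50", "Liyuan,male,20", "FangBo,female,50",
    "LiuYang,female,20", "YuanJing,male,10", "GuoYijun,male,50",
    "CaiXuyu,female,50", "FangBo,female,60",
]
LOG2 = [
    "LiuYang,female,20", "YuanJing,male,10", "CaiXuyu,female,50",
    "FangBo,female,50", "GuoYijun,male,5", "CaiXuyu,female,50",
    "Liyuan,male,20", "CaiXuyu,female,50", "FangBo,female,50",
    "LiuYang,female,20", "YuanJing,male,10", "FangBo,female,50",
    "GuoYijun,male,50", "CaiXuyu,female,50", "FangBo,female,60",
]


def female_dwell_over(lines, threshold_minutes=120):
    """Return {name: total_minutes} for female netizens above the threshold."""
    # Step 1: load the records (here, split each comma-separated line).
    rows = [line.split(",") for line in lines]
    # Step 2: keep only the records of female netizens.
    female_rows = [(name, int(minutes)) for name, gender, minutes in rows
                   if gender == "female"]
    # Step 3: sum the dwell time per netizen.
    totals = defaultdict(int)
    for name, minutes in female_rows:
        totals[name] += minutes
    # Step 4: keep only totals above the threshold (2 hours = 120 minutes).
    return {name: total for name, total in totals.items()
            if total > threshold_minutes}


result = female_dwell_over(LOG1 + LOG2)
print(result)  # {'CaiXuyu': 300, 'FangBo': 320}
```

With the sample data, LiuYang totals 80 minutes over the weekend and is therefore excluded, while CaiXuyu (300 minutes) and FangBo (320 minutes) exceed the 2-hour threshold.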