Instance
Instance Description
- Collect statistics on female netizens who spend more than 2 hours shopping online over a weekend.
- In the log files, the first column records the name, the second column records the gender, and the third column records the dwell duration in minutes. The three attributes are separated by commas (,).
log1.txt: logs collected on Saturday
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
log2.txt: logs collected on Sunday
LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60
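As a minimal sketch of the record format described above, the following Python function splits one log line into its three fields. The `Record` type and the `parse_record` function are hypothetical names introduced only for illustration; they are not part of the sample project.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """One log entry (hypothetical helper type, not from the sample project)."""
    name: str
    gender: str
    minutes: int


def parse_record(line: str) -> Record:
    """Split a 'name,gender,minutes' line into typed fields."""
    name, gender, minutes = line.strip().split(",")
    # The third column is the dwell duration in minutes, so convert it to int.
    return Record(name, gender, int(minutes))


record = parse_record("FangBo,female,60")
print(record.name, record.gender, record.minutes)
```

Converting the duration to an integer at parse time keeps the later aggregation step a plain sum.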
Data Preparation
Save the log files to the Hadoop Distributed File System (HDFS).
- Create text files input_data1.txt and input_data2.txt on a local computer, and copy the log contents of log1.txt to input_data1.txt and log2.txt to input_data2.txt respectively.
- Create the /tmp/input directory in HDFS and upload input_data1.txt and input_data2.txt to it:
- On the Linux FusionInsight client, run hadoop fs -mkdir /tmp/input (the equivalent hdfs dfs command provides the same function).
- On the Linux FusionInsight client, go to the local directory that contains the two files, and run hadoop fs -put input_data1.txt /tmp/input and hadoop fs -put input_data2.txt /tmp/input.
Development Idea
Collect, from the log files, information about female netizens who spend more than 2 hours shopping online over the weekend.
The process includes:
- Read the source file data.
- Keep only the records of the time that female netizens spend online.
- Aggregate the total time that each female netizen spends online.
- Keep only the female netizens whose total online time exceeds 2 hours.
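The four steps above can be sketched in plain Python, without Spark, to make the logic concrete. This is an illustration only and not the sample project's code: the function name `female_over_threshold` and the `threshold_minutes` parameter are hypothetical. In a Spark job, the same steps would typically map onto RDD operations such as `map`, `filter`, and `reduceByKey`.

```python
from collections import defaultdict
from typing import Dict, Iterable


def female_over_threshold(lines: Iterable[str],
                          threshold_minutes: int = 120) -> Dict[str, int]:
    """Return total minutes per female netizen whose total exceeds the threshold."""
    totals: Dict[str, int] = defaultdict(int)
    for line in lines:                                   # step 1: read records
        name, gender, minutes = line.strip().split(",")
        if gender == "female":                           # step 2: keep female records
            totals[name] += int(minutes)                 # step 3: sum per person
    # Step 4: keep only totals strictly greater than the threshold (2 hours).
    return {name: total for name, total in totals.items()
            if total > threshold_minutes}


# Sample records from log1.txt (Saturday) and log2.txt (Sunday).
saturday = ("LiuYang,female,20 YuanJing,male,10 GuoYijun,male,5 "
            "CaiXuyu,female,50 Liyuan,male,20 FangBo,female,50 "
            "LiuYang,female,20 YuanJing,male,10 GuoYijun,male,50 "
            "CaiXuyu,female,50 FangBo,female,60").split()
sunday = ("LiuYang,female,20 YuanJing,male,10 CaiXuyu,female,50 "
          "FangBo,female,50 GuoYijun,male,5 CaiXuyu,female,50 "
          "Liyuan,male,20 CaiXuyu,female,50 FangBo,female,50 "
          "LiuYang,female,20 YuanJing,male,10 FangBo,female,50 "
          "GuoYijun,male,50 CaiXuyu,female,50 FangBo,female,60").split()

print(female_over_threshold(saturday + sunday))
# {'CaiXuyu': 300, 'FangBo': 320}
```

With the sample logs, LiuYang totals 80 minutes across both days and is therefore excluded, while CaiXuyu and FangBo each exceed the 120-minute threshold.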
Packaging the Project
- Use the Maven tool provided by IDEA to package the project and generate a JAR file. For details, see Compiling and Running the Application.
- Upload the JAR file to any directory (for example, /opt/female/) on the server where the Spark client is located.
Running Tasks
Go to the Spark client directory and run the following commands, which invoke the bin/spark-submit script to run the code. The class name and file name must match those in the actual code; the following commands are only examples.
- Run the Scala and Java sample programs.
- bin/spark-submit --class com.huawei.bigdata.spark.examples.FemaleInfoCollection --master yarn --deploy-mode client /opt/female/FemaleInfoCollection-1.0.jar <inputPath>
- <inputPath> indicates the input path in HDFS.
- Run the Python sample program.
- bin/spark-submit --master yarn --deploy-mode client /opt/female/SparkPythonExample/collectFemaleInfo.py <inputPath>
- <inputPath> indicates the input path in HDFS.
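If the submission is scripted, the commands above can be assembled as argument lists. The following sketch only builds and prints the commands and does not run them. The helper names are hypothetical, the class name and file paths are the example values shown above, and /tmp/input is used as the input path on the assumption that the data was uploaded there during data preparation.

```python
import shlex
from typing import List


def scala_java_submit_cmd(input_path: str) -> List[str]:
    """Build the example spark-submit command for the Scala/Java sample."""
    return [
        "bin/spark-submit",
        "--class", "com.huawei.bigdata.spark.examples.FemaleInfoCollection",
        "--master", "yarn",
        "--deploy-mode", "client",
        "/opt/female/FemaleInfoCollection-1.0.jar",
        input_path,
    ]


def python_submit_cmd(input_path: str) -> List[str]:
    """Build the example spark-submit command for the Python sample."""
    return [
        "bin/spark-submit",
        "--master", "yarn",
        "--deploy-mode", "client",
        "/opt/female/SparkPythonExample/collectFemaleInfo.py",
        input_path,
    ]


# /tmp/input is an assumed input path (the HDFS directory from data preparation).
for cmd in (scala_java_submit_cmd("/tmp/input"), python_submit_cmd("/tmp/input")):
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Either list could then be passed to `subprocess.run(cmd, check=True)` with the Spark client directory as the working directory, because the script path bin/spark-submit is relative.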