Scenario Description

Develop a Spark application that performs the following operation on logs recording how long netizens spend shopping online over a weekend.

Collect real-time statistics on female netizens who spend more than half an hour continuously on online shopping.

The first column of the log file records the name, the second column records the gender, and the third column records the dwell duration in minutes. The columns are separated by commas (,).
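
The record layout described above can be illustrated with a small parsing sketch. It is plain Java with no Spark or Kafka dependency, and the class name LogRecord and its parse method are hypothetical names chosen for illustration; they are not taken from the sample project.

```java
import java.util.Optional;

// Hypothetical holder for one log line: name, gender, dwell duration in minutes.
final class LogRecord {
    final String name;
    final String gender;
    final int minutes;

    LogRecord(String name, String gender, int minutes) {
        this.name = name;
        this.gender = gender;
        this.minutes = minutes;
    }

    // Splits a comma-separated line into its three columns.
    // Returns an empty Optional if the line does not have exactly
    // three columns or the duration is not an integer.
    static Optional<LogRecord> parse(String line) {
        String[] fields = line.split(",");
        if (fields.length != 3) {
            return Optional.empty();
        }
        try {
            int minutes = Integer.parseInt(fields[2].trim());
            return Optional.of(new LogRecord(fields[0].trim(), fields[1].trim(), minutes));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }
}
```

Returning an Optional rather than throwing is one possible design choice here; it lets a caller skip malformed lines without stopping the whole job.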

log1.txt: logs collected on Saturday

LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60

log2.txt: logs collected on Sunday

LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60

Data Planning

The data for the Spark Streaming sample project is stored in Kafka. A user with Kafka permissions is required.

Create two text files, input_data1.txt and input_data2.txt, on a local computer. Copy the contents of log1.txt into input_data1.txt and the contents of log2.txt into input_data2.txt.
Create the /home/data directory on the client installation node and upload the two files to it.
Set allow.everyone.if.no.acl.found of Kafka Broker to true. (This parameter does not need to be set for a normal cluster.)
Start the Producer in the sample code to send data to Kafka:
java -cp $SPARK_HOME/jars/*:$SPARK_HOME/jars/streamingClient/*:{JAR_PATH} com.huawei.bigdata.spark.examples.StreamingExampleProducer {BrokerList} {Topic}
- JAR_PATH indicates the path of the JAR package.
- BrokerList uses the format brokerIp:9092.

Development Guidelines

Collect statistics on female netizens who dwell on online shopping for more than half an hour on the weekend.

To achieve the objective, the process is as follows:

Receive data from Kafka and generate the corresponding DStream.
Filter the data to keep only the records of time that female netizens spend online.
Sum the total time that each female netizen spends online within a time window.
Keep only the netizens whose continuous online duration exceeds the threshold, and output the results.
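
The filter, sum, and threshold steps above can be sketched in plain Java. This is only an illustration of the per-window logic: it uses an in-memory list in place of a Kafka DStream, does not use the Spark Streaming API, and the names DwellTimeSketch, femaleOverThreshold, and THRESHOLD_MINUTES are assumptions made for this example. The 30-minute threshold follows the "more than half an hour" requirement.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class DwellTimeSketch {
    // Assumed threshold: "more than half an hour".
    static final int THRESHOLD_MINUTES = 30;

    // Takes the log lines that fall into one time window and returns
    // the total minutes per female netizen, keeping only totals above
    // the threshold. A TreeMap keeps the names in sorted order.
    static Map<String, Integer> femaleOverThreshold(List<String> windowLines) {
        Map<String, Integer> totals = windowLines.stream()
                .map(line -> line.split(","))
                .filter(fields -> fields.length == 3)
                // Step 2: keep female records only.
                .filter(fields -> "female".equals(fields[1].trim()))
                // Step 3: sum the duration per name within the window.
                .collect(Collectors.toMap(
                        fields -> fields[0].trim(),
                        fields -> Integer.parseInt(fields[2].trim()),
                        Integer::sum,
                        TreeMap::new));
        // Step 4: drop totals that do not exceed the threshold.
        totals.values().removeIf(total -> total <= THRESHOLD_MINUTES);
        return totals;
    }

    public static void main(String[] args) {
        // The first five records of log1.txt, treated as one window.
        List<String> window = Arrays.asList(
                "LiuYang,female,20",
                "YuanJing,male,10",
                "GuoYijun,male,5",
                "CaiXuyu,female,50",
                "Liyuan,male,20");
        System.out.println(femaleOverThreshold(window)); // prints {CaiXuyu=50}
    }
}
```

In a streaming job the same three steps would be applied to each window of the DStream rather than to a fixed list, so the result depends on which records fall into the window.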
