Updated on 2022-09-14 GMT+08:00

Scenario Description

Develop a Flink DataStream application that processes logs recording how long netizens dwell on online shopping over a weekend. The application can run in both Windows- and Linux-based environments. The requirement and the log format are as follows:

  • Collect statistics, in real time, on female netizens who dwell on online shopping for more than 2 hours.
  • In each log file, the first column records the name, the second column records the gender, and the third column records the dwell duration in minutes. The three columns are separated by commas (,).
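
As an illustration of the log format described above, the following minimal sketch parses one line into a record object. It uses only the Java standard library; the class name UserRecord matches the name used later in this document, but its fields and the parse method are assumptions made for this example and not necessarily those of the sample project.

```java
// Hypothetical holder for one parsed log line; field names are illustrative.
final class UserRecord {
    final String name;
    final String gender;
    final int dwellMinutes;

    UserRecord(String name, String gender, int dwellMinutes) {
        this.name = name;
        this.gender = gender;
        this.dwellMinutes = dwellMinutes;
    }

    // Parses a line of the form "name,gender,minutes".
    // Throws IllegalArgumentException if the line does not have exactly
    // three fields or if the duration is not an integer.
    static UserRecord parse(String line) {
        String[] parts = line.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "Expected 3 comma-separated fields: " + line);
        }
        // Integer.parseInt throws NumberFormatException, which is a
        // subclass of IllegalArgumentException.
        return new UserRecord(
            parts[0].trim(), parts[1].trim(), Integer.parseInt(parts[2].trim()));
    }

    public static void main(String[] args) {
        UserRecord record = UserRecord.parse("CaiXuyu,female,50");
        System.out.println(
            record.name + " / " + record.gender + " / " + record.dwellMinutes + " min");
    }
}
```

Rejecting malformed lines early keeps the later filtering and summing steps free of format checks. A real streaming job might prefer to log and skip such lines instead of failing.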

log1.txt: logs collected on Saturday.

LiuYang,female,20
YuanJing,male,10
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60

log2.txt: logs collected on Sunday.

LiuYang,female,20
YuanJing,male,10
CaiXuyu,female,50
FangBo,female,50
GuoYijun,male,5
CaiXuyu,female,50
Liyuan,male,20
CaiXuyu,female,50
FangBo,female,50
LiuYang,female,20
YuanJing,male,10
FangBo,female,50
GuoYijun,male,50
CaiXuyu,female,50
FangBo,female,60

Data Planning

The DataStream sample project stores its data in TXT files.

Store the log1.txt and log2.txt files in a path that the application under development can access, for example, /opt/log1.txt and /opt/log2.txt.
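
To check that the files are in place and readable before wiring them into a streaming job, a small standalone reader such as the following can help. This is a sketch using only the Java standard library; the class LogFileReader and its readLogs method are hypothetical helpers introduced here, and the default paths are the example locations given above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the sample project.
final class LogFileReader {

    // Returns every non-blank line from the given files,
    // in the order in which the files are listed.
    static List<String> readLogs(Path... files) {
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            try {
                for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                    String trimmed = line.trim();
                    if (!trimmed.isEmpty()) {
                        lines.add(trimmed);
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException("Cannot read " + file, e);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Use the example paths from the text unless paths are passed as arguments.
        Path[] files = args.length > 0
            ? Arrays.stream(args).map(Paths::get).toArray(Path[]::new)
            : new Path[] {Paths.get("/opt/log1.txt"), Paths.get("/opt/log2.txt")};
        System.out.println(readLogs(files).size() + " log lines read");
    }
}
```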

Development Guidelines

Collect statistics on female netizens who dwell on online shopping for more than 2 hours on the weekend.

The objective is achieved through the following process:

  1. Read the text data, generate DataStreams, and parse each line into UserRecord information.
  2. Filter the records to keep only the online time data of female netizens.
  3. Perform a keyBy operation on the name and gender, and sum the total time that each female netizen spends online within a time window.
  4. Filter for netizens whose total online duration exceeds the threshold, and obtain the results.
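
The four steps above can be sketched in plain Java to show the intended logic and the result expected for the sample logs. This is not the Flink implementation: it uses java.util.stream on an in-memory copy of the data and treats the whole weekend as a single window, whereas a DataStream job would use operators such as map, filter, keyBy, and a windowed aggregation. The class DwellPipelineSketch, the nested UserRecord fields, and the 120-minute constant (2 hours) are assumptions made for this example.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative, Flink-free sketch of the four-step process.
final class DwellPipelineSketch {
    static final int THRESHOLD_MINUTES = 120; // 2 hours

    // Sample data from log1.txt and log2.txt, one record per whitespace-separated token.
    static final String SATURDAY =
        "LiuYang,female,20 YuanJing,male,10 GuoYijun,male,5 CaiXuyu,female,50 "
        + "Liyuan,male,20 FangBo,female,50 LiuYang,female,20 YuanJing,male,10 "
        + "GuoYijun,male,50 CaiXuyu,female,50 FangBo,female,60";
    static final String SUNDAY =
        "LiuYang,female,20 YuanJing,male,10 CaiXuyu,female,50 FangBo,female,50 "
        + "GuoYijun,male,5 CaiXuyu,female,50 Liyuan,male,20 CaiXuyu,female,50 "
        + "FangBo,female,50 LiuYang,female,20 YuanJing,male,10 FangBo,female,50 "
        + "GuoYijun,male,50 CaiXuyu,female,50 FangBo,female,60";

    static final class UserRecord {
        final String name;
        final String gender;
        final int dwellMinutes;

        UserRecord(String name, String gender, int dwellMinutes) {
            this.name = name;
            this.gender = gender;
            this.dwellMinutes = dwellMinutes;
        }

        static UserRecord parse(String line) {
            String[] parts = line.split(",");
            return new UserRecord(parts[0], parts[1], Integer.parseInt(parts[2]));
        }
    }

    // Returns "name,gender" -> total minutes for female netizens whose
    // total is strictly greater than thresholdMinutes.
    static Map<String, Integer> femaleOverThreshold(Stream<String> lines, int thresholdMinutes) {
        Map<String, Integer> totals = lines
            .map(UserRecord::parse)                          // step 1: parse
            .filter(r -> "female".equals(r.gender))          // step 2: keep female records
            .collect(Collectors.groupingBy(                  // step 3: key by name and gender, sum
                r -> r.name + "," + r.gender,
                TreeMap::new,
                Collectors.summingInt(r -> r.dwellMinutes)));
        totals.values().removeIf(total -> total <= thresholdMinutes); // step 4: apply threshold
        return totals;
    }

    public static void main(String[] args) {
        Stream<String> weekend = Stream.concat(
            Stream.of(SATURDAY.split("\\s+")), Stream.of(SUNDAY.split("\\s+")));
        femaleOverThreshold(weekend, THRESHOLD_MINUTES)
            .forEach((key, total) -> System.out.println(key + " -> " + total));
        // Prints:
        // CaiXuyu,female -> 300
        // FangBo,female -> 320
    }
}
```

Over the whole weekend, LiuYang accumulates 80 minutes and is therefore filtered out in step 4. In a real job the totals depend on the window size, so a short window may emit no results for the same input.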