Scenarios

Develop a DataStream application of Flink to perform the following operations on logs about dwell durations of netizens for shopping online.

The DataStream application can run in both the Windows environment and the Linux environment.

Collect statistics on female netizens who dwell on online shopping for more than 2 hours in total in a real time manner.

The first column in the log file records names, the second column records genders, and the third column records the dwell duration in the unit of minute. Three attributes are separated by commas (,).

log1.txt: logs collected on Saturday. The log file can be obtained from the data directory of the sample project.

LiuYang,female,20 
YuanJing,male,10 
GuoYijun,male,5 
CaiXuyu,female,50 
Liyuan,male,20 
FangBo,female,50 
LiuYang,female,20 
YuanJing,male,10 
GuoYijun,male,50 
CaiXuyu,female,50 
FangBo,female,60

log2.txt: logs collected on Sunday. The log file can be obtained from the data directory of the sample project.

LiuYang,female,20 
YuanJing,male,10 
CaiXuyu,female,50 
FangBo,female,50 
GuoYijun,male,5 
CaiXuyu,female,50 
Liyuan,male,20 
CaiXuyu,female,50 
FangBo,female,50 
LiuYang,female,20 
YuanJing,male,10 
FangBo,female,50 
GuoYijun,male,50 
CaiXuyu,female,50 
FangBo,female,60

Data Planning

Data of DataStream sample project is stored in a .txt file.

Place the log1.txt and log2.txt in two directories, for example, /opt/log1.txt and /opt/log2.txt.

If the data file is stored in the local file system, the data file must be stored in the specified directory on all nodes where Yarn NodeManager is deployed, and the running user access permission must be set.
Alternatively, store the data file on HDFS and set the file read path in the program to the HDFS path, for example, hdfs://hacluster/path/to/file.

Development Approach

Collect the information about female netizens who spend more than 2 hours in online shopping on the weekend from the log files.

The process includes:

Read text data, generate DataStreams, and parse data to generate UserRecord information.
Filter the data about the time that female netizens spend online.
Perform keyby operation based on the name and gender, and collect the time that female netizens spend online within a time window.
Filter data about users whose consecutive online duration exceeds the threshold, and obtain the result.

Parent topic: DataStream Application

Previous topic: DataStream Application

Next topic: Java Sample Code