Java_MapReduce Service

Common APIs of Flink

Flink mainly uses the following APIs:

StreamExecutionEnvironment: provides the execution environment, which is the basis of Flink stream processing.
DataStream: represents streaming data. DataStream can be regarded as unmodifiable collection that contains duplicate data. The number of elements in DataStream are unlimited.
KeyedStream: generates the stream by performing keyBy grouping operations on DataStream. Data is grouped based on the specified key value.
WindowedStream: generates the stream by performing the window function on KeyedStream, sets the window type, and defines window triggering conditions.
AllWindowedStream: generates streams by performing the window function on DataStream, sets the window type, defines window triggering conditions, and performs operations on the window data.
ConnectedStreams: generates streams by connecting the two DataStreams and retaining the original data type, and performs the map or flatMap operation.
JoinedStreams: performs equijoin (which is performed when two values are equal, for example, a.id = b.id) operation to data in the window. The join operation is a special scenario of coGroup operation.
CoGroupedStreams: performs the coGroup operation on data in the window. CoGroupedStreams can perform various types of join operations.

Figure 1 Conversion of Flink stream types

Data Stream Source

**Table 1** APIs about data stream source
API	Description
public final <OUT> DataStreamSource<OUT> fromElements(OUT... data)	Obtain user-defined data of multiple elements as the data stream source. type indicates the data type of an element. data indicates the data of multiple elements.
public final <OUT> DataStreamSource<OUT> fromElements(Class<OUT> type, OUT... data)
public <OUT> DataStreamSource<OUT> fromCollection(Collection<OUT> data)	Obtain the user-defined data collection as the data stream source. type indicates the data type of elements in the collection. typeInfo indicates the type information obtained based on the element data type in the collection. data indicates the iterator.
public <OUT> DataStreamSource<OUT> fromCollection(Collection<OUT> data, TypeInformation<OUT> typeInfo)
public <OUT> DataStreamSource<OUT> fromCollection(Iterator<OUT> data, Class<OUT> type)
public <OUT> DataStreamSource<OUT> fromCollection(Iterator<OUT> data, TypeInformation<OUT> typeInfo)
public <OUT> DataStreamSource<OUT> fromParallelCollection(SplittableIterator<OUT> iterator, Class<OUT> type)	Obtain the user-defined data collection as parallel data stream source. type indicates the data type of elements in the collection. typeInfo indicates the type information obtained based on the element data type in the collection. iterator indicates the iterator that can be divided into multiple partitions.
public <OUT> DataStreamSource<OUT> fromParallelCollection(SplittableIterator<OUT> iterator, TypeInformation<OUT> typeInfo)
private <OUT> DataStreamSource<OUT> fromParallelCollection(SplittableIterator<OUT> iterator, TypeInformation<OUT> typeInfo, String operatorName)
public DataStreamSource<Long> generate Sequence(long from, long to)	Obtain a sequence of user-defined data as the data stream source. from indicates the starting point of numbers. to indicates the end point of numbers.
public DataStreamSource<String> readTextFile(String filePath)	Obtain the user-defined text file from a specific path as the data stream source. filePath indicates the path of the text file. charsetName indicates the encoding format.
public DataStreamSource<String> readTextFile(String filePath, String charsetName)
public <OUT> DataStreamSource<OUT> readFile(FileInputformat<OUT> inputformat, String filePath)	Obtain the user-defined file from a specific path as the data stream source. filePath indicates the file path. inputformat indicates the format of the file. watchType indicates the file processing mode, which can be PROCESS_ONCE or PROCESS_CONTINUOUSLY. interval indicates the interval for processing directories or files.
public <OUT> DataStreamSource<OUT> readFile(FileInputformat<OUT> inputformat, String filePath, FileProcessingMode watchType, long interval)
public <OUT> DataStreamSource<OUT> readFile(FileInputformat<OUT> inputformat, String filePath, FileProcessingMode watchType, long interval, TypeInformation<OUT> typeInformation)
public DataStreamSource<String> socketTextStream(String hostname, int port, String delimiter, long maxRetry)	Obtain user-defined socket data as the data stream source. hostname indicates the host name of the socket server. port indicates the listening port of the server. delimiter indicates the separator of messages. maxRetry refers to the maximum retry times that can be triggered by abnormal connections.
public DataStreamSource<String> socketTextStream(String hostname, int port, String delimiter)
public DataStreamSource<String> socketTextStream(String hostname, int port)
public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function)	Customize the SourceFunction and addSource methods to add data sources such as Kafka. The implementation method is the run method of SourceFunction. function indicates the user-defined SourceFunction function. sourceName indicates the name of data source. typeInfo indicates the type information obtained based on the element data type.
public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName)
public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, TypeInformation<OUT> typeInfo)
public <OUT> DataStreamSource<OUT> addSource(SourceFunction<OUT> function, String sourceName, TypeInformation<OUT> typeInfo)

Data Output

**Table 2** APIs about data output
API	Description
public DataStreamSink<T> print()	Print data as the standard output stream.
public DataStreamSink<T> printToErr()	Print data as the standard error output stream.
public DataStreamSink<T> writeAsText(String path)	Write data to a specific text file. path indicates the path of the text file. writeMode indicates the writing mode, which can be OVERWRITE or NO_OVERWRITE.
public DataStreamSink<T> writeAsText(String path, WriteMode writeMode)
public DataStreamSink<T> writeAsCsv(String path)	Writhe data to a specific .csv file. path indicates the path of the text file. writeMode indicates the writing mode, which can be OVERWRITE or NO_OVERWRITE. rowDelimiter indicates the row separator. fieldDelimiter indicates the column separator.
public DataStreamSink<T> writeAsCsv(String path, WriteMode writeMode)
public <X extends Tuple> DataStreamSink<T> writeAsCsv(String path, WriteMode writeMode, String rowDelimiter, String fieldDelimiter)
public DataStreamSink<T> writeToSocket(String hostName, int port, SerializationSchema<T> schema)	Write data to the socket connection. hostName indicates the host name. port indicates the port number.
public DataStreamSink<T> writeUsingOutputformat(Outputformat<T> format)	Write data to a file, for example, a binary file.
public DataStreamSink<T> addSink(SinkFunction<T> sinkFunction)	Export data in a user-defined manner. The flink-connectors allows the addSink method to support exporting data to Kafka. The implementation method is the invoke method of SinkFunction.

Filtering and Mapping

**Table 3** APIs about filtering and mapping
API	Description
public <R> SingleOutputStreamOperator<R> map(MapFunction<T, R> mapper)	Transform an element into another element of the same type.
public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper)	Transform an element into zero, one, or multiple elements.
public SingleOutputStreamOperator<T> filter(FilterFunction<T> filter)	Run a Boolean function on each element and retain elements that return true.

Aggregation

**Table 4** APIs about aggregation
API	Description
public KeyedStream<T, Tuple> keyBy(int... fields)	Logically divide a stream into multiple partitions, each containing elements with the same key. The partitioning is internally implemented using hash partition. A KeyedStream is returned. Return the KeyedStream after the KeyBy operation, and call the KeyedStream function, such as reduce, fold, min, minby, max, maxby, sum, and sumby. fields indicates the numbers of columns or names of member variables. key indicates the user-defined basis for partitioning.
public KeyedStream<T, Tuple> keyBy(String... fields)
public <K> KeyedStream<T, K> keyBy(KeySelector<T, K> key)
public SingleOutputStreamOperator<T> reduce(ReduceFunction<T> reducer)	Perform reduce on KeyedStream in a rolling manner. Aggregate the current element and the last reduced value into a new value. The types of the three values are the same.
public <R> SingleOutputStreamOperator<R> fold(R initialValue, FoldFunction<T, R> folder)	Perform fold operation on KeyedStream based on an initial value in a rolling manner. Aggregate the current element and the last folded value into a new value. The input and returned values of Fold can be different.
public SingleOutputStreamOperator<T> sum(int positionToSum)	Calculate the sum in a KeyedStream in a rolling manner. positionToSum and field indicate calculating the sum of a specific column.
public SingleOutputStreamOperator<T> sum(String field)
public SingleOutputStreamOperator<T> min(int positionToMin)	Calculate the minimum value in a KeyedStream in a rolling manner. min returns the minimum value, without guarantee of the correctness of columns of non-minimum values. positionToMin and field indicate calculating the minimum value of a specific column.
public SingleOutputStreamOperator<T> min(String field)
public SingleOutputStreamOperator<T> max(int positionToMax)	Calculate the maximum value in KeyedStream in a rolling manner. max returns the maximum value, without guarantee of the correctness of columns of non-maximum values. positionToMax and field indicate calculating the maximum value of a specific column.
public SingleOutputStreamOperator<T> max(String field)
public SingleOutputStreamOperator<T> minBy(int positionToMinBy)	Obtain the row where the minimum value of a column locates in a KeyedStream. minBy returns all elements of that row. positionToMinBy indicates the column on which the minBy operation is performed. first indicates whether to return the first or last minimum value.
public SingleOutputStreamOperator<T> minBy(String positionToMinBy)
public SingleOutputStreamOperator<T> minBy(int positionToMinBy, boolean first)
public SingleOutputStreamOperator<T> minBy(String field, boolean first)
public SingleOutputStreamOperator<T> maxBy(int positionToMaxBy)	Obtain the row where the maximum value of a column locates in a KeyedStream. maxBy returns all elements of that row. positionToMaxBy indicates the column on which the maxBy operation is performed. first indicates whether to return the first or last maximum value.
public SingleOutputStreamOperator<T> maxBy(String positionToMaxBy)
public SingleOutputStreamOperator<T> maxBy(int positionToMaxBy, boolean first)
public SingleOutputStreamOperator<T> maxBy(String field, boolean first)

DataStream Distribution

**Table 5** APIs about DataStream distribution
API	Description
public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, int field)	Use a user-defined partitioner to select target task for each element. partitioner indicates the user-defined method for repartitioning. field indicates the input parameters of partitioner. keySelector indicates the user-defined input parameters of partitioner.
public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, String field)
public <K> DataStream<T> partitionCustom(Partitioner<K> partitioner, KeySelector<T, K> keySelector)
public DataStream<T> shuffle()	Randomly and evenly partition elements.
public DataStream<T> rebalance()	Partition elements in round-robin manner, ensuring load balance of each partition. This partitioning is helpful to data with skewness.
public DataStream<T> rescale()	Distribute elements into downstream subset in round-robin manner.
public DataStream<T> broadcast()	Broadcast each element to all partitions.

Project Capabilities

**Table 6** APIs about projecting
API	Description
public <R extends Tuple> SingleOutputStreamOperator<R> project(int... fieldIndexes)	Select some field subset from the tuple. fieldIndexes indicates some sequences of the tuple. NOTE: Only tuple data type is supported by the project API.

Configuring the eventtime Attribute

**Table 7** APIs about configuring the eventtime attribute
API	Description
public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(AssignerWithPeriodicWatermarks<T> timestampAndWatermarkAssigner)	Extract timestamp from records, enabling event time window to trigger computing.
public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(AssignerWithPunctuatedWatermarks<T> timestampAndWatermarkAssigner)

Table 8 lists differences of AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks APIs.

**Table 8** Difference of AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks APIs
Parameter	Description
AssignerWithPeriodicWatermarks	Generate Watermark based on the getConfig().setAutoWatermarkInterval(200L) timestamp of StreamExecutionEnvironment class.
AssignerWithPunctuatedWatermarks	Generate a Watermark each time an element is received. Watermarks can be different based on received elements.

Iteration

**Table 9** APIs about iteration
API	Description
public IterativeStream<T> iterate()	In the flow, create in a feedback loop to redirect the output of an operator to a preceding operator. NOTE: This API is helpful to algorithms that require constant update of models. long maxWaitTimeMillis: The timeout period of each round of iteration.
public IterativeStream<T> iterate(long maxWaitTimeMillis)

Stream Splitting

**Table 10** APIs about stream splitting
API	Description
public SplitStream<T> split(OutputSelector<T> outputSelector)	Use OutputSelector to rewrite select method, specify stream splitting basis (by tagging), and construct SplitStream streams. That is, a string is tagged for each element, so that a stream of a specific tag can be selected or created.
public DataStream<OUT> select(String... outputNames)	Select one or multiple streams from a SplitStream. outputNames indicates the sequence of string tags created for each element using the split method.

Window

Windows can be classified as tumbling windows and sliding windows.

APIs such as Window, TimeWindow, CountWindow, WindowAll, TimeWindowAll, and CountWindowAll can be used to generate windows.
APIs such as Window Apply, Window Reduce, Window Fold, and Aggregations on windows can be used to operate windows.
Multiple Window Assigners are supported, including TumblingEventTimeWindows, TumblingProcessingTimeWindows, SlidingEventTimeWindows, SlidingProcessingTimeWindows, EventTimeSessionWindows, ProcessingTimeSessionWindows, and GlobalWindows.
ProcessingTime, EventTime, and IngestionTime are supported.
Timestamp modes EventTime: AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks are supported.

Table 11 lists APIs for generating windows.

**Table 11** APIs for generating windows
API	Description
public <W extends Window> WindowedStream<T, KEY, W> window(WindowAssigner<? super T, W> assigner)	Define windows in partitioned KeyedStreams. A window groups each key according to some characteristics (for example, data received within the latest 5 seconds.
public <W extends Window> AllWindowedStream<T, W> windowAll(WindowAssigner<? super T, W> assigner)	Define windows in DataStreams.
public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size)	Define time windows in partitioned KeyedStreams, select ProcessingTime or EventTime based on the environment.getStreamTimeCharacteristic() parameter, and determine whether the window is TumblingWindow or SlidingWindow depends on the number of parameters. size indicates the duration of the window. slide indicates the sliding time of window. NOTE: WindowedStream and AllWindowedStream indicates two types of streams. If the API contains only one parameter, the window is TumblingWindow. If the API contains two or more parameters, the window is SlidingWindow.
public WindowedStream<T, KEY, TimeWindow> timeWindow(Time size, Time slide)
public AllWindowedStream<T, TimeWindow> timeWindowAll(Time size)	Define time windows in partitioned DataStream, select ProcessingTime or EventTime based on the environment.getStreamTimeCharacteristic() parameter, and determine whether the window is TumblingWindow or SlidingWindow depends on the number of parameters. size indicates the duration of the window. slide indicates the sliding time of window.
public AllWindowedStream<T, TimeWindow> timeWindowAll(Time size, Time slide)
public WindowedStream<T, KEY, GlobalWindow> countWindow(long size)	Divide windows according to the number of elements and define windows in partitioned KeyedStreams. size indicates the duration of the window. slide indicates the sliding time of window. NOTE: WindowedStream and AllWindowedStream indicates two types of streams. If the API contains only one parameter, the window is TumblingWindow. If the API contains two or more parameters, the window is SlidingWindow.
public WindowedStream<T, KEY, GlobalWindow> countWindow(long size, long slide)
public AllWindowedStream<T, GlobalWindow> countWindowAll(Time size)	Divide windows according to the number of elements and define windows in DataStreams. size indicates the duration of the window. slide indicates the sliding time of window.
public AllWindowedStream<T, GlobalWindow> countWindowAll(Time size, Time slide)

Table 12 lists APIs for operating windows.

**Table 12** APIs for operating windows
Method	API	Description
Window	public <R> SingleOutputStreamOperator<R> apply(WindowFunction<T, R, K, W> function)	Apply a general function to a window. The data in the window is calculated as a whole. function indicates the window function to be executed. resultType indicates the type of returned data.
	public <R> SingleOutputStreamOperator<R> apply(WindowFunction<T, R, K, W> function, TypeInformation<R> resultType)
	public SingleOutputStreamOperator<T> reduce(ReduceFunction<T> function)	Apply a reduce function to the window and return the result. reduceFunction indicates the reduce function to be executed. function of WindowFunction indicates triggering an operation to the window after a reduce operation. resultType indicates the type of returned data.
	public <R> SingleOutputStreamOperator<R> reduce(ReduceFunction<T> reduceFunction, WindowFunction<T, R, K, W> function)
	public <R> SingleOutputStreamOperator<R> reduce( ReduceFunction<T> reduceFunction, WindowFunction<T, R, K, W> function, TypeInformation<R> resultType)
	public <R> SingleOutputStreamOperator<R> fold(R initialValue, FoldFunction<T, R> function)	Apply a fold function to the window and return the result. initialValue indicates the initial value. foldFunction indicates the fold function. function of WindowFunction indicates triggering an operation to the window after a fold operation. resultType indicates the type of returned data.
	public <R> SingleOutputStreamOperator<R> fold(R initialValue, FoldFunction<T, R> function, TypeInformation<R> resultType)
	public <ACC, R> SingleOutputStreamOperator<R> fold(ACC initialValue, FoldFunction<T, ACC> foldFunction, WindowFunction<ACC, R, K, W> function)
	public <ACC, R> SingleOutputStreamOperator<R> fold(ACC initialValue, FoldFunction<T, ACC> foldFunction, WindowFunction<ACC, R, K, W> function, TypeInformation<ACC> foldAccumulatorType, TypeInformation<R> resultType)
WindowAll	public <R> SingleOutputStreamOperator<R> apply(AllWindowFunction<T, R, W> function)	Apply a general function to a window. The data in the window is calculated as a whole.
	public <R> SingleOutputStreamOperator<R> apply(AllWindowFunction<T, R, W> function, TypeInformation<R> resultType)
	public SingleOutputStreamOperator<T> reduce(ReduceFunction<T> function)	Apply a reduce function to the window and return the result. reduceFunction indicates the reduce function to be executed. AllWindowFunction indicates triggering an operation to the window after a reduce operation. resultType indicates the type of returned data.
	public <R> SingleOutputStreamOperator<R> reduce(ReduceFunction<T> reduceFunction, AllWindowFunction<T, R, W> function)
	public <R> SingleOutputStreamOperator<R> reduce(ReduceFunction<T> reduceFunction, AllWindowFunction<T, R, W> function, TypeInformation<R> resultType)
	public <R> SingleOutputStreamOperator<R> fold(R initialValue, FoldFunction<T, R> function)	Apply a fold function to the window and return the result. initialValue indicates the initial value. foldFunction indicates the fold function. function of WindowFunction indicates triggering an operation to the window after a fold operation. resultType indicates the type of returned data.
	public <R> SingleOutputStreamOperator<R> fold(R initialValue, FoldFunction<T, R> function, TypeInformation<R> resultType)
	public <ACC, R> SingleOutputStreamOperator<R> fold(ACC initialValue, FoldFunction<T, ACC> foldFunction, AllWindowFunction<ACC, R, W> function)
	public <ACC, R> SingleOutputStreamOperator<R> fold(ACC initialValue, FoldFunction<T, ACC> foldFunction, AllWindowFunction<ACC, R, W> function, TypeInformation<ACC> foldAccumulatorType, TypeInformation<R> resultType)
Window and WindowAll	public SingleOutputStreamOperator<T> sum(int positionToSum)	Sum a specified column of the window data. field and positionToSum indicate a specific column of the data.
	public SingleOutputStreamOperator<T> sum(String field)
	public SingleOutputStreamOperator<T> min(int positionToMin)	Calculate the minimum value of a specified column of the window data. min returns the minimum value, without guarantee of the correctness of columns of non-minimum values. positionToMin and field indicate calculating the minimum value of a specific column.
	public SingleOutputStreamOperator<T> min(String field)
	public SingleOutputStreamOperator<T> minBy(int positionToMinBy)	Obtain the row where the minimum value of a column locates in the window data. minBy returns all elements of that row. positionToMinBy indicates the column on which the minBy operation is performed. first indicates whether to return the first or last minimum value.
	public SingleOutputStreamOperator<T> minBy(String positionToMinBy)
	public SingleOutputStreamOperator<T> minBy(int positionToMinBy, boolean first)
	public SingleOutputStreamOperator<T> minBy(String field, boolean first)
	public SingleOutputStreamOperator<T> max(int positionToMax)	Calculate the maximum value of a specified column of the window data. max returns the maximum value, without guarantee of the correctness of columns of non-maximum values. positionToMax and field indicate calculating the maximum value of a specific column.
	public SingleOutputStreamOperator<T> max(String field)
	The default public SingleOutputStreamOperator<T> maxBy(int positionToMaxBy)//true	Obtain the row where the maximum value of a column locates in the window data. maxBy returns all elements of that row. positionToMaxBy indicates the column on which the maxBy operation is performed. first indicates whether to return the first or last maximum value.
	The default public SingleOutputStreamOperator<T> maxBy(String positionToMaxBy)//true
	public SingleOutputStreamOperator<T> maxBy(int positionToMaxBy, boolean first)
	public SingleOutputStreamOperator<T> maxBy(String field, boolean first)

Combining Multiple DataStreams

**Table 13** APIs about combining multiple DataStreams
API	Description
public final DataStream<T> union(DataStream<T>... streams)	Perform union operation on multiple DataStreams to generate a new data stream containing all data from original DataStreams. NOTE: If you perform union operation on a piece of data with itself, there are two copies of the same data.
public <R> ConnectedStreams<T, R> connect(DataStream<R> dataStream)	Connect two DataStreams and retain the original type. The connect API method allows two DataStreams to share statuses. After ConnectedStreams is generated, map or flatMap operation can be called.
public <R> SingleOutputStreamOperator<R> map(CoMapFunction<IN1, IN2, R> coMapper)	Perform mapping operation, which is similar to map and flatMap operation in a ConnectedStreams, on elements. After the operation, the type of the new DataStreams is string.
public <R> SingleOutputStreamOperator<R> flatMap(CoFlatMapFunction<IN1, IN2, R> coFlatMapper)	Perform mapping operation, which is similar to flatMap operation in DataStream, on elements.

Join Operation

**Table 14** APIs about join operation
API	Description
public <T2> JoinedStreams<T, T2> join(DataStream<T2> otherStream)	Join two DataStreams using a given key in a specified window.
public <T2> CoGroupedStreams<T, T2> coGroup(DataStream<T2> otherStream)	Co-group two DataStreams using a given key in a specified window.

Java