Application Development
Overview
Flink is a unified computing framework that supports both batch processing and stream processing. It provides a stream data processing engine with support for data distribution and parallel computing, and it is one of the leading open-source stream processing engines in the industry.
Flink provides highly concurrent pipelined data processing with millisecond-level latency and high reliability, making it well suited to low-latency data processing.
Figure 1 shows the technology stack of Flink.
The following lists the key features of Flink in the current version:
- DataStream
- Checkpoint
- Window
- Job Pipeline
- Configuration Table
Architecture
Figure 2 shows the architecture of Flink.
As shown in Figure 2, the entire Flink system consists of three parts:
- Client
The Flink client is used to submit jobs (streaming jobs) to the Flink cluster (see the submission sketch after this list).
- TaskManager
TaskManager (also called worker) is a service execution node of Flink. It executes the actual tasks. A Flink system can have multiple TaskManagers, all of which are equivalent to one another.
- JobManager
JobManager (also called master) is a management node of Flink. It manages all TaskManagers and schedules tasks submitted by users onto specific TaskManagers. In high-availability (HA) mode, multiple JobManagers are deployed. Among them, one is elected as the leader and the others remain on standby.
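As a minimal sketch of the client role described above, a program can be submitted to a remote JobManager programmatically through Flink's createRemoteEnvironment API. The host name, port, and JAR path below are placeholder assumptions, not values from this document.

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RemoteSubmitExample {
    public static void main(String[] args) throws Exception {
        // Connect to a remote JobManager. The host, port, and JAR path
        // are placeholders for illustration only.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createRemoteEnvironment(
                "jobmanager-host", 8081, "/path/to/user-job.jar");

        // A trivial pipeline; the JobManager schedules its tasks onto TaskManagers.
        DataStream<String> data = env.fromElements("a", "b", "c");
        data.print();

        // Trigger job execution on the remote cluster.
        env.execute("Remote Submission Example");
    }
}
```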
Flink provides the following features:
- Low latency: millisecond-level processing capability.
- Exactly-once semantics: the checkpoint mechanism ensures that application state can be recovered after a failure and that data is processed exactly once.
- High availability: JobManagers support an active/standby mode; if the leader fails, a standby JobManager takes over.
Flink Development APIs
Flink DataStream applications can be developed in Scala and Java, as shown in Table 1.
| Function | Description |
| --- | --- |
| Scala API | API in Scala, which can be used for data processing such as filtering, joining, windowing, and aggregation. The Scala API is recommended for development because Scala is concise and easy to understand. |
| Java API | API in Java, which can be used for data processing such as filtering, joining, windowing, and aggregation. |
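To illustrate the DataStream API, the following is a minimal Java sketch that reads text from a socket, splits lines into words, and counts words over 5-second tumbling windows. The host name, port, and window size are arbitrary assumptions for illustration.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class SocketWindowWordCount {
    public static void main(String[] args) throws Exception {
        // Obtain the execution environment.
        final StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Read text lines from a socket source (host and port are assumptions).
        DataStream<String> text = env.socketTextStream("localhost", 9999);

        // Split lines into (word, 1) pairs, then key, window, and aggregate.
        DataStream<Tuple2<String, Integer>> counts = text
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.toLowerCase().split("\\W+")) {
                            if (!word.isEmpty()) {
                                out.collect(new Tuple2<>(word, 1));
                            }
                        }
                    }
                })
                .keyBy(value -> value.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .sum(1);

        // Export the results to an external system (stdout here).
        counts.print();

        // Trigger job execution.
        env.execute("Socket Window WordCount");
    }
}
```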
Flink Basic Concepts
- DataStream
DataStream is the minimum unit of Flink processing and one of Flink's core concepts. DataStreams are initially imported from external systems such as sockets, Kafka, and files; after being processed by Flink, they are exported to external systems such as sockets, Kafka, and files.
- Data Transformation
A data transformation is a processing unit that transforms one or more DataStreams into a new DataStream.
Data transformations can be classified as follows:
- One-to-one transformation, for example, map.
- One-to-zero, one-to-one, or one-to-multiple transformation, for example, flatMap.
- One-to-zero or one-to-one transformation, for example, filter.
- Multiple-to-one transformation, for example, union.
- Aggregation transformations over multiple records, for example, window and keyBy.
- Checkpoint
Checkpoint is Flink's most important mechanism for ensuring reliable data processing. Checkpoints ensure that all application state can be recovered from a checkpoint if a failure occurs and that data is processed exactly once (a minimal sketch follows this list).
- Savepoint
Savepoints are externally stored checkpoints that you can use to stop-and-resume or upgrade your Flink programs. After an upgrade, you can restore the task state from the savepoint and resume processing, ensuring data continuity.
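The following minimal Java sketch ties these concepts together: it enables exactly-once checkpointing and applies a filter (one-to-zero or one-to-one) and a map (one-to-one) transformation to a DataStream imported from a socket. The checkpoint interval, host, and port are arbitrary assumptions.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointedPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint: snapshot application state every 10 seconds with
        // exactly-once guarantees (the interval is an arbitrary assumption).
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

        // DataStream imported from an external system (a socket here).
        DataStream<String> source = env.socketTextStream("localhost", 9999);

        // One-to-zero or one-to-one transformation: filter out empty lines.
        DataStream<String> nonEmpty = source.filter(line -> !line.isEmpty());

        // One-to-one transformation: map each line to its length.
        DataStream<Integer> lengths = nonEmpty.map(String::length);

        // Export to an external system (stdout here).
        lengths.print();

        env.execute("Checkpointed Pipeline");
    }
}
```

With checkpointing enabled like this, the job's state can also be written out as a savepoint when the job is stopped for an upgrade (triggered externally, for example through the Flink command-line client), and the upgraded job can then be started from that savepoint.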