Optimizing Flink Performance
Overview
Flink is a unified computing framework that supports both batch processing and stream processing. It provides a distributed, parallel stream data processing engine and is one of the leading open-source stream processing engines in the industry. Flink offers high-concurrency pipelined data processing, millisecond-level latency, and high reliability, which makes it well suited to low-latency data processing.
Cluster Service Deployment Architecture
Service Scale and Capacity Parameter Configuration
Flink is a stream data processing engine that relies heavily on memory and CPU resources. Plan memory and CPU resources based on the current service volume and its expected growth. Consider the following aspects:
- Plan CPU and memory resources according to service objectives. Set the JobManager memory, the number of TaskManagers, the TaskManager memory, the number of slots in each TaskManager, and the number of CPU cores based on the current data distribution and service complexity (see the sizing sketch after this list).
- When planning memory, reserve about 20% of the memory (recommended) as the operating system buffer cache.
- Account for how much the data expands after compressed data blocks read from HDFS are decompressed.
- Reserve disk space for cached data and logs.
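The following is a minimal sizing sketch, assuming a hypothetical worker node with 128 GB of memory and 32 CPU cores; all figures are illustrative placeholders, not recommendations. Reserving roughly 20% (about 26 GB) for the operating system buffer cache leaves about 102 GB for Flink, so three TaskManagers per node can each receive roughly 32 GB.

```yaml
# Excerpt from conf/flink-conf.yaml (illustrative values only).
jobmanager.memory.process.size: 4g       # total JobManager process memory
taskmanager.memory.process.size: 32g     # total memory per TaskManager process
taskmanager.numberOfTaskSlots: 8         # 3 TaskManagers x 8 slots = 24 parallel subtasks per node
```

With 32 cores and 24 slots per node, each slot has slightly more than one core available in this hypothetical layout.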
Optimization Objective
Flink performance optimization aims to meet service objectives efficiently without affecting the normal operation of other services. To achieve this, the system generally maximizes the utilization of physical cluster resources such as CPU, memory, and disk I/O; as utilization rises, certain resources may become bottlenecks and should be watched.
Optimization Principles
- Improve CPU utilization and reduce unnecessary overhead.
- Improve memory utilization.
- Optimize the service logic to reduce processing workloads and I/O operations.
Typical Performance Optimization Methods for DataStream
- Configuring memory: Adjust the ratio of the old generation to the young generation in the JVM heap. When developing Flink applications, optimize how the DataStream is partitioned or grouped (see the partitioning example after this list).
- Configuring the degree of parallelism (DOP): Set the parallelism based on the available memory and CPU, the data volume, and the application logic. Parallelism can be specified at four layers; in descending order of priority, they are the operator layer, the execution environment layer, the client layer, and the system layer (see the parallelism sketch after this list).
- Configuring process parameters: Set the JobManager memory, the number of TaskManagers, the number of slots per TaskManager, and the TaskManager memory (the corresponding configuration keys appear in the sizing sketch above).
- Designing the partitioning method: random partitioning (shuffle), rebalancing (round-robin distribution of elements to balance the load across partitions), rescaling (round-robin distribution of elements to a subset of downstream operator instances), broadcast partitioning (broadcasting each element to all partitions), and custom partitioning (see the partitioning example after this list).
- Configuring Netty network communication: Modify the related settings in the conf/flink-conf.yaml file on the client (see the configuration sketch after this list).
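As a sketch of the four parallelism layers, the following DataStream snippet sets the parallelism at the execution environment layer and overrides it for a single operator; the source host/port, the operator logic, and the parallelism values are illustrative assumptions. The client layer (for example, flink run -p) and the system layer (parallelism.default in flink-conf.yaml) provide lower-priority defaults.

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismLayersExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Execution environment layer: default parallelism for every operator in this job.
        env.setParallelism(4);

        DataStream<String> lines = env.socketTextStream("localhost", 9999);  // illustrative source

        lines.map(new MapFunction<String, String>() {
                 @Override
                 public String map(String value) {
                     return value.toUpperCase();
                 }
             })
             // Operator layer: overrides the environment-level default for this operator only.
             .setParallelism(8)
             .print();

        env.execute("Parallelism layers example");
    }
}
```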
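The partitioning strategies listed above correspond directly to DataStream API calls. The following minimal sketch uses an illustrative in-memory source and a hypothetical custom partitioner keyed on the first character of each element; only the final rebalanced stream is connected to a sink.

```java
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PartitioningExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> events = env.fromElements("a", "b", "c", "d");

        // Each call below only illustrates one strategy; the results are not consumed further.
        events.shuffle();      // random partitioning
        events.rebalance();    // round-robin across all downstream partitions
        events.rescale();      // round-robin within a subset of downstream subtasks
        events.broadcast();    // every element is sent to every partition

        // Custom partitioning: extract a key (first character) and map it to a partition.
        events.partitionCustom(
                new Partitioner<String>() {
                    @Override
                    public int partition(String key, int numPartitions) {
                        return Math.abs(key.hashCode()) % numPartitions;
                    }
                },
                new KeySelector<String, String>() {
                    @Override
                    public String getKey(String value) {
                        return value.substring(0, 1);
                    }
                });

        events.rebalance().print();  // attach a sink so the job has something to execute
        env.execute("Partitioning example");
    }
}
```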
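For the memory (GC ratio) and Netty items above, the related settings live in conf/flink-conf.yaml. The keys below are standard Flink options; the values are illustrative placeholders rather than tuned recommendations.

```yaml
# JVM options for TaskManagers: old-to-young generation ratio (illustrative value).
env.java.opts.taskmanager: "-XX:NewRatio=2"

# System layer: default parallelism used when no higher-priority layer sets one.
parallelism.default: 4

# Netty network stack tuning (placeholder values).
taskmanager.network.netty.server.numThreads: 4            # -1 (default) means "number of slots"
taskmanager.network.netty.client.numThreads: 4
taskmanager.network.netty.num-arenas: 4
taskmanager.network.netty.sendReceiveBufferSize: 4194304  # socket buffer size in bytes
taskmanager.network.netty.transport: nio                  # nio, epoll, or auto
```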
Metrics Monitoring
Performance metrics include throughput, resource utilization, and scalability.
- Throughput: Run the same computing tasks in the same resource environment and measure how quickly they complete (see the throughput metric sketch after this list).
- Resource usage: Run the computing tasks and monitor CPU, memory, and network usage under different loads.
- Scalability:
− Performance improvement curve with horizontal scaling: Run the same computing tasks before and after resource scaling and then compare system performance.
− Performance degradation curve as system loads increase: Increase the system load in the same resource environment and compare system performance before and after the change.
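One way to obtain per-operator throughput figures for these comparisons is a user-defined meter registered through Flink's metric group. The following is a minimal sketch in which the operator and metric name are illustrative; the reported rate is then visible in the web UI and through the REST API, so it can be compared across runs or cluster sizes.

```java
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.metrics.Meter;
import org.apache.flink.metrics.MeterView;

// Pass-through map operator that reports its own records-per-second rate.
public class ThroughputMeasuringMap extends RichMapFunction<String, String> {

    private transient Meter throughput;

    @Override
    public void open(Configuration parameters) {
        // Register a meter under this operator's metric group; the 60-second
        // MeterView reports an average rate over the last minute.
        this.throughput = getRuntimeContext()
                .getMetricGroup()
                .meter("recordsPerSecond", new MeterView(60));
    }

    @Override
    public String map(String value) {
        throughput.markEvent();  // count one processed record
        return value;
    }
}
```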