Continuous Back Pressure and Full GC Occur When Flink Interconnects with HBase

Symptom

If the source of a Flink task is Kafka and the sink is a component whose performance is lower than that of Kafka, such as HBase, the following symptoms may occur:

Symptom 1: The job is interrupted for a period of time. The following error message may be displayed, and the checkpoint fails continuously:
```
Caused by:org.apache.hadoop.ipc.RemoteException: File does not exist:/flink/checkpoints/xxxx  does not have anyopen files.
```
Symptom 2: After the job runs for a period of time, the TaskManager is faulty. The following error message may be displayed:
```
"org.apache.flink.util.FlinkException: The TaskExecutor is shutting down" or "the remote taskmanager has been lost".
```
Symptom 3: A large amount of back pressure occurs on the first operator of the job.

Possible Causes

Check whether a large number of full GC logs are generated.
Check whether the service uses a wide window. For example, the window wide is set to hours, the memory may be used up.
Check whether the performance of the sink component is poor.
For example, when data is written to components such as HBase that are poorer in performance, back pressure occurs in the entire job.

Solution

Increase the startup memory of the TaskManager.
Optimize the window function to prevent data from being stacked in the memory.
Optimize the sink components or increase the sink concurrency to prevent back pressure.

Parent topic: Common Issues About Flink

Previous topic: Quota for Submitting Flink Jobs to ZooKeeper Exceeds the Threshold

Next topic: Flink Troubleshooting