
Error netty.exception.RemoteTransportException Is Reported During Flink Job Runtime

Symptom

The error message "netty.exception.RemoteTransportException" is displayed during Flink job runtime.

Possible Causes

  • Possible cause 1:
    Frequent full GC occurs in the TaskManager heap memory.
    1. Locate the node where the faulty TaskManager is deployed based on the error information in the job logs.
    2. Check the GC logs of that node for a large number of full GC events or excessively long GC pauses, for example, pauses longer than 10s. (For example commands, see the sketches after this list.)

      Note: It is normal for full GC to occur during the first several TaskManager startups. It is also normal if full GC events are infrequent and each pause is short, for example, if the interval between full GC events is at the hour level and each pause does not exceed 1s.

  • Possible cause 2:

    A thread leak (java.lang.OutOfMemoryError: unable to create new native thread) occurs on the service side.

    1. Locate the node where the faulty TaskManager is deployed based on the logs of the abnormal job.
    2. Check whether the logs of the abnormal TaskManager node contain the error message "unable to create new native thread".
    3. This error indicates that the node cannot create new native threads, so the TaskManager process on that node is blocked. (To check the thread count on the node, see the sketches after this list.)
  • Possible cause 3:
    The hardware of the NodeManager node (for example, a disk) is faulty.
    1. Locate the node where the faulty TaskManager is deployed based on the logs of the abnormal job.
    2. No exception information is found in the TaskManager logs.
    3. A large number of disk exceptions are found in the OS logs of the NodeManager node where the TaskManager is deployed. (For example commands, see the sketches after this list.)

    4. Check the yarn-start-stop.log file of the NodeManager to determine whether the NodeManager process has been restarted.

    In addition to the preceding disk errors, if the disk of the NodeManager node becomes full, the NodeManager process restarts, all containers on the node fail, and the Flink job fails as well.
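
    The following command sketches show one possible way to perform the checks described for each cause. They are examples only: the log file names, log paths, and the process ID (<pid>) are placeholders that vary by deployment and must be adapted to your environment.

    Cause 1 (full GC check), assuming GC logging is enabled for the TaskManager and written in the JDK 8 style (full GC events logged as "Full GC") to a gc.log file in the container log directory:

      # Count the full GC events recorded in the GC log
      grep -c "Full GC" gc.log
      # Inspect the latest full GC entries and their pause times
      grep "Full GC" gc.log | tail -n 5

    Cause 2 (thread check), to see whether the TaskManager process is approaching the thread limit of the node or user:

      # Current number of threads in the TaskManager process (<pid> is a placeholder)
      ps -o nlwp= -p <pid>
      # Maximum number of processes/threads allowed for the current user
      ulimit -u

    Cause 3 (disk check), to look for disk errors in the OS logs of the NodeManager node (the OS log path may differ depending on the OS version):

      # Search the OS log for I/O or file system errors
      grep -iE "i/o error|ext4-fs error" /var/log/messages | tail -n 20
      # Check whether any disk is full
      df -h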

Solution

  • Solution for Cause 1
    1. Increase the startup memory of the TaskManager on the service side.
    2. If backpressure exists on the service side, increase the overall service concurrency to eliminate backpressure.

    If backpressure does not exist, ask the service side to check the code for memory leaks. In addition, it is recommended that automatic restart be enabled, for example, by setting restart-strategy: failure-rate in the flink-conf.yaml file (see the configuration sketch after this list).

  • Solution for Cause 2
    1. Check whether the OS configuration of the node is improper, for example, whether the maximum number of user processes or threads allowed on the node is set too low.
    2. Determine whether a thread leak has occurred by referring to the method of checking thread leakage in Flink jobs, for example, by running jstack against the TaskManager process and comparing thread counts over time (see the sketch after this list).
  • Solution for Cause 3
    1. Rectify the disk fault.
    2. Enable automatic restart, for example, by setting restart-strategy: failure-rate in the flink-conf.yaml file (see the configuration sketch after this list).
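
    The following sketches illustrate the configuration and checks referenced in the solutions above. The values shown are examples only, and the memory property name assumes Flink 1.10 or later (earlier releases use taskmanager.heap.size instead of taskmanager.memory.process.size).

    Solutions for Cause 1 and Cause 3 (flink-conf.yaml), increasing the TaskManager memory and enabling failure-rate-based automatic restart:

      # Example TaskManager process memory; choose a value that suits the workload
      taskmanager.memory.process.size: 4096m
      # Restart the job automatically as long as the failure rate stays below the threshold
      restart-strategy: failure-rate
      restart-strategy.failure-rate.max-failures-per-interval: 3
      restart-strategy.failure-rate.failure-rate-interval: 5 min
      restart-strategy.failure-rate.delay: 10 s

    Solution for Cause 2 (thread-leak check with jstack), taking two thread dumps a few minutes apart and comparing the thread counts; a steadily growing count indicates a thread leak in the service code (<pid> is a placeholder for the TaskManager process ID):

      jstack <pid> > dump1.txt
      sleep 300
      jstack <pid> > dump2.txt
      # Each "tid=" line in a dump corresponds to one thread
      grep -c "tid=" dump1.txt dump2.txt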