Why Does the Switchover of ResourceManager Occur Continuously?

Question

ResourceManager switches over continuously when a large number of tasks (for example, 2,000) run concurrently, making the Yarn service unavailable.

Answer

The issue arises when the duration of a full garbage collection (GC) exceeds the configured interaction threshold between the ResourceManager and ZooKeeper. As a result, the ResourceManager fails to respond within the expected time frame, causing ZooKeeper to consider it unresponsive. This leads to repeated failovers, with the ResourceManager continuously switching over to standby mode.

When many tasks are running, ResourceManager stores the authentication information for each task and sends it to the NodeManagers in its heartbeat replies. Each such reply is called a heartbeat response. A heartbeat response is short-lived (1s by default), so it can normally be reclaimed during a JVM minor garbage collection. However, if there are many tasks and the cluster has many nodes (for example, 5,000), the heartbeat responses for all the nodes occupy a large amount of memory, and the JVM cannot reclaim all of them during minor garbage collection. The heartbeat responses that are not reclaimed accumulate until a JVM full garbage collection is triggered. JVM garbage collection is a blocking operation: ResourceManager executes no tasks while it runs. If a full garbage collection takes longer than the configured interaction threshold between the ResourceManager and ZooKeeper, it can trigger a failover.
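To judge whether garbage collection time is approaching the threshold, one option is to read the collector statistics that the JVM itself exposes. The following is a minimal sketch, not part of the product: it uses the standard java.lang.management API to print the collection count and cumulative collection time of every collector in the JVM it runs in. The class name GcPauseReport and the helper averagePauseMs are hypothetical. Inspecting a running ResourceManager would require reading the same MXBeans over a remote JMX connection, which is not shown here. The reported time is the cumulative elapsed time, so the average is only a rough proxy for the length of an individual blocking pause.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcPauseReport {
    // Average time per collection in milliseconds.
    // Returns 0 when no collection has run or when the JVM reports
    // the count or time as undefined (-1).
    static double averagePauseMs(long collectionCount, long collectionTimeMs) {
        if (collectionCount <= 0 || collectionTimeMs < 0) {
            return 0.0;
        }
        return (double) collectionTimeMs / collectionCount;
    }

    public static void main(String[] args) {
        // One MXBean per collector, for example a young-generation
        // collector and an old-generation (full GC) collector.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            long timeMs = gc.getCollectionTime();
            System.out.printf("%s: collections=%d, totalMs=%d, avgMs=%.1f%n",
                    gc.getName(), count, timeMs, averagePauseMs(count, timeMs));
        }
    }
}
```

If the average for the old-generation collector is close to or above the ZooKeeper threshold, the failover mechanism described above is a likely cause.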

Handle the issue by performing the following operations:

1. Log in to FusionInsight Manager.
   For details about how to log in to FusionInsight Manager, see Accessing MRS Manager.
2. Choose Cluster > Services > Yarn > Configurations > All Configurations.
3. Search for yarn.yarn-site.customized.configs and add the following name and value to increase the interaction threshold between the ResourceManager and ZooKeeper.
   - Name: yarn.resourcemanager.zk-timeout-ms
   - Value: Set as required. The value ranges from 0 to 90,000 ms.
4. Save the settings. Restart the expired service or instance for the configuration to take effect.
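
Choosing the value comes down to simple arithmetic: the threshold should be longer than the longest full garbage collection observed, and it must stay within the 0 to 90,000 ms range. The following is a minimal sketch of that calculation. The class ZkTimeoutSizing, the method suggestTimeoutMs, the 20,000 ms pause, and the 1.5 headroom factor are illustrative assumptions, not product recommendations.

```java
public class ZkTimeoutSizing {
    // Upper bound of yarn.resourcemanager.zk-timeout-ms stated in this procedure.
    static final long MAX_TIMEOUT_MS = 90_000L;

    // Returns the longest observed pause multiplied by a headroom factor,
    // capped at the maximum value the parameter accepts.
    static long suggestTimeoutMs(long longestPauseMs, double headroom) {
        if (longestPauseMs < 0 || headroom < 1.0) {
            throw new IllegalArgumentException("pause must be >= 0 and headroom >= 1.0");
        }
        long candidate = (long) Math.ceil(longestPauseMs * headroom);
        return Math.min(candidate, MAX_TIMEOUT_MS);
    }

    public static void main(String[] args) {
        long longestPauseMs = 20_000L; // hypothetical measurement
        double headroom = 1.5;         // hypothetical safety margin
        System.out.println("yarn.resourcemanager.zk-timeout-ms="
                + suggestTimeoutMs(longestPauseMs, headroom));
    }
}
```

If the result is capped at 90,000 ms, the observed pause is too long for the threshold alone to absorb.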