After an Active/Standby Switchover of ResourceManager Occurs, a Task Is Interrupted and Runs for a Long Time

Question

While a MapReduce task is running, an active/standby switchover of ResourceManager occurs. After the switchover is complete, the task continues to execute but takes an excessively long time to finish.

Answer

This issue occurs when the ResourceManager HA function is enabled but the Work-preserving RM restart function is not.

If the Work-preserving RM restart function is not enabled, running containers are killed during the ResourceManager switchover. As a result, the ApplicationMaster times out. For details about the Work-preserving RM restart function, visit the following websites:

Versions earlier than MRS 3.2.0: http://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html

MRS 3.2.0 or later: https://hadoop.apache.org/docs/r3.3.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html

To resolve this issue, perform the following operation:

Set the yarn.resourcemanager.work-preserving-recovery.enabled parameter to true to enable the Work-preserving RM restart function.

yarn.resourcemanager.work-preserving-recovery.enabled=true
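
After the change, you may want to confirm that the value is present in the configuration. The following is a minimal sketch, assuming the setting is written to a Hadoop-style yarn-site.xml file made up of <property> entries. The helper names and the inline sample are illustrative only and are not part of any MRS or Hadoop tool. The sketch treats a missing entry as "not enabled"; the effective default on a real cluster depends on its version and is not checked here.

```python
import xml.etree.ElementTree as ET

# Parameter named in this topic.
PROPERTY = "yarn.resourcemanager.work-preserving-recovery.enabled"


def read_property(xml_text, name):
    """Return the value of a Hadoop-style <property> entry, or None if absent."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            return prop.findtext("value", default="").strip()
    return None


def work_preserving_enabled(xml_text):
    """True only if the parameter is explicitly set to 'true' (case-insensitive)."""
    value = read_property(xml_text, PROPERTY)
    return value is not None and value.lower() == "true"


# Hypothetical sample content; on a real cluster, read the actual yarn-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
    <value>true</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(work_preserving_enabled(SAMPLE))
```

To check a real file, read its contents and pass the text to work_preserving_enabled. The file location varies by installation and is not assumed here.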
