After an Active/Standby Switchover of ResourceManager Occurs, a Task Is Interrupted and Runs for a Long Time

Question

While a MapReduce task is running, an active/standby switchover of ResourceManager occurs. After the switchover is complete, the MapReduce task continues to run but takes an excessively long time to finish.

Answer

This occurs because the ResourceManager HA function is enabled but the Work-preserving RM restart function is not.

If the Work-preserving RM restart function is not enabled, containers are killed during the ResourceManager switchover. As a result, the Application Master times out. For details about the Work-preserving RM restart function, see the Apache Hadoop documentation for your version:

Versions earlier than MRS 3.2.0: http://hadoop.apache.org/docs/r3.1.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html

MRS 3.2.0 or later: https://hadoop.apache.org/docs/r3.3.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerRestart.html

To enable the Work-preserving RM restart function, perform the following steps:

Go to the All Configurations page of the YARN service (see Modifying Cluster Service Configuration Parameters), enter yarn.resourcemanager.work-preserving-recovery.enabled in the search box, and set the parameter value to true. Save the configuration, and then, during off-peak hours, restart the instances whose YARN configuration has expired.
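
After the restart, one way to confirm the setting is to inspect a copy of the yarn-site.xml file delivered to the cluster or client. The following is a minimal sketch in Python, not part of the product: the function name is_work_preserving_enabled and the SAMPLE content are hypothetical and used only for illustration, and the location of yarn-site.xml in your environment is not assumed here.

```python
import xml.etree.ElementTree as ET

# Parameter name taken from the procedure above.
PROPERTY = "yarn.resourcemanager.work-preserving-recovery.enabled"


def is_work_preserving_enabled(yarn_site_xml: str) -> bool:
    """Return True if the parameter is present and set to true in the given XML text."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == PROPERTY:
            value = (prop.findtext("value") or "").strip()
            return value.lower() == "true"
    # Parameter not listed in this file; the effective value then depends
    # on the defaults of the cluster, which this sketch does not check.
    return False


# Hypothetical yarn-site.xml content used only to demonstrate the check.
SAMPLE = """<configuration>
  <property>
    <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
    <value>true</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(is_work_preserving_enabled(SAMPLE))
```

To check a real file, read its text and pass it to the function. A result of False means only that this file does not set the parameter to true; it does not indicate what value the running ResourceManager is using.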
