Configuring Automatic Container Information Loading After YARN Instance Restart

Scenario

The YARN Restart feature includes ResourceManager Restart and NodeManager Restart.

When ResourceManager Restart is enabled, the standby ResourceManager node loads the information saved by the previously active ResourceManager node and takes over the container status information from all NodeManager nodes so that services continue to run. Status information is saved through periodic checkpoint operations, which prevents data loss.
When NodeManager Restart is enabled, NodeManager saves the information about the containers running on the current node to a local directory. After the NodeManager service is restarted, it restores the saved status information, so the progress of the containers running on the node is not lost.
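
The save-and-restore behavior described above can be illustrated with a small toy model. This is only a sketch of the idea, not how NodeManager is implemented: the ToyNodeManager class, the containers.json file name, and the container IDs are all hypothetical and exist only for this example.

```python
import json
import tempfile
from pathlib import Path


class ToyNodeManager:
    """Toy stand-in for a service that persists container status locally."""

    def __init__(self, recovery_dir):
        # Hypothetical file name; the real NodeManager uses its own state store.
        self.state_file = Path(recovery_dir) / "containers.json"
        self.containers = {}

    def update(self, container_id, status):
        # Persist on every change so a restart can pick up the latest status.
        self.containers[container_id] = status
        self.state_file.write_text(json.dumps(self.containers))

    def recover(self):
        # Called after a restart: reload whatever was saved before the exit.
        if self.state_file.exists():
            self.containers = json.loads(self.state_file.read_text())


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        before = ToyNodeManager(tmp)
        before.update("container_01", "RUNNING")
        before.update("container_02", "RUNNING")

        after = ToyNodeManager(tmp)  # simulates the restarted service
        after.recover()
        return after.containers
```

Calling demo() returns the two container entries that were written before the simulated restart, which is the property the restart feature is meant to provide.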

Configuring ResourceManager Restart

Log in to FusionInsight Manager.
Choose Cluster > Services > Yarn > Configurations > All Configurations.

Search for the following parameters and modify them as required.

**Table 1** Parameter description of ResourceManager Restart
Parameter	Description	Default Value
yarn.resourcemanager.recovery.enabled	Whether to enable the function of restoring the ResourceManager status after startup. true: The function is enabled. The ResourceManager enables the restoration mechanism to restore its status after a fault occurs. After this function is enabled, you need to set the yarn.resourcemanager.store.class parameter. false: The function is disabled. ResourceManager does not automatically save its status. If ResourceManager is faulty, it needs to be initialized.	true
yarn.resourcemanager.store.class	State-store class used to store the application and task statuses and certificate content. The options are as follows: org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: ZooKeeper-based state-store class. org.apache.hadoop.yarn.server.resourcemanager.recovery.AsyncZKRMStateStore: ZooKeeper-based state-store class for asynchronously starting applications. org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore: Hadoop file system-based state-store class, which is similar to HDFS. If this option is selected, you must set the yarn.resourcemanager.store.class parameter.	org.apache.hadoop.yarn.server.resourcemanager.recovery.AsyncZKRMStateStore
yarn.resourcemanager.zk-state-store.parent-path	Full path of the ZooKeeper znode that stores the ResourceManager status. If yarn.resourcemanager.store.class is set to org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore, this parameter is mandatory.	/rmstore
yarn.resourcemanager.work-preserving-recovery.enabled	Whether to enable the ResourceManager task retention and restoration function. This parameter is used only for YARN feature verification. true: The function is enabled. ResourceManager can retain the previous working status after a fault occurs. false: The function is disabled. ResourceManager does not enable the task retention and restoration mechanism. Resources and tasks need to be initialized after faults are rectified.	true
yarn.resourcemanager.state-store.async.load	Whether to enable asynchronous restoration for completed applications. true: The function is enabled. ResourceManager loads the saved status information from the status storage in asynchronous mode when it is started. false: The function is disabled. ResourceManager loads the status information in synchronous mode when it is started, which may increase the startup time.	true
yarn.resourcemanager.zk-state-store.num-fetch-threads	Number of threads for loading tasks after the asynchronous mode is enabled. When the asynchronous restoration is enabled, increasing the number of task threads can speed up the restoration of task information stored in ZooKeeper. Value range: greater than 0	20

Save the modified configuration. Restart the instances whose configuration has expired for the change to take effect.
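
Before saving, it can help to sanity-check the planned values against the constraints in Table 1. The following is a minimal sketch under stated assumptions: the check_rm_restart helper is hypothetical and not part of FusionInsight Manager or Hadoop, the defaults are copied from the table, and the state-store class names are spelled as in Apache Hadoop.

```python
PREFIX = "org.apache.hadoop.yarn.server.resourcemanager.recovery."

# The three state-store options listed in Table 1.
ALLOWED_STORES = {
    PREFIX + "ZKRMStateStore",
    PREFIX + "AsyncZKRMStateStore",
    PREFIX + "FileSystemRMStateStore",
}

# Default values taken from Table 1.
DEFAULTS = {
    "yarn.resourcemanager.recovery.enabled": "true",
    "yarn.resourcemanager.store.class": PREFIX + "AsyncZKRMStateStore",
    "yarn.resourcemanager.zk-state-store.parent-path": "/rmstore",
    "yarn.resourcemanager.work-preserving-recovery.enabled": "true",
    "yarn.resourcemanager.state-store.async.load": "true",
    "yarn.resourcemanager.zk-state-store.num-fetch-threads": "20",
}

BOOLEAN_KEYS = [
    "yarn.resourcemanager.recovery.enabled",
    "yarn.resourcemanager.work-preserving-recovery.enabled",
    "yarn.resourcemanager.state-store.async.load",
]


def check_rm_restart(overrides):
    """Hypothetical helper: return a list of problems (empty if none)."""
    conf = {**DEFAULTS, **overrides}
    problems = []

    for key in BOOLEAN_KEYS:
        if conf[key] not in ("true", "false"):
            problems.append(f"{key} must be true or false")

    # Recovery needs a state-store class to be configured.
    if (conf["yarn.resourcemanager.recovery.enabled"] == "true"
            and conf["yarn.resourcemanager.store.class"] not in ALLOWED_STORES):
        problems.append("yarn.resourcemanager.store.class is not a listed option")

    # Table 1 gives the value range as greater than 0.
    threads = conf["yarn.resourcemanager.zk-state-store.num-fetch-threads"]
    if not threads.isdigit() or int(threads) <= 0:
        problems.append("num-fetch-threads must be greater than 0")

    return problems
```

For example, check_rm_restart({}) reports no problems for the defaults, while passing a num-fetch-threads value of "0" yields a single problem entry.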

Configuring NodeManager Restart

Log in to FusionInsight Manager.
Choose Cluster > Services > Yarn > Configurations > All Configurations.

Search for the following parameters and modify them as required.

**Table 2** Parameters of NodeManager Restart
Parameter	Description	Default Value
yarn.nodemanager.recovery.enabled	Whether to enable the failure log collection function and restore unfinished applications when NodeManager is restarted. true: The function is enabled. NodeManager restores the tasks or containers that were running before the node became faulty. false: The function is disabled. NodeManager does not automatically save its status. If a fault occurs, NodeManager needs to be initialized.	true
yarn.nodemanager.recovery.dir	Local directory used by NodeManager to store the container status. When the restoration function is enabled for NodeManager, the task status information is saved to this directory so that tasks can be restored after a node fault occurs. This parameter is valid only when yarn.nodemanager.recovery.enabled is set to true. CAUTION: Exercise caution when you modify the configuration. If the configuration is incorrect, the services are unavailable. If the value of this configuration item at the role level is changed, the value of this configuration item at all instance levels will be changed. If the value of this configuration item at the instance level is changed, the value of this configuration item of other instances remains unchanged.	${SRV_HOME}/tmp/yarn-nm-recovery
yarn.nodemanager.recovery.supervised	Whether NodeManager runs in supervision mode. true: The function is enabled. After this feature is enabled, NodeManager does not clear containers after exiting. Instead, NodeManager attempts to restart and restore containers. false: The function is disabled. NodeManager does not attempt to restart and restore containers after exiting.	true
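
For reference, the parameters in Table 2 correspond to properties that Apache Hadoop reads from yarn-site.xml. The sketch below renders the default values from the table in that property format. The to_yarn_site helper is hypothetical, and on a FusionInsight cluster these values are changed on Manager as described above rather than by editing the file directly.

```python
import xml.etree.ElementTree as ET

# Default values taken from Table 2.
NM_RESTART = {
    "yarn.nodemanager.recovery.enabled": "true",
    "yarn.nodemanager.recovery.dir": "${SRV_HOME}/tmp/yarn-nm-recovery",
    "yarn.nodemanager.recovery.supervised": "true",
}


def to_yarn_site(props):
    """Hypothetical helper: render properties as a Hadoop-style XML document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    ET.indent(root)  # pretty-print; requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_yarn_site(NM_RESTART))
```

Running the script prints a configuration element that contains one property element, with a name and a value child, for each of the three parameters.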