Configuring Automatic ApplicationMaster Job Retention and Memory Allocation

Scenario

Each application (such as MapReduce and Spark jobs) starts an independent ApplicationMaster when running on YARN. The ApplicationMaster communicates with the ResourceManager of YARN to obtain resources and coordinate task execution.

In YARN, ApplicationMasters run on NodeManagers, just like containers. (Unmanaged ApplicationMasters are not covered in this document.) If the ApplicationMaster crashes, exits, or shuts down, the ResourceManager closes all containers managed by the ApplicationAttempt, including all containers running on the NodeManager. The ResourceManager then starts a new ApplicationAttempt on another compute node.

Different types of applications need to handle ApplicationMaster restart events in different ways. MapReduce applications aim to prevent task loss but can tolerate the loss of currently running containers. For long-running YARN services, however, users may not want the service to stop because of an ApplicationMaster fault.

YARN can retain the state of running containers when a new ApplicationAttempt is started, so running jobs can continue without being interrupted by the ApplicationMaster fault.

Figure 1 ApplicationMaster job preserving

Configuring Automatic ApplicationMaster Job Retention

Log in to FusionInsight Manager.

For details about how to log in to FusionInsight Manager, see Accessing MRS Manager.
Choose Cluster > Services > Yarn > Configurations > All Configurations.

Search for parameters in Table 1 and modify them as required.

**Table 1** Parameter description
Parameter	Description	Example Value
yarn.app.mapreduce.am.work-preserve	Whether to enable the ApplicationMaster job retention feature. The default value is false. true: The function is enabled. When an ApplicationMaster job fails, YARN attempts to retain the results of completed tasks. The results will be reused after a new ApplicationMaster is started. false: The function is disabled. When an ApplicationMaster job fails, YARN does not retain the results of completed tasks. After a new ApplicationMaster is started, all tasks, including those that have been completed, are re-executed.	true
yarn.app.mapreduce.am.umbilical.max.retries	Maximum number of attempts to restore running containers when the communication between the ApplicationMaster and ResourceManager fails. Increasing the number of retries can improve communication reliability and reduce job failures caused by temporary network problems. However, this may increase the resource waiting time and affect the overall job efficiency. A smaller number of retries can speed up failure handling and improve job execution efficiency. However, jobs may fail due to temporary problems. Value range: greater than 0	5
yarn.app.mapreduce.am.umbilical.retry.interval	Interval at which the running container attempts to recover when the communication between the ApplicationMaster and ResourceManager fails, in milliseconds. Increasing the retry interval can reduce the system resource consumption and system pressure caused by frequent retries in a short time. You are advised to increase the interval in high-load or high-latency scenarios. A smaller retry interval can recover the communication more quickly and reduce the job waiting time. You are advised to decrease the interval in low-load and low-latency scenarios.	10000
yarn.resourcemanager.am.max-attempts	Number of retries when the ApplicationMaster fails. Increasing the number of retries prevents ApplicationMaster startup failures caused by insufficient resources. This parameter applies to global settings of all ApplicationMaster jobs. Each ApplicationMaster can use an API to set an independent maximum number of retries. However, the number of retries cannot be greater than the global maximum number of retries. Otherwise, it will be overwritten by the global maximum number of retries. The value must be greater than or equal to 1.	5

Save the modified configuration. Restart the expired service or instance for the configuration to take effect.
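
The interaction between the retry parameters in Table 1 can be illustrated with a small calculation. The following is a minimal sketch, not the YARN implementation: the function names are hypothetical, and the retry window is assumed to be simply the number of retries multiplied by the interval, ignoring the time each attempt itself takes. The capping rule follows the description of yarn.resourcemanager.am.max-attempts above.

```python
from typing import Optional


def umbilical_retry_window_ms(max_retries: int, retry_interval_ms: int) -> int:
    """Approximate time a running container keeps retrying before giving up.

    Assumption: the window is max_retries * retry_interval_ms; the duration
    of each individual attempt is ignored.
    """
    if max_retries <= 0:
        raise ValueError("max_retries must be greater than 0")
    if retry_interval_ms < 0:
        raise ValueError("retry_interval_ms must not be negative")
    return max_retries * retry_interval_ms


def effective_am_max_attempts(global_max: int, app_requested: Optional[int] = None) -> int:
    """Per-application attempts, capped by the global maximum.

    A per-application value larger than the global value is overwritten by
    the global value, as described for yarn.resourcemanager.am.max-attempts.
    """
    if global_max < 1:
        raise ValueError("global_max must be greater than or equal to 1")
    if app_requested is None:
        return global_max
    return min(app_requested, global_max)


if __name__ == "__main__":
    # Example values from Table 1: 5 retries, 10000 ms interval.
    print(umbilical_retry_window_ms(5, 10000))
    # An application asking for 8 attempts with a global maximum of 5.
    print(effective_am_max_attempts(5, 8))
```

With the example values in Table 1, this sketch gives a retry window of about 50 seconds, which may help when choosing an interval for high-latency clusters.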

Configuring the ApplicationMaster to Automatically Adjust the Allocated Memory

When the ApplicationMaster creates a container, it automatically adjusts the allocated memory based on the total number of tasks, making resource utilization more flexible and improving client application fault tolerance.

Log in to FusionInsight Manager.

For details about how to log in to FusionInsight Manager, see Accessing MRS Manager.
Choose Cluster > Services > Yarn > Configurations > All Configurations.

Search for the following parameters and modify them as required.

**Table 2** Parameters for configuring automatic ApplicationMaster memory allocation
Parameter	Description	Example Value
mapreduce.job.am.memory.policy	Policy for automatically adjusting the ApplicationMaster memory based on the number of MapReduce tasks. By default, this parameter is left empty. In this case, the policy is disabled and the ApplicationMaster memory is determined by yarn.app.mapreduce.am.resource.mb (the maximum ApplicationMaster memory; default value: 1536 MB). The value consists of five fields separated by colons (:) and a comma (,), in the following format: {baseTaskCount}:{taskStep}:{memoryStep},{minMemory}:{maxMemory} For details about the fields, see Table 3. For example, if this parameter is set to 100:10:200,1024:3096, when the number of tasks exceeds 100, the container memory increases by 200 MB on top of the value of yarn.app.mapreduce.am.resource.mb for every 10 additional tasks, but cannot exceed the maximum memory of 3096 MB.	-
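
Before a policy string is entered on the configuration page, its format can be checked offline. The following is a minimal sketch: the names AmMemoryPolicy and parse_am_memory_policy are hypothetical helpers and not part of Hadoop or MRS, and the checks only mirror the setting requirements listed in Table 3.

```python
from typing import NamedTuple


class AmMemoryPolicy(NamedTuple):
    """The five fields of mapreduce.job.am.memory.policy (memory in MB)."""
    base_task_count: int
    task_step: int
    memory_step: int
    min_memory: int
    max_memory: int


def parse_am_memory_policy(value: str) -> AmMemoryPolicy:
    """Parse '{baseTaskCount}:{taskStep}:{memoryStep},{minMemory}:{maxMemory}'."""
    try:
        steps_part, limits_part = value.strip().split(",")
        base, task_step, memory_step = (int(x) for x in steps_part.split(":"))
        min_memory, max_memory = (int(x) for x in limits_part.split(":"))
    except ValueError as exc:
        # Raised for a wrong number of fields or a non-integer field.
        raise ValueError(f"malformed policy string: {value!r}") from exc

    policy = AmMemoryPolicy(base, task_step, memory_step, min_memory, max_memory)
    if any(field <= 0 for field in policy):
        raise ValueError("all five fields must be greater than 0")
    if policy.min_memory > policy.max_memory:
        raise ValueError("minMemory cannot be greater than maxMemory")
    return policy


if __name__ == "__main__":
    print(parse_am_memory_policy("100:10:200,1024:3096"))
```

A string that is missing a field, contains a non-numeric value, or sets minMemory above maxMemory raises ValueError in this sketch; how the cluster itself reacts to an invalid value is not described here.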

**Table 3** Configuration description
Parameter	Description	Setting Requirements
baseTaskCount	Task threshold. This parameter takes effect only when the total number of tasks (the sum of the Mapper tasks and Reduce tasks) of an application is greater than or equal to the value of this parameter.	The value cannot be empty and must be greater than 0.
taskStep	Incremental step length of tasks. This parameter and memoryStep determine the memory adjustment amount.	The value cannot be empty and must be greater than 0.
memoryStep	Incremental step length of memory. The memory capacity is increased based on the value of yarn.app.mapreduce.am.resource.mb.	The value cannot be empty and must be greater than 0. Unit: MB
minMemory	Lower limit of memory that can be automatically adjusted. If the adjusted memory is less than or equal to the lower limit, the value of yarn.app.mapreduce.am.resource.mb is used.	The value cannot be empty. It must be greater than 0 and cannot be greater than the value of maxMemory. Unit: MB
maxMemory	Upper limit of memory that can be automatically adjusted. If the adjusted memory exceeds the upper limit, the upper limit is used.	The value cannot be empty. It must be greater than 0 and cannot be less than the value of minMemory. Unit: MB
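
The rules in Table 3 can be combined into a worked calculation. The following is a minimal sketch under stated assumptions, not the actual ApplicationMaster logic: the function name adjusted_am_memory_mb is hypothetical, the number of memory increments is assumed to be the integer quotient of the tasks above baseTaskCount divided by taskStep, and 1536 MB is used as the value of yarn.app.mapreduce.am.resource.mb.

```python
import re


def adjusted_am_memory_mb(total_tasks: int, policy: str, am_resource_mb: int = 1536) -> int:
    """Estimate the ApplicationMaster memory (MB) under a memory policy.

    total_tasks is the sum of Mapper and Reduce tasks. An empty policy
    string means the policy is disabled.
    """
    if not policy.strip():
        return am_resource_mb

    base_task_count, task_step, memory_step, min_memory, max_memory = (
        int(field) for field in re.split(r"[:,]", policy.strip())
    )

    # The policy only takes effect at or above the task threshold.
    if total_tasks < base_task_count:
        return am_resource_mb

    # Assumption: one memory increment per full taskStep above the threshold.
    increments = (total_tasks - base_task_count) // task_step
    adjusted = am_resource_mb + increments * memory_step

    # At or below the lower limit, fall back to the configured AM memory.
    if adjusted <= min_memory:
        return am_resource_mb
    # Above the upper limit, the upper limit is used.
    return min(adjusted, max_memory)


if __name__ == "__main__":
    for tasks in (50, 100, 150, 1000):
        print(tasks, adjusted_am_memory_mb(tasks, "100:10:200,1024:3096"))
```

Under these assumptions, the policy 100:10:200,1024:3096 leaves a 50-task job at 1536 MB, gives a 150-task job 2536 MB (five increments of 200 MB), and caps a 1000-task job at 3096 MB.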