Updated on 2024-11-13 GMT+08:00

ALM-18022 Insufficient Yarn Queue Resources

Description

  • Versions earlier than MRS 3.3.1: The alarm module checks Yarn queue resources every 60 seconds. This alarm is generated when the available resources or ApplicationMaster (AM) resources of a queue are insufficient.

    This alarm is cleared when available resources are sufficient.

  • MRS 3.3.1 and later versions: The alarm module checks Yarn queue resources periodically (the check period is controlled by the alarm.resource.lack.check.times.threshold parameter, in minutes). When the available resources or ApplicationMaster (AM) resources of a queue are insufficient, the following rules determine whether the alarm is reported:
    • If alarm.resource.lack.enable is set to true and alarm.resource.lack.enable.queues is left blank, all queues are allowed to report this alarm.
    • If alarm.resource.lack.enable is set to true and alarm.resource.lack.enable.queues is set to a queue name, only the specified queue is allowed to report this alarm.
    • If alarm.resource.lack.enable is set to false, no queue is allowed to report this alarm.

    To set the preceding parameters, log in to FusionInsight Manager, choose Cluster > Services > Yarn, and click Configurations > All Configurations.

    This alarm is cleared when available resources are sufficient.
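The three rules for MRS 3.3.1 and later can be read as a small decision function. The following Python sketch only illustrates that logic and is not the alarm module's actual code. The function name may_report is hypothetical, and treating alarm.resource.lack.enable.queues as a comma-separated list is an assumption; this document only describes a single queue name.

```python
def may_report(queue: str, enable: bool, enable_queues: str) -> bool:
    """Return True if the given queue is allowed to report ALM-18022.

    enable        -> value of alarm.resource.lack.enable
    enable_queues -> value of alarm.resource.lack.enable.queues
    """
    # Rule 3: reporting is disabled for every queue.
    if not enable:
        return False
    # Assumption: several queue names may be separated by commas.
    allowed = {name.strip() for name in enable_queues.split(",") if name.strip()}
    # Rule 1: a blank list lets all queues report.
    # Rule 2: otherwise only the listed queues report.
    return not allowed or queue in allowed
```

For example, may_report("default", True, "") allows the alarm, whereas may_report("default", True, "queueA") suppresses it for the default queue.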

Attribute

  • Alarm ID: 18022
  • Alarm Severity: Minor
  • Auto Clear: Yes

Parameters

  • Source: Specifies the cluster for which the alarm is generated.
  • QueueName: Specifies the queue for which the alarm is generated.
  • QueueMetric: Specifies the metric of the queue for which the alarm is generated.
  • Trigger Condition: Specifies the threshold triggering the alarm. If the current indicator value exceeds this threshold, the alarm is generated.

Impact on the System

  • A running application takes longer to complete.
  • A submitted application waits a long time before it starts running.

Possible Causes

  • Alarm reporting needs to be adjusted (applicable only to MRS 3.3.1 or later).
  • NodeManager node resources are insufficient.
  • The configured maximum resource capacity of the queue is too small.
  • The configured maximum AM resource percentage is too small.

Procedure

Adjusting the alarm reporting mechanism (applicable only to MRS 3.3.1 or later)

  1. Check whether all queues need to report this alarm.

    • If no queue needs to report this alarm: Log in to FusionInsight Manager and choose Cluster > Services > Yarn. Click Configurations > All Configurations, search for alarm.resource.lack.enable, change the value to false, and save the configuration.
    • If only some queues need to report this alarm: Log in to FusionInsight Manager and choose Cluster > Services > Yarn. Click Configurations > All Configurations, search for alarm.resource.lack.enable.queues, change the value to the name of the queue that needs to report this alarm, and save the configuration.
    • If all queues need to report this alarm, go to 3.

  2. Wait 5 minutes and check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 3.

Check NodeManager resources.

  3. On FusionInsight Manager, choose O&M > Alarm > Alarms.
  4. View the location information of this alarm and check whether QueueName is root and QueueMetric is Memory or vCores.

    • If yes, go to 5.
    • If no, go to 6.

  5. The memory or CPU of the Yarn cluster is insufficient. Log in to the node where NodeManager resides and run the free -g and cat /proc/cpuinfo commands to query the available memory and CPU of the node, respectively. On FusionInsight Manager, increase the values of yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores for the Yarn NodeManager based on the query results, and then restart the NodeManager instance. Check whether the alarm is cleared.

    During the NodeManager restart, containers submitted to this node may be retried on other nodes.

    • If yes, no further action is required.
    • If no, go to 6.
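The two values that free -g and cat /proc/cpuinfo expose can also be read programmatically when many NodeManager nodes need checking. The following Python sketch assumes a Linux node whose /proc/meminfo contains a MemAvailable line; the function names are hypothetical. It reports memory in MB because yarn.nodemanager.resource.memory-mb is expressed in MB.

```python
import re


def mem_available_mb(meminfo_text: str) -> int:
    """Extract MemAvailable (reported in kB) from /proc/meminfo content, in MB."""
    match = re.search(r"^MemAvailable:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemAvailable not found in meminfo content")
    return int(match.group(1)) // 1024


def logical_cpus(cpuinfo_text: str) -> int:
    """Count the 'processor : N' entries in /proc/cpuinfo content."""
    return len(re.findall(r"^processor\s*:", cpuinfo_text, re.MULTILINE))
```

On the node itself, passing open("/proc/meminfo").read() and open("/proc/cpuinfo").read() to these functions gives figures to compare with the configured NodeManager values. How much memory and CPU to leave for the operating system and other services is a sizing decision that this sketch does not make.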

Checking the maximum resource capacity of a queue.

  6. View the location information of this alarm. Check whether, in Location, QueueName is <Tenant Queue> and QueueMetric is Memory or vCores, and whether Additional Information contains available Memory = or available vCores =.

    • If yes, go to 7.
    • If no, go to 9.

  7. The memory or CPU of the tenant queue is insufficient. In this case, choose Tenant Resources > Dynamic Resource Plan > Resource Distribution Policy and increase the value of Maximum Capacity. Then, check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 8.

  8. Choose Cluster > Name of the desired cluster > Services > Yarn > Configurations > All Configurations. Enter the keyword "threshold" and click ResourceManager. Adjust the threshold values of the following parameters:

    If Additional Information contains available Memory =, change the value of yarn.queue.memory.alarm.threshold to a value smaller than that of available Memory = in Additional Information.

    If Additional Information contains available vCores =, change the value of yarn.queue.vcore.alarm.threshold to a value smaller than that of available vCores = in Additional Information.

    Wait for five minutes and check whether the alarm is cleared.
    • If yes, no further action is required.
    • If no, go to 11.
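The checks on QueueName, QueueMetric, and Additional Information decide which part of this procedure applies. The following Python sketch summarizes that routing. The function name next_check and the returned labels are hypothetical, the substring matching assumes Additional Information contains the literal text quoted in this document, and falling back to fault collection when nothing matches is an assumption.

```python
def next_check(queue_name: str, queue_metric: str, additional_info: str) -> str:
    """Map an ALM-18022 alarm's location fields to the next check to perform."""
    basic_metric = queue_metric in ("Memory", "vCores")
    # Root queue short of memory or vCores: the cluster itself lacks resources.
    if queue_name == "root" and basic_metric:
        return "nodemanager_resources"
    # Tenant queue short of memory or vCores: review its maximum capacity.
    if basic_metric and (
        "available Memory =" in additional_info
        or "available vCores =" in additional_info
    ):
        return "queue_max_capacity"
    # ApplicationMaster share exhausted: review the AM resource percentage.
    if (
        "available AmMemory =" in additional_info
        or "available AmvCores =" in additional_info
    ):
        return "am_resource_percent"
    # Assumption: anything else goes straight to log collection.
    return "collect_fault_information"
```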

Checking the maximum AM resource percentage.

  9. If available AmMemory = or available AmvCores = is included in Additional Information, ApplicationMaster memory or CPU of the tenant queue is insufficient. In this case, choose Tenant Resources > Dynamic Resource Plan > Queue Configuration and increase the value of Maximum Am Resource Percent. Then, check whether this alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 10.

  10. Choose Cluster > Name of the desired cluster > Services > Yarn > Configurations > All Configurations. Enter the keyword "threshold" and click ResourceManager. Adjust the threshold values of the following parameters:

    If Additional Information contains available AmMemory =, change the value of yarn.queue.memory.alarm.threshold to a value smaller than that of available AmMemory = in Additional Information.

    If Additional Information contains available AmvCores =, change the value of yarn.queue.vcore.alarm.threshold to a value smaller than that of available AmvCores = in Additional Information.

    Wait for five minutes and check whether the alarm is cleared.
    • If yes, no further action is required.
    • If no, go to 11.
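Both threshold adjustments follow the same rule: the new threshold must be smaller than the available value shown in Additional Information. The following Python sketch extracts those values and returns, for each parameter, the value that the new threshold must stay below. It assumes Additional Information carries entries of the form "available Memory = 2048" with integer values; the exact format and units are not specified in this document, and the function name is hypothetical.

```python
import re

# Parameter to adjust for each Additional Information key, as listed above.
PARAM_FOR_KEY = {
    "Memory": "yarn.queue.memory.alarm.threshold",
    "AmMemory": "yarn.queue.memory.alarm.threshold",
    "vCores": "yarn.queue.vcore.alarm.threshold",
    "AmvCores": "yarn.queue.vcore.alarm.threshold",
}

_PATTERN = re.compile(r"available (AmMemory|AmvCores|Memory|vCores)\s*=\s*(\d+)")


def threshold_upper_bounds(additional_info: str) -> dict:
    """Return {parameter: value the new threshold must be smaller than}."""
    bounds = {}
    for key, value in _PATTERN.findall(additional_info):
        param = PARAM_FOR_KEY[key]
        # If a parameter is hit twice, keep the tighter (smaller) bound.
        bounds[param] = min(int(value), bounds.get(param, int(value)))
    return bounds
```

For an input such as "available Memory = 2048", the result names yarn.queue.memory.alarm.threshold with a bound of 2048, so any value below 2048 satisfies the rule.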

Collect fault information.

  11. Log in to FusionInsight Manager of the active cluster, and choose O&M > Log > Download.
  12. In Service, select Yarn for the required cluster.
  13. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
  14. Contact the O&M personnel and send the collected logs.
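The log collection window is the alarm generation time minus and plus 10 minutes. The following Python sketch computes Start Date and End Date from an alarm timestamp; the function name log_window and its margin_minutes parameter are hypothetical.

```python
from datetime import datetime, timedelta


def log_window(alarm_time: datetime, margin_minutes: int = 10):
    """Return (start, end) covering margin_minutes before and after the alarm."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin
```

For an alarm generated at 08:05, the window runs from 07:55 to 08:15 on the same day.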

Alarm Clearing

After the fault is rectified, the system automatically clears this alarm.

Reference

None