Updated on 2024-09-23 GMT+08:00

ALM-18013 ResourceManager Direct Memory Usage Exceeds the Threshold

Alarm Description

The system checks the direct memory usage of ResourceManager every 30 seconds. This alarm is generated when the direct memory usage of a ResourceManager instance exceeds the threshold (90% of the maximum direct memory by default).

This alarm is automatically cleared when the direct memory usage is less than the threshold.
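The trigger condition can be illustrated with a minimal sketch. It is not the monitoring code used by the system; the function name `direct_memory_alarm` and the sample values are hypothetical, and only the 90% default threshold comes from the description above.

```python
def direct_memory_alarm(used_bytes: int, max_bytes: int, threshold: float = 0.90) -> bool:
    """Return True when used direct memory exceeds the threshold share of the maximum."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    return used_bytes / max_bytes > threshold


# Hypothetical sample: 950 MB used out of a 1024 MB maximum (about 92.8%).
print(direct_memory_alarm(950 * 1024**2, 1024 * 1024**2))
```

With these sample values the usage is above 90%, so the function reports an alarm condition. A usage of exactly 90% does not exceed the threshold in this sketch.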

Alarm Attributes

| Alarm ID | Alarm Severity | Auto Cleared |
| --- | --- | --- |
| 18013 | Major | Yes |

Alarm Parameters

| Parameter | Description |
| --- | --- |
| Source | Specifies the cluster for which the alarm was generated. |
| ServiceName | Specifies the service for which the alarm was generated. |
| RoleName | Specifies the role for which the alarm was generated. |
| HostName | Specifies the host for which the alarm was generated. |
| Trigger Condition | Specifies the threshold for triggering the alarm. |

Impact on the System

If the available direct memory of ResourceManager is insufficient, a memory overflow may occur and the ResourceManager service may break down.

Possible Causes

The direct memory of ResourceManager instances is overused or the direct memory is inappropriately allocated.

Handling Procedure

Check the direct memory usage.

  1. On FusionInsight Manager, choose O&M > Alarm > Alarms > ALM-18013 ResourceManager Direct Memory Usage Exceeds the Threshold > Location. View the IP address of the instance for which the alarm is generated.
  2. On FusionInsight Manager, choose Cluster, click the name of the desired cluster, and choose Services > Yarn. On the page that is displayed, click the Instances tab and click the ResourceManager instance for which this alarm is generated. Click the drop-down list in the upper right corner of the chart area, choose Customize > Resource, and select Memory Usage Status of ResourceManager to check the direct memory usage.

    Figure 1 Customizing ResourceManager memory usage details

  3. Check whether the used direct memory of a ResourceManager instance reaches 90% (default threshold) of the maximum direct memory allocated to it.

    • If yes, go to 4.
    • If no, go to 9.

  4. On FusionInsight Manager, choose Cluster, click the name of the desired cluster, and choose Services > Yarn > Configurations > All Configurations > ResourceManager > System. Check whether -XX:MaxDirectMemorySize exists in the GC_OPTS parameter.

    • If yes, go to 5.
    • If no, go to 7.

  5. Delete the -XX:MaxDirectMemorySize parameter from GC_OPTS and save the configuration.

    MaxDirectMemorySize specifies the maximum off-heap (direct) memory size. If it is not specified for ResourceManager, no explicit limit is configured for the direct memory of ResourceManager. By default, -XX:MaxDirectMemorySize is not set in the GC_OPTS parameter.

  6. Perform the following steps to restart the ResourceManager instance:

    NOTE:
    • Restarting the standby ResourceManager instance does not affect services.
    • During the ResourceManager switchover, new jobs cannot be submitted to Yarn, but submitted jobs are not affected.

    1. On the Yarn service page, click the Instances tab, select the ResourceManager (Standby) instance, choose More, select Restart Instance, and verify the password to restart the instance.
    2. After the standby instance is restarted, click the Dashboard tab of Yarn, choose More, select Perform ResourceManager Switchover, and verify the password to perform an active/standby switchover.
    3. After the active/standby switchover is complete, click the Instances tab on the Yarn service page, select the ResourceManager (Standby) instance, choose More, select Restart Instance, and verify the password to restart the instance. Wait until the instance is restarted.

  7. Check whether ALM-18008 Heap Memory Usage of ResourceManager Exceeds the Threshold exists.

    • If yes, rectify the fault by referring to ALM-18008 Heap Memory Usage of ResourceManager Exceeds the Threshold.
    • If no, go to 8.

  8. Check whether the alarm is cleared.

    • If yes, no further action is required.
    • If no, go to 9.
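The check for -XX:MaxDirectMemorySize in GC_OPTS, and its removal, are done on the FusionInsight Manager configuration page. For reviewing an exported option string offline, a minimal sketch could look like the following. The helper names and the sample GC_OPTS value are hypothetical and are not taken from a real cluster.

```python
import re

# Matches -XX:MaxDirectMemorySize=<size>, for example 512M or 1g.
_FLAG = re.compile(r"-XX:MaxDirectMemorySize=\S+")


def find_max_direct_memory(gc_opts: str):
    """Return the flag as written in gc_opts, or None if it is absent."""
    match = _FLAG.search(gc_opts)
    return match.group(0) if match else None


def remove_max_direct_memory(gc_opts: str) -> str:
    """Return gc_opts without the flag, with whitespace normalized."""
    return " ".join(_FLAG.sub("", gc_opts).split())


# Hypothetical GC_OPTS value, for illustration only.
sample = "-Xms2G -Xmx2G -XX:MaxDirectMemorySize=512M -XX:+UseG1GC"
print(find_max_direct_memory(sample))
print(remove_max_direct_memory(sample))
```

If the first call returns None, the parameter is not set and there is nothing to delete. Otherwise, the second call shows the option string that remains after the parameter is deleted.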

Collect fault information.

  9. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
  10. Expand the Service drop-down list, and select ResourceManager for the target cluster.
  11. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
  12. Contact O&M personnel and provide the collected logs.
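The log collection window described above (10 minutes before and after the alarm generation time) can be computed with a small sketch such as the following. The function name `log_window` and the sample alarm time are hypothetical.

```python
from datetime import datetime, timedelta


def log_window(alarm_time: datetime, margin_minutes: int = 10):
    """Return (start, end) covering margin_minutes before and after alarm_time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm generation time, for illustration only.
start, end = log_window(datetime(2024, 9, 23, 14, 5))
print("Start Date:", start)
print("End Date:  ", end)
```

For an alarm generated at 14:05, this gives a window from 13:55 to 14:15 on the same day.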

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None