ALM-18013 ResourceManager Direct Memory Usage Exceeds the Threshold

Alarm Description

The system checks the direct memory usage of ResourceManager every 30 seconds. This alarm is generated when the direct memory usage of a ResourceManager instance exceeds the threshold (by default, 90% of the maximum direct memory).

This alarm is automatically cleared when the direct memory usage is less than the threshold.
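
The trigger and clear conditions described above amount to a simple comparison. The following is a minimal Python sketch of that rule only. The names `usage_percent` and `next_alarm_state` are hypothetical, the byte values are supplied by the caller, and the behavior at exactly the threshold is an assumption, because this topic does not specify it. It does not show how FusionInsight Manager implements the check.

```python
DEFAULT_THRESHOLD_PERCENT = 90.0  # default threshold stated in this topic


def usage_percent(used_bytes: int, max_bytes: int) -> float:
    """Return direct memory usage as a percentage of the maximum direct memory."""
    if max_bytes <= 0:
        raise ValueError("max_bytes must be positive")
    return 100.0 * used_bytes / max_bytes


def next_alarm_state(previously_active: bool, used_bytes: int, max_bytes: int,
                     threshold_percent: float = DEFAULT_THRESHOLD_PERCENT) -> bool:
    """Return True if the alarm should be active after this check."""
    percent = usage_percent(used_bytes, max_bytes)
    if percent > threshold_percent:
        return True   # usage exceeds the threshold: the alarm is generated
    if percent < threshold_percent:
        return False  # usage is below the threshold: the alarm is cleared
    # Assumption: at exactly the threshold, keep the previous state.
    return previously_active
```

For example, `next_alarm_state(False, 950, 1000)` returns `True` (95% usage), and `next_alarm_state(True, 800, 1000)` returns `False` (80% usage).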

Alarm Attributes

Alarm ID	Alarm Severity	Auto Cleared
18013	Major	Yes

Alarm Parameters

Parameter	Description
Source	Specifies the cluster for which the alarm was generated.
ServiceName	Specifies the service for which the alarm was generated.
RoleName	Specifies the role for which the alarm was generated.
HostName	Specifies the host for which the alarm was generated.
Trigger Condition	Specifies the threshold for triggering the alarm.

Impact on the System

If the available direct memory of ResourceManager is insufficient, a memory overflow occurs and the service breaks down.

Possible Causes

The direct memory of the ResourceManager instance is overused, or the direct memory is inappropriately allocated.

Handling Procedure

Check the direct memory usage.

1. On FusionInsight Manager, choose O&M > Alarm > Alarms > ALM-18013 ResourceManager Direct Memory Usage Exceeds the Threshold > Location. View the IP address of the instance for which the alarm is generated.
2. On FusionInsight Manager, choose Cluster, click the name of the desired cluster, and choose Services > Yarn. On the page that is displayed, click the Instances tab and click the ResourceManager instance for which this alarm is generated. Click the drop-down list in the upper right corner of the chart area, choose Customize > Resource, and select Memory Usage Status of ResourceManager to check the direct memory usage.

   Figure 1 Customizing ResourceManager memory usage details
3. Check whether the used direct memory of the ResourceManager instance reaches 90% (default threshold) of the maximum direct memory allocated to it.
   - If yes, go to 4.
   - If no, go to 9.
4. On FusionInsight Manager, choose Cluster, click the name of the desired cluster, and choose Services > Yarn > Configurations > All Configurations > ResourceManager > System. Check whether -XX:MaxDirectMemorySize exists in the GC_OPTS parameter.
   - If yes, go to 5.
   - If no, go to 7.
5. Delete the -XX:MaxDirectMemorySize parameter from GC_OPTS and save the configuration.

   MaxDirectMemorySize sets the maximum off-heap (direct) memory size. If it is not specified for ResourceManager, no explicit direct memory limit is applied. By default, -XX:MaxDirectMemorySize is not set in the GC_OPTS parameter.
6. Perform the following steps to restart the ResourceManager instances:
   - Restarting the standby ResourceManager instance does not affect services.
   - During the ResourceManager switchover, new jobs cannot be submitted to Yarn, but submitted jobs are not affected.
   1. On the Yarn service page, click the Instances tab, select the ResourceManager (Standby) instance, choose More, select Restart Instance, and verify the password to restart the instance.
   2. After the standby instance is restarted, click the Dashboard tab of Yarn, choose More, select Perform ResourceManager Switchover, and verify the password to perform an active/standby switchover.
   3. After the active/standby switchover is complete, click the Instances tab on the Yarn service page, select the ResourceManager (Standby) instance, choose More, select Restart Instance, and verify the password to restart the instance. Wait until the instance is restarted.
7. Check whether the alarm ALM-18008 Heap Memory Usage of ResourceManager Exceeds the Threshold is also generated.
   - If yes, rectify the fault by referring to ALM-18008 Heap Memory Usage of ResourceManager Exceeds the Threshold.
   - If no, go to 8.
8. Check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 9.
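
The check for -XX:MaxDirectMemorySize in GC_OPTS, and its removal, can also be illustrated on a copy of the parameter value outside the console. The following is a minimal Python sketch. The helper names are hypothetical and the sample GC_OPTS string is made up for illustration, so it is not a value taken from any cluster. The configuration itself is still changed on FusionInsight Manager as described in the procedure.

```python
import re

# Matches the JVM option together with its value, e.g. -XX:MaxDirectMemorySize=512M
_MAX_DIRECT_RE = re.compile(r"^-XX:MaxDirectMemorySize=\S+$")


def has_max_direct_memory_size(gc_opts: str) -> bool:
    """Return True if GC_OPTS contains -XX:MaxDirectMemorySize=<value>."""
    return any(_MAX_DIRECT_RE.match(token) for token in gc_opts.split())


def remove_max_direct_memory_size(gc_opts: str) -> str:
    """Return GC_OPTS with every -XX:MaxDirectMemorySize=<value> option removed."""
    kept = [token for token in gc_opts.split() if not _MAX_DIRECT_RE.match(token)]
    return " ".join(kept)


# Hypothetical sample value, for illustration only.
sample_gc_opts = "-Xms2G -Xmx2G -XX:MaxDirectMemorySize=512M -XX:+UseConcMarkSweepGC"
cleaned_gc_opts = remove_max_direct_memory_size(sample_gc_opts)
```

Splitting on whitespace keeps the other options and their order unchanged. The sketch assumes that options in GC_OPTS are separated by spaces and contain no quoted whitespace.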

Collect fault information.

9. On FusionInsight Manager, choose O&M. In the navigation pane on the left, choose Log > Download.
10. Expand the Service drop-down list, and select ResourceManager for the target cluster.
11. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
12. Contact O&M personnel and provide the collected logs.
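
The log collection window used above (10 minutes on each side of the alarm generation time) is simple date arithmetic. The following is a minimal Python sketch. The function name `log_collection_window` and the sample alarm time are hypothetical and used for illustration only.

```python
from datetime import datetime, timedelta
from typing import Tuple


def log_collection_window(alarm_time: datetime,
                          margin_minutes: int = 10) -> Tuple[datetime, datetime]:
    """Return (start, end) covering margin_minutes before and after alarm_time."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Hypothetical alarm generation time, for illustration only.
start_date, end_date = log_collection_window(datetime(2024, 1, 1, 12, 0, 0))
```

With the sample time, `start_date` is 11:50:00 and `end_date` is 12:10:00 on the same day. These values correspond to Start Date and End Date on the log download page.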

Alarm Clearance

This alarm is automatically cleared after the fault is rectified.

Related Information

None
