ALM-18003 NodeManager Unhealthy

Description

The system checks the number of unhealthy NodeManager nodes every 30 seconds, and compares the number with the threshold. The Unhealthy Nodes indicator has a default threshold. This alarm is generated when the value of the Unhealthy Nodes indicator exceeds the threshold.

To change the threshold, on FusionInsight Manager, choose Cluster > Name of the desired cluster > Services > Yarn. On the displayed page, choose Configurations > All Configurations, and change the value of yarn.nodemanager.unhealthy.alarm.threshold. You do not need to restart Yarn to make the change take effect.

The default threshold is 0. The alarm is generated when the number of unhealthy nodes exceeds the threshold, and is cleared when the number of unhealthy nodes is less than or equal to the threshold.
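The threshold rule can be sketched in a few lines. This is only an illustration, not the FusionInsight Manager implementation: the function name alarm_state is hypothetical, and the sketch assumes the alarm clears as soon as the count no longer exceeds the threshold.

```python
def alarm_state(unhealthy_nodes: int, threshold: int = 0) -> str:
    """Return "raised" when the unhealthy-node count exceeds the threshold."""
    if unhealthy_nodes < 0 or threshold < 0:
        raise ValueError("counts must be non-negative")
    return "raised" if unhealthy_nodes > threshold else "cleared"


# With the default threshold of 0, a single unhealthy NodeManager raises the alarm.
print(alarm_state(1))      # raised
print(alarm_state(0))      # cleared
print(alarm_state(2, 2))   # cleared: the count does not exceed the threshold
```

With the default threshold of 0, the alarm is raised by any unhealthy node and clears only when no unhealthy nodes remain.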

Attribute

Alarm ID	Alarm Severity	Automatically Cleared
18003	Major	Yes

Parameters

Name	Meaning
Source	Specifies the cluster for which the alarm is generated.
ServiceName	Specifies the service for which the alarm is generated.
RoleName	Specifies the role for which the alarm is generated.
HostName	Specifies the host for which the alarm is generated.
Unhealthy Host	Specifies the list of hosts with unhealthy nodes.

Impact on the System

The faulty NodeManager node cannot provide the Yarn service.
The number of containers decreases, so the cluster performance deteriorates.

Possible Causes

The hard disk space of the host where the NodeManager node resides is insufficient.
User omm does not have the permission to access a local directory on the NodeManager node.

Procedure

Check the hard disk space of the host.

1. On FusionInsight Manager, choose O&M > Alarm > Alarms. Click the icon in front of the alarm to expand it, and obtain the unhealthy nodes from Additional Information.
2. Choose Cluster > Name of the desired cluster > Services > Yarn > Instance, select the NodeManager instance corresponding to the host, choose Instance Configurations > All Configurations, and view the disks corresponding to yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs.
3. Choose O&M > Alarm > Alarms. In the alarm list, check whether the alarm ALM-12017 Insufficient Disk Capacity is reported for the related disk.
- If yes, go to 4.
- If no, go to 5.
4. Rectify the disk fault based on ALM-12017 Insufficient Disk Capacity and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 7.
5. Choose Hosts > Name of the desired host. On the Dashboard page, check the disk usage of the corresponding partition. Check whether the percentage of used space on the mounted disk exceeds the value of yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage.
- If yes, go to 6.
- If no, go to 7.
6. Reduce the disk usage to less than the value of yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, wait for 10 to 20 minutes, and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 7.
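
The disk usage comparison in this check can also be made from the command line on the NodeManager host. The following is a minimal sketch, not the check that NodeManager itself performs: the function names are hypothetical, the directory and the 90.0 limit in the example are placeholders, and the percentage reported by the operating system may differ slightly from the one Yarn computes. Use the directories from yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs and the limit configured for the cluster.

```python
import shutil


def used_percent(total_bytes: int, used_bytes: int) -> float:
    """Percentage of the partition that is in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def over_limit(path: str, max_percent: float) -> bool:
    """True if the partition holding `path` is fuller than max_percent."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return used_percent(usage.total, usage.used) > max_percent


if __name__ == "__main__":
    # Placeholder directory and limit; replace both with the cluster's values.
    for directory in ["."]:
        print(directory, over_limit(directory, max_percent=90.0))
```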

Check the access permission of the local directory on each NodeManager node.

7. Obtain the NodeManager directories viewed in 2, log in to each NodeManager node as user root, and go to each obtained directory.
8. Run the ll command to check whether the permission of the localdir and containerlogs folders is 755 and whether User:Group is omm:ficommon.
- If yes, go to 11.
- If no, go to 9.
9. Run the following commands to set the permission to 755 and User:Group to omm:ficommon:

chmod 755 <folder_name>

chown omm:ficommon <folder_name>
10. Wait for 10 to 20 minutes and check whether the alarm is cleared.
- If yes, no further action is required.
- If no, go to 11.
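
When many nodes or directories are involved, the permission check can be scripted instead of reading the ll output by eye. The following is a minimal sketch under stated assumptions: the function name check_dir is hypothetical, and the expected values (755 and omm:ficommon) are taken from the check above. It only reports mismatches and does not change anything.

```python
import grp
import os
import pwd
import stat


def check_dir(path, mode=0o755, owner="omm", group="ficommon"):
    """Return a list of mismatches between `path` and the expected settings."""
    st = os.stat(path)
    problems = []
    actual_mode = stat.S_IMODE(st.st_mode)  # permission bits only
    if actual_mode != mode:
        problems.append(f"mode is {actual_mode:o}, expected {mode:o}")
    try:
        actual_owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        actual_owner = str(st.st_uid)  # uid without a passwd entry
    try:
        actual_group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        actual_group = str(st.st_gid)  # gid without a group entry
    if (actual_owner, actual_group) != (owner, group):
        problems.append(
            f"owner is {actual_owner}:{actual_group}, expected {owner}:{group}"
        )
    return problems
```

Calling check_dir("localdir") and check_dir("containerlogs") from inside the NodeManager directory returns an empty list when the folder matches, and a description of each mismatch otherwise.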

Collect fault information.

11. On FusionInsight Manager of the active cluster, choose O&M > Log > Download.
12. In Service, select Yarn for the required cluster.
13. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and 10 minutes after the alarm generation time, respectively. Then, click Download.
14. Contact the O&M personnel and send the collected logs.
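
The log collection window is the alarm generation time minus and plus 10 minutes. A small sketch of that arithmetic follows; the function name log_window and the example timestamp are illustrative only and do not come from a real alarm.

```python
from datetime import datetime, timedelta


def log_window(alarm_time: datetime, margin_minutes: int = 10):
    """Return (start, end) covering margin_minutes before and after the alarm."""
    margin = timedelta(minutes=margin_minutes)
    return alarm_time - margin, alarm_time + margin


# Example timestamp only; use the generation time shown for the alarm.
start, end = log_window(datetime(2024, 1, 1, 12, 0))
print(start, end)  # 2024-01-01 11:50:00 2024-01-01 12:10:00
```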