ALM-18003 NodeManager Unhealthy
Alarm Description
The system checks the number of unhealthy NodeManager nodes every 30 seconds and compares it with the threshold of the Unhealthy Nodes indicator. This alarm is generated when the value of the Unhealthy Nodes indicator exceeds the threshold.
To change the threshold, on FusionInsight Manager, choose Cluster > Name of the desired cluster > Services > Yarn. On the displayed page, choose Configurations > All Configurations, and change the value of yarn.nodemanager.unhealthy.alarm.threshold. You do not need to restart Yarn to make the change take effect.
The default threshold is 0. The alarm is generated when the number of unhealthy nodes exceeds the threshold, and is cleared when the number of unhealthy nodes is less than or equal to the threshold.
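The generation condition can be written as a single comparison. The following is a minimal sketch for illustration only, not FusionInsight's implementation; the function name `should_raise_alarm` is hypothetical.

```shell
# should_raise_alarm UNHEALTHY_COUNT [THRESHOLD]
# Succeeds (exit status 0) when the count exceeds the threshold.
# Hypothetical helper for illustration; the threshold defaults to 0 as stated above.
should_raise_alarm() {
    count=$1
    threshold=${2:-0}
    [ "$count" -gt "$threshold" ]
}

# With the default threshold, a single unhealthy node is enough to raise the alarm.
if should_raise_alarm 1; then
    echo "alarm raised"
fi

# With a threshold of 2, two unhealthy nodes do not exceed it.
if ! should_raise_alarm 2 2; then
    echo "alarm not raised"
fi
```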
Alarm Attributes
Alarm ID | Alarm Severity | Alarm Type | Service Type | Auto Cleared |
---|---|---|---|---|
18003 | Major | Error handling | Yarn | Yes |
Alarm Parameters
Type | Parameter | Description |
---|---|---|
Location Information | Source | Specifies the cluster for which the alarm is generated. |
Location Information | ServiceName | Specifies the service for which the alarm is generated. |
Location Information | RoleName | Specifies the role for which the alarm is generated. |
Location Information | HostName | Specifies the host for which the alarm is generated. |
Additional Information | Unhealthy Host | Specifies the list of hosts with unhealthy nodes. |
Impact on the System
- The faulty NodeManager node cannot provide the Yarn service.
- The number of containers decreases, so the cluster performance deteriorates.
Possible Causes
- The hard disk space of the host where the NodeManager node resides is insufficient.
- User omm does not have the permission to access a local directory on the NodeManager node.
Handling Procedure
Check the hard disk space of the host.
1. On FusionInsight Manager, choose O&M > Alarm > Alarms. Click the icon to the left of the alarm and obtain the unhealthy nodes from Additional Information.
2. Choose Cluster > Name of the desired cluster > Services > Yarn > Instance, select the NodeManager instance corresponding to the host, choose Instance Configurations > All Configurations, and view the disks corresponding to yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs.
3. Choose O&M > Alarm > Alarms. In the alarm list, check whether the alarm ALM-12017 Insufficient Disk Capacity is reported for the related disk.
4. Rectify the disk fault based on ALM-12017 Insufficient Disk Capacity and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 7.
5. Choose Hosts > Name of the desired host. On the Dashboard page, check the disk usage of the corresponding partition. Check whether the percentage of used space on the mounted disk exceeds the value of yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage.
6. Reduce the disk usage to less than the value of yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage, wait for 10 to 20 minutes, and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 7.
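The disk usage comparison above can also be done from a shell on the host. The following is a minimal sketch, assuming a POSIX `df` and `awk`. The function names are hypothetical, `/` stands in for the directories configured in yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs, and the limit of 90.0 is an assumed value that should be replaced with the one shown on the configuration page.

```shell
# disk_usage_pct DIR
# Prints the used-space percentage of the file system that holds DIR.
disk_usage_pct() {
    # df -P prints one POSIX-format line per file system; column 5 is the capacity, e.g. "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# over_limit USED LIMIT
# Succeeds when USED exceeds LIMIT. LIMIT may be fractional, such as 90.0.
over_limit() {
    awk -v used="$1" -v limit="$2" 'BEGIN { exit !(used > limit) }'
}

# Assumed values: replace LIMIT and the directory list with the configured ones.
LIMIT=90.0
for dir in /; do
    used=$(disk_usage_pct "$dir")
    if over_limit "$used" "$LIMIT"; then
        echo "$dir: ${used}% used, above the ${LIMIT}% limit"
    else
        echo "$dir: ${used}% used, within the ${LIMIT}% limit"
    fi
done
```

A directory reported as above the limit is a candidate for cleanup before waiting for the alarm to clear.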
Check the access permission of the local directory on each NodeManager node.
7. Obtain the NodeManager directories viewed in 2, log in to each NodeManager node as user root, and go to the obtained directories.
8. Run the ll command to check whether the permission of the localdir and containerlogs folders is 755 and whether User:Group is omm:ficommon.
   - If yes, no further action is required.
   - If no, go to 9.
9. Run the following commands to set the permission to 755 and User:Group to omm:ficommon:
   chmod 755 <folder_name>
   chown omm:ficommon <folder_name>
10. Wait for 10 to 20 minutes and check whether the alarm is cleared.
   - If yes, no further action is required.
   - If no, go to 11.
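When several folders have to be checked, the permission and ownership comparison above can be scripted instead of reading the ll output by eye. The following is a minimal sketch, assuming GNU `stat` (the `-c` format option). The function name `check_dir` is hypothetical, and the demonstration runs against a temporary directory rather than a real NodeManager folder.

```shell
# check_dir DIR MODE OWNER
# Succeeds when DIR has exactly the octal MODE and the user:group OWNER.
# Returns 2 if DIR cannot be inspected.
check_dir() {
    actual=$(stat -c '%a %U:%G' "$1") || return 2
    [ "$actual" = "$2 $3" ]
}

# Demonstration on a throwaway directory, so the sketch runs on any host.
demo_dir=$(mktemp -d)
chmod 755 "$demo_dir"
demo_owner=$(stat -c '%U:%G' "$demo_dir")

if check_dir "$demo_dir" 755 "$demo_owner"; then
    echo "$demo_dir: permission and owner as expected"
else
    echo "$demo_dir: permission or owner differs"
fi

rmdir "$demo_dir"
```

On a NodeManager node, the call would take the form `check_dir <folder_name> 755 omm:ficommon` for each localdir and containerlogs folder.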
Collect fault information.
11. On FusionInsight Manager of the active cluster, choose O&M > Log > Download.
12. For Service, select Yarn in the required cluster.
13. Click the icon in the upper right corner, and set Start Date and End Date for log collection to 10 minutes before and after the alarm generation time, respectively. Then, click Download.
14. Contact the O&M engineers and send the collected logs.
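The Start Date and End Date for log collection can be derived from the alarm generation time. The following is a minimal sketch, assuming GNU `date`. The function name `log_window` and the sample timestamp are hypothetical, and the time is treated as UTC purely to keep the arithmetic unambiguous.

```shell
# log_window "YYYY-MM-DD HH:MM:SS"
# Prints two lines: the time 10 minutes before and the time 10 minutes after
# the given alarm generation time. The input is interpreted as UTC.
log_window() {
    epoch=$(date -u -d "$1" +%s) || return 1
    date -u -d "@$((epoch - 600))" '+%Y-%m-%d %H:%M:%S'
    date -u -d "@$((epoch + 600))" '+%Y-%m-%d %H:%M:%S'
}

# Sample alarm time, for illustration only.
log_window "2024-05-01 12:00:00"
```

The first line of output is the value for Start Date and the second line is the value for End Date.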
Alarm Clearance
After the fault is rectified, the system automatically clears this alarm.
Related Information
None.