"ALM-12027 Host PID Usage Exceeds the Threshold" Is Generated for a NameNode
Symptom
In clusters of version 3.1.2 and earlier 3.x versions, the alarm "ALM-12027 Host PID Usage Exceeds the Threshold" is generated for a NameNode, and Java processes on the node may report the error "unable to create new native thread".
Cause Analysis
- Run the following command to count the threads of each process on the node and sort the processes by thread count:
ps -efT | awk '{print $2}' |sort -n |uniq -c |sort -n
Each output line shows a thread count followed by a PID, in ascending order of thread count, so the process with the most threads is listed last.
- Check the process that has the most threads. In this example, process 2346 is the NameNode process. It has started 54,000 threads, and the number keeps increasing.
- Print the jstack log of that process several times. The logs show that a large number of NameNode threads are in the WAITING state and have not been released for a long time.
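The two checks above can be scripted. The following is a minimal sketch, assuming a Linux host with procps `ps` and a JDK `jstack` on the PATH. The helper names and the `NN_PID` variable are illustrative and not part of the product. `jstack` normally has to be run as the user that owns the target process.

```shell
#!/bin/sh
# Sketch only: helper names and NN_PID are illustrative assumptions.

# stdin:  one PID per line, one line per thread (e.g. from `ps -eLo pid=`).
# stdout: "<thread_count> <pid>", ascending, so the heaviest process is last.
count_threads_per_pid() {
    sort -n | uniq -c | sort -n | awk '{print $1, $2}'
}

# stdin:  a jstack thread dump.
# stdout: "<count> <state>" per java.lang.Thread.State value, largest first.
count_thread_states() {
    awk '/java.lang.Thread.State:/ {n[$2]++}
         END {for (s in n) print n[s], s}' | sort -rn
}

# Live use, only when NN_PID (the suspect process ID) is set by the caller.
if [ -n "${NN_PID:-}" ]; then
    # Five processes with the most threads.
    ps -eLo pid= 2>/dev/null | count_threads_per_pid | tail -n 5
    # Thread-state histogram of the suspect process.
    if command -v jstack >/dev/null 2>&1; then
        jstack "$NN_PID" | count_thread_states
    fi
fi
```

Running the script with `NN_PID=2346` (the PID from the example above) several times, a few minutes apart, should show whether the WAITING count keeps growing.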
The analysis shows that the NameNode has a built-in mechanism that automatically enables DEBUG logging based on WARN log information. In this environment, replica selection keeps failing, so DEBUG logging is enabled repeatedly and the log4j configuration is modified continuously. Each time the log4j configuration of the component is modified, the process automatically reloads the configuration file and creates new threads. After a long period of time, the accumulated threads trigger this alarm.
To resolve this issue, disable the built-in mechanism that automatically changes the log level.
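To confirm that a process's thread count is still growing, or that it has stopped growing after the fix, the count can be sampled over time. This is a rough sketch that assumes a Linux `/proc` file system. The function names and the `NN_PID` variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: function names and NN_PID are illustrative assumptions.

# stdin:  contents of /proc/<pid>/status
# stdout: value of the "Threads:" field
threads_from_status() {
    awk '$1 == "Threads:" {print $2}'
}

# Print "<time> <thread_count>" for a process several times.
# $1 = PID, $2 = number of samples, $3 = seconds between samples
sample_thread_count() {
    pid=$1; samples=$2; interval=$3
    i=0
    while [ "$i" -lt "$samples" ]; do
        if [ ! -r "/proc/$pid/status" ]; then
            echo "process $pid not found" >&2
            return 1
        fi
        echo "$(date '+%H:%M:%S') $(threads_from_status < "/proc/$pid/status")"
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# Live use, only when NN_PID is set by the caller: 5 samples, 60 s apart.
if [ -n "${NN_PID:-}" ]; then
    sample_thread_count "$NN_PID" 5 60
fi
```

A count that rises steadily between samples matches the behavior described above. A flat count after the NameNode restart suggests that the fix has taken effect.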
Procedure
- Log in to the active and standby NameNodes in the cluster and run the following commands to back up the script:
cd $BIGDATA_HOME/FusionInsight_Current/*_*_NameNode/install/hadoop/sbin
cp hdfs-namenode-period-check.sh /tmp
- Edit the hdfs-namenode-period-check.sh file on the active and standby NameNodes.
vi hdfs-namenode-period-check.sh
Comment out checkBlockplacementLog in the main method. For example:
main() {
    Log $INFO "start period check"
    checkHaState
    checkDefaultFS
    checkAutoBalancer
    checkFsMonitorDirectory
    checkAutoMover
    checkAutoDatamove
    checkAutoNodeLabelrefresh
    checkJournalNodeSync
    checkCheckpoint
    checkCleanAcls
    checkSssdMonitor
    checkOperationCollecter
    checkMapReduceDistributedCache
    #checkBlockplacementLog
    checkAutoDiskBalancer
}
- Save the file, log in to Manager, and choose Cluster > Services > HDFS > Instance. Select all NameNode instances, click More, and choose Restart Instance.
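If the same edit has to be made on several NameNodes, it can be scripted instead of done by hand in vi. The sketch below assumes that `checkBlockplacementLog` appears on a line of its own inside `main`, as in the example above. The function name `comment_out_call` and the `CHECK_SCRIPT` variable are hypothetical. The function also keeps a `.bak` copy next to the file, in addition to the backup in /tmp.

```shell
#!/bin/sh
# Sketch only: comment_out_call and CHECK_SCRIPT are illustrative assumptions.

# Comment out every line that consists solely of the given call name.
# $1 = script file, $2 = function call to disable
comment_out_call() {
    file=$1; call=$2
    # Keep the first backup only, so a re-run never overwrites the original.
    [ -e "$file.bak" ] || cp -p "$file" "$file.bak" || return 1
    tmp=$(mktemp) || return 1
    # Match "<indent><call><optional spaces>" and insert "#" before the call.
    # Lines that are already commented out do not match, so re-runs are safe.
    sed "s/^\([[:space:]]*\)\($call\)[[:space:]]*\$/\1#\2/" "$file" > "$tmp" \
        && cat "$tmp" > "$file"   # write back in place to keep mode and owner
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Live use, only when CHECK_SCRIPT points at hdfs-namenode-period-check.sh.
if [ -n "${CHECK_SCRIPT:-}" ]; then
    comment_out_call "$CHECK_SCRIPT" checkBlockplacementLog
    grep -n 'checkBlockplacementLog' "$CHECK_SCRIPT"
fi
```

Review the `grep` output before restarting the NameNode instances to confirm that only the call inside `main` was changed.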