Updated on 2022-12-09 GMT+08:00

"ALM-12027 Host PID Usage Exceeds the Threshold" Is Generated for a NameNode

Symptom

In clusters of version 3.1.2 and earlier 3.x versions, the alarm "ALM-12027 Host PID Usage Exceeds the Threshold" is generated for a NameNode, and Java processes on the node may report the error "unable to create new native thread".

Cause Analysis

Run the following command to count the threads of each process on the node and sort the processes by thread count:
ps -efT | awk '{print $2}' |sort -n |uniq -c |sort -n

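In the output of this command, the first column is the thread count and the second column is the PID. The process with the most threads is listed last.
Check the process that has started the most threads. In this example, process 2346 is the NameNode process. It has started 54,000 threads, and the number keeps increasing.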
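
The per-process count can also be wrapped in a small helper that skips the ps header line and lists the largest process first. This is a minimal sketch, not part of the product: the function name count_threads_by_pid is hypothetical, and it assumes the `ps -efT` layout in which the second column is the PID and each thread occupies one line.

```shell
#!/bin/sh
# Hypothetical helper: count threads per PID from `ps -efT`-style input on stdin.
# Assumes column 2 is the PID and that line 1 is the ps header.
# Output: "<thread count> <PID>", largest count first.
count_threads_by_pid() {
    awk 'NR > 1 { n[$2]++ } END { for (p in n) print n[p], p }' |
        sort -k1,1nr -k2,2n
}

# Usage on a live node (top five processes by thread count):
#   ps -efT | count_threads_by_pid | head -n 5
# On Linux, every thread consumes an ID from the same pool as PIDs,
# so the totals can be compared with the kernel limit:
#   cat /proc/sys/kernel/pid_max
```

Listing the largest count first makes it easy to see at a glance which process is responsible when the total approaches the limit.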
Print the jstack log of the process several times. The jstack logs show that a large number of NameNode threads are in the WAITING state and have not been released for a long time.
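
To compare jstack dumps taken at different times, it can help to reduce each dump to a count of threads per state. The following is a minimal sketch under stated assumptions: the function name summarize_thread_states is hypothetical, and it relies on the standard `java.lang.Thread.State: <STATE>` lines that jstack prints for Java threads.

```shell
#!/bin/sh
# Hypothetical helper: summarize thread states from jstack output on stdin.
# Each Java thread in a dump has a line such as
#   java.lang.Thread.State: WAITING (parking)
# so the second field of that line is the state name.
# Output: "<count> <STATE>", largest count first.
summarize_thread_states() {
    awk '/java.lang.Thread.State:/ { n[$2]++ } END { for (s in n) print n[s], s }' |
        sort -rn
}

# Usage (run as the OS user that owns the NameNode process;
# 2346 is the example PID from this case):
#   jstack 2346 | summarize_thread_states
```

If the WAITING count grows between successive dumps while the other states stay roughly constant, that is consistent with the thread leak described in this section.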

The analysis shows that the NameNode has a built-in mechanism that automatically enables the DEBUG log function based on WARN log information. In this environment, replica selection keeps failing, so the DEBUG log function is enabled repeatedly and the log4j configuration is modified continuously. Each time the log4j configuration of the component is modified, the process automatically reloads the configuration file and creates new threads. After a long period of time, the accumulated threads trigger this alarm.

To resolve this problem, disable the built-in mechanism that automatically changes the log level.

Procedure

Log in to the active and standby NameNodes in the cluster and run the following commands to back up the script:

cd $BIGDATA_HOME/FusionInsight_Current/*_*_NameNode/install/hadoop/sbin

cp hdfs-namenode-period-check.sh /tmp

Edit the hdfs-namenode-period-check.sh file on the active and standby NameNodes.

vi hdfs-namenode-period-check.sh

Comment out the call to checkBlockplacementLog in the main function. For example:

main()
{
    Log $INFO "start period check"
    checkHaState
    checkDefaultFS
    checkAutoBalancer
    checkFsMonitorDirectory
    checkAutoMover
    checkAutoDatamove
    checkAutoNodeLabelrefresh
    checkJournalNodeSync
    checkCheckpoint
    checkCleanAcls
    checkSssdMonitor
    checkOperationCollecter
    checkMapReduceDistributedCache
    #checkBlockplacementLog
    checkAutoDiskBalancer
}
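
The same edit can be applied without an interactive editor, which may be convenient when both NameNodes need an identical change. The following is a minimal sketch, assuming the script has already been backed up and that the call appears on a line of its own as in the example above. The function name disable_blockplacement_check is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: comment out the bare call to checkBlockplacementLog
# in the script given as $1.
# Only lines that consist solely of the call (with optional indentation)
# are changed, so a definition such as "checkBlockplacementLog()" is untouched.
# The result is written back into the existing file, which keeps its
# owner and permissions. Running the helper twice does not change the file again.
disable_blockplacement_check() {
    script=$1
    tmp=$(mktemp) || return 1
    sed 's/^\([[:space:]]*\)checkBlockplacementLog[[:space:]]*$/\1#checkBlockplacementLog/' \
        "$script" > "$tmp" && cat "$tmp" > "$script"
    status=$?
    rm -f "$tmp"
    return $status
}

# Usage, from the sbin directory of the NameNode installation:
#   disable_blockplacement_check hdfs-namenode-period-check.sh
#   grep -n 'checkBlockplacementLog' hdfs-namenode-period-check.sh
```

Review the grep output before restarting the instances to confirm that only the call inside main was commented out.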

Save the file, log in to Manager, and choose Cluster > Services > HDFS > Instance. Select all NameNode instances, click More, and choose Restart Instance.