"ALM-12027 Host PID Usage Exceeds the Threshold" Is Generated for a NameNode

Symptom

In clusters of version 3.1.2 or an earlier 3.x version, the alarm "ALM-12027 Host PID Usage Exceeds the Threshold" is generated for a NameNode, and Java processes on the node may report the error "unable to create new native thread".

Cause Analysis

Run the following command to count the threads of each process on the node and sort the processes by thread count:
ps -efT | awk '{print $2}' |sort -n |uniq -c |sort -n

In the command output, check the process that has started the most threads. In this example, process 2346 is the NameNode process. It has started 54,000 threads, and the number keeps increasing.
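The per-process count above can also be wrapped in a small helper that prints only the busiest processes. This is a minimal sketch: the function name top_thread_pids is hypothetical, and it assumes the input has the layout of ps -efT output, with a header line and the PID in the second column.

```shell
# Hypothetical helper: read `ps -efT` style output on stdin and print
# "<thread count> <PID>" for the N processes with the most threads.
top_thread_pids() {
    n="${1:-5}"
    # Skip the header line, count one thread per row, keyed by PID (column 2).
    awk 'NR > 1 { count[$2]++ } END { for (pid in count) print count[pid], pid }' \
        | sort -k1,1nr -k2,2n \
        | head -n "$n"
}
```

On a live node it could be used as `ps -efT | top_thread_pids 5`. Unlike the original command, the output is sorted in descending order, so the process with the most threads appears first.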
The jstack log of the process is printed several times. The jstack logs show that a large number of NameNode threads are in the WAITING state and have not been released for a long time.
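To see how many threads are in each state without reading the whole dump, the jstack output can be summarized. This is a minimal sketch: the function name summarize_thread_states is hypothetical, and it assumes the dump contains the usual "java.lang.Thread.State: <STATE>" lines.

```shell
# Hypothetical helper: read a jstack dump on stdin and print
# "<count> <STATE>" for each thread state, most frequent first.
summarize_thread_states() {
    # The state name is the second whitespace-separated field on the line,
    # for example "java.lang.Thread.State: WAITING (parking)".
    awk '/java.lang.Thread.State:/ { state[$2]++ } END { for (s in state) print state[s], s }' \
        | sort -k1,1nr -k2,2
}
```

For the process in this example, a possible invocation is `jstack 2346 | summarize_thread_states`, run as the user that owns the NameNode process. Repeating it a few minutes apart shows whether the WAITING count keeps growing.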

In conclusion, the NameNode has a built-in mechanism that automatically enables DEBUG logging based on WARN log information. In this environment, replica selection keeps failing, so DEBUG logging is enabled repeatedly and the log4j configuration is modified continuously. Each time the log4j configuration of the component is modified, the configuration file is automatically reloaded and new threads are generated. Over time, the accumulated threads trigger this alarm.

To resolve this issue, disable the built-in mechanism that automatically changes the log level.

Procedure

Log in to the active and standby NameNodes in the cluster and run the following commands to back up the script:

cd $BIGDATA_HOME/FusionInsight_Current/*_*_NameNode/install/hadoop/sbin

cp hdfs-namenode-period-check.sh /tmp

Edit the hdfs-namenode-period-check.sh file on the active and standby NameNodes.

vi hdfs-namenode-period-check.sh

Comment out the checkBlockplacementLog call in the main function. For example:

main()
{
    Log $INFO "start period check"
    checkHaState
    checkDefaultFS
    checkAutoBalancer
    checkFsMonitorDirectory
    checkAutoMover
    checkAutoDatamove
    checkAutoNodeLabelrefresh
    checkJournalNodeSync
    checkCheckpoint
    checkCleanAcls
    checkSssdMonitor
    checkOperationCollecter
    checkMapReduceDistributedCache
    #checkBlockplacementLog
    checkAutoDiskBalancer
}
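If the same change has to be made on several nodes, the edit can be scripted instead of done in vi. This is a minimal sketch: the function name comment_out_blockplacement_check is hypothetical, and it assumes the call appears on a line of its own, as in the example above. It reads the script on stdin and writes the modified script to stdout, so the original file is not touched until the result has been reviewed.

```shell
# Hypothetical helper: comment out a bare "checkBlockplacementLog" call.
# Only lines that contain nothing but the call (plus whitespace) are changed,
# so the definition line "checkBlockplacementLog()" and lines that are
# already commented out are left as they are.
comment_out_blockplacement_check() {
    sed -E 's/^([[:space:]]*)(checkBlockplacementLog)[[:space:]]*$/\1#\2/'
}
```

A possible usage is `comment_out_blockplacement_check < hdfs-namenode-period-check.sh > /tmp/period-check.new`, followed by `diff hdfs-namenode-period-check.sh /tmp/period-check.new` to confirm that only the intended line changed before replacing the script. The file name /tmp/period-check.new is only an illustration.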

Save the file, log in to Manager, and choose Cluster > Services > HDFS > Instance. Select all NameNode instances, click More, and choose Restart Instance.
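After the restart, one way to check that the change took effect is to sample the thread count of the new NameNode process a few times and confirm that it no longer grows. This is a minimal sketch: the function name thread_count is hypothetical, and it assumes a Linux host where /proc/<PID>/status contains a "Threads:" line.

```shell
# Hypothetical helper: print the current number of threads of a process,
# read from the "Threads:" field of /proc/<PID>/status (Linux only).
thread_count() {
    awk '/^Threads:/ { print $2 }' "/proc/$1/status"
}
```

For example, run `thread_count <PID>` with the PID of the restarted NameNode process, wait a few minutes, and run it again. The PID changes after a restart, so it needs to be looked up again rather than reused from the analysis above.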
