Updated on 2023-01-11 GMT+08:00

"ALM-12027 Host PID Usage Exceeds the Threshold" Is Generated for a NameNode

Symptom

In clusters of version 3.1.2 or earlier 3.x versions, the alarm "ALM-12027 Host PID Usage Exceeds the Threshold" is generated for a NameNode, and Java processes on that node may report the error "unable to create new native thread".

Cause Analysis

  1. Run the following command to count the threads of each process on the node and sort the processes by thread count:

    ps -efT | awk '{print $2}' |sort -n |uniq -c |sort -n

    Each output line shows a thread count followed by the process ID, with the largest count listed last.

  2. Identify the process that has started the most threads. In this example, process 2346 is the NameNode process. It has started 54,000 threads, and the number keeps increasing.
  3. Print the jstack log of that process several times. The jstack output shows that a large number of NameNode threads are in the WAITING state and have not been released for a long time.
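
The checks in the steps above can be sketched as two small shell helpers. This is a minimal sketch, not part of the product: the function names top_thread_counts and count_thread_states are hypothetical, and it assumes a Linux host with procps ps and a JDK that provides jstack.

```shell
# Hypothetical helper (not part of MRS): reads one PID per line on stdin,
# one line per thread, and prints "count pid" pairs, largest count first.
# The optional argument limits how many processes are shown (default 5).
top_thread_counts() {
    sort -n | uniq -c | sort -rn | head -n "${1:-5}"
}

# Hypothetical helper (not part of MRS): reads a jstack dump on stdin and
# prints the number of threads in each java.lang.Thread.State.
count_thread_states() {
    awk '/java.lang.Thread.State:/ { states[$2]++ }
         END { for (s in states) print s, states[s] }' | sort
}

# Example usage (2346 is the NameNode PID from the example above):
#   ps -eLo pid= | top_thread_counts 5
#   jstack 2346 | count_thread_states
```

Running the second command a few times, some minutes apart, shows whether the WAITING count keeps growing. jstack generally needs to be run as the user that owns the target process.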

The analysis shows that the NameNode has a built-in mechanism that automatically enables DEBUG logging based on WARN log information. In this environment, replica selection keeps failing, so DEBUG logging is enabled repeatedly and the log4j configuration is modified continuously. Each time the log4j configuration of the component is modified, the process automatically reloads the configuration file and generates new threads. After a long period of time, the accumulated threads trigger this alarm.

To resolve this issue, disable the built-in mechanism that automatically changes the log level.

Procedure

  1. Log in to the active and standby NameNodes in the cluster and run the following commands to back up the script:

    cd $BIGDATA_HOME/FusionInsight_Current/*_*_NameNode/install/hadoop/sbin

    cp hdfs-namenode-period-check.sh /tmp

  2. Edit the hdfs-namenode-period-check.sh file on the active and standby NameNodes.

    vi hdfs-namenode-period-check.sh

    Comment out the checkBlockplacementLog call in the main function. For example:

    main()
    {
        Log $INFO "start period check"
        checkHaState
        checkDefaultFS
        checkAutoBalancer
        checkFsMonitorDirectory
        checkAutoMover
        checkAutoDatamove
        checkAutoNodeLabelrefresh
        checkJournalNodeSync
        checkCheckpoint
        checkCleanAcls
        checkSssdMonitor
        checkOperationCollecter
        checkMapReduceDistributedCache
        #checkBlockplacementLog
        checkAutoDiskBalancer
    }

  3. Save the file, log in to Manager, and choose Cluster > Services > HDFS > Instance. Select all NameNode instances, click More, and choose Restart Instance.
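
The manual edit in step 2 can also be scripted, which helps keep the active and standby NameNodes consistent. The following is a minimal sketch, assuming GNU sed and that the call appears on a line of its own, as in the example above. The function name comment_out_check is hypothetical. Run it after the backup in step 1 and before the restart in step 3.

```shell
# Hypothetical helper (not part of MRS): comments out the standalone
# checkBlockplacementLog call in the given script, keeping indentation.
# Lines that are already commented, and the function definition
# "checkBlockplacementLog()", do not match the pattern and stay unchanged.
comment_out_check() {
    local script="$1"
    sed -i 's/^\([[:space:]]*\)checkBlockplacementLog[[:space:]]*$/\1#checkBlockplacementLog/' "$script"
}

# Example usage, from the sbin directory used in step 1:
#   comment_out_check hdfs-namenode-period-check.sh
#   grep -n 'checkBlockplacementLog' hdfs-namenode-period-check.sh
```

The grep command lists every remaining occurrence so that you can confirm the call inside main is now prefixed with #. Running the helper twice is harmless because a commented line no longer matches.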