HDFS Failed to Start Due to Insufficient Memory

Symptom

After the HDFS service is restarted, HDFS is in the Bad state, the NameNode instance status is abnormal, and the NameNode cannot exit safe mode for a long time.

Cause Analysis

In the NameNode run log (/var/log/Bigdata/hdfs/nn/hadoop-omm-namenode-XXX.log), search for WARN entries. The log shows a JVM pause of about 63 seconds caused by garbage collection (GC).

```
2017-01-22 14:52:32,641 | WARN  | org.apache.hadoop.util.JvmPauseMonitor$Monitor@1b39fd82 | Detected pause in JVM or host machine (eg GC): pause of approximately 63750ms
GC pool 'ParNew' had collection(s): count=1 time=0ms
GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=63924ms | JvmPauseMonitor.java:189
```
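
On a large log, searching for WARN by hand can be slow. The following is a minimal sketch of how the pause warnings could be filtered automatically. It assumes the message format shown in the excerpt above; the function name `find_long_pauses` and the 10-second default threshold are illustrative choices, not values from the product.

```python
import re

# Matches the JvmPauseMonitor warning in the format shown in the excerpt above.
PAUSE_RE = re.compile(
    r"Detected pause in JVM or host machine.*?pause of approximately (\d+)ms"
)


def find_long_pauses(lines, min_ms=10000):
    """Return (timestamp, pause_ms) for each logged pause of at least min_ms.

    min_ms is a hypothetical cutoff chosen for illustration.
    """
    hits = []
    for line in lines:
        match = PAUSE_RE.search(line)
        if match and int(match.group(1)) >= min_ms:
            # The timestamp is the first " | "-separated field of the log line.
            timestamp = line.split(" | ", 1)[0].strip()
            hits.append((timestamp, int(match.group(1))))
    return hits


sample = [
    "2017-01-22 14:52:32,641 | WARN  | org.apache.hadoop.util.JvmPauseMonitor$Monitor@1b39fd82 | "
    "Detected pause in JVM or host machine (eg GC): pause of approximately 63750ms",
    "GC pool 'ParNew' had collection(s): count=1 time=0ms",
    "GC pool 'ConcurrentMarkSweep' had collection(s): count=1 time=63924ms | JvmPauseMonitor.java:189",
]
print(find_long_pauses(sample))  # [('2017-01-22 14:52:32,641', 63750)]
```

To scan a real log, pass an open file object instead of `sample`, for example `find_long_pauses(open(path))`, where `path` points to the NameNode run log.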

Continue analyzing the NameNode log /var/log/Bigdata/hdfs/nn/hadoop-omm-namenode-XXX.log. It shows that the NameNode is still waiting for block reports from DataNodes and that the total number of blocks is very large. In the following example, the total number of blocks is about 36.29 million.
```
2017-01-22 14:52:32,641 | INFO  | IPC Server handler 8 on 25000 | STATE* Safe mode ON. 
The reported blocks 29715437 needs additional 6542184 blocks to reach the threshold 0.9990 of total blocks 36293915.
```
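
The numbers in the safe mode message are related by simple arithmetic: the reported blocks plus the additional blocks needed (29715437 + 6542184 = 36257621) equal the number of blocks required to leave safe mode, which is the threshold 0.9990 applied to the total of 36293915. The sketch below parses the message and derives these values. The function name and the returned field names are illustrative, and truncating the product to an integer is an assumption that happens to match the logged figures.

```python
import re

# Matches the safe mode progress message in the format shown above.
SAFE_MODE_RE = re.compile(
    r"The reported blocks (\d+) needs additional (\d+) blocks "
    r"to reach the threshold ([\d.]+) of total blocks (\d+)"
)


def safe_mode_progress(message):
    """Parse a safe mode message and return the derived block counts."""
    match = SAFE_MODE_RE.search(message)
    if match is None:
        return None
    reported = int(match.group(1))
    additional = int(match.group(2))
    threshold = float(match.group(3))
    total = int(match.group(4))
    return {
        "reported": reported,
        "required": reported + additional,            # blocks needed to exit safe mode
        "expected_required": int(total * threshold),  # assumption: product is truncated
        "total": total,
        "reported_fraction": reported / total,
    }


message = (
    "The reported blocks 29715437 needs additional 6542184 blocks "
    "to reach the threshold 0.9990 of total blocks 36293915."
)
progress = safe_mode_progress(message)
print(progress["required"], progress["expected_required"])
print(f"{progress['reported_fraction']:.1%} of blocks reported")
```

In this example roughly 82% of the blocks have been reported, so the NameNode remains in safe mode until block reports bring that share up to 99.9%.
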
On Manager, check the GC_OPTS parameter of the NameNode:
Figure 1 Checking the GC_OPTS parameter of the NameNode

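Compare the configured memory with the reference values in Table 1, which maps the NameNode memory configuration to the data volume. If the configured heap is smaller than the reference value for the current data volume, the NameNode memory is insufficient, which explains the long GC pauses.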

**Table 1** Mapping between NameNode memory configuration and data volume

| Number of File Objects | Reference Value |
| --- | --- |
| 10,000,000 | -Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M |
| 20,000,000 | -Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G |
| 50,000,000 | -Xms32G -Xmx32G -XX:NewSize=2G -XX:MaxNewSize=3G |
| 100,000,000 | -Xms64G -Xmx64G -XX:NewSize=4G -XX:MaxNewSize=6G |
| 200,000,000 | -Xms96G -Xmx96G -XX:NewSize=8G -XX:MaxNewSize=9G |
| 300,000,000 | -Xms164G -Xmx164G -XX:NewSize=12G -XX:MaxNewSize=12G |
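
As a rough illustration of how Table 1 can be applied, the sketch below picks the first row whose object count is at least the observed count. Rounding up to the next row is an assumption, as is treating the block count from the log as the number of file objects. The function name `reference_memory_opts` is hypothetical.

```python
# Rows copied from Table 1: (number of file objects, reference memory flags).
TABLE_1 = [
    (10_000_000, "-Xms6G -Xmx6G -XX:NewSize=512M -XX:MaxNewSize=512M"),
    (20_000_000, "-Xms12G -Xmx12G -XX:NewSize=1G -XX:MaxNewSize=1G"),
    (50_000_000, "-Xms32G -Xmx32G -XX:NewSize=2G -XX:MaxNewSize=3G"),
    (100_000_000, "-Xms64G -Xmx64G -XX:NewSize=4G -XX:MaxNewSize=6G"),
    (200_000_000, "-Xms96G -Xmx96G -XX:NewSize=8G -XX:MaxNewSize=9G"),
    (300_000_000, "-Xms164G -Xmx164G -XX:NewSize=12G -XX:MaxNewSize=12G"),
]


def reference_memory_opts(object_count):
    """Return the reference flags of the smallest row that covers object_count.

    Returns None when the count exceeds the largest row in Table 1.
    """
    for limit, opts in TABLE_1:
        if object_count <= limit:
            return opts
    return None


# Total block count taken from the safe mode log message.
print(reference_memory_opts(36_293_915))
# -Xms32G -Xmx32G -XX:NewSize=2G -XX:MaxNewSize=3G
```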

Solution

1. Modify the NameNode memory settings in the GC_OPTS parameter based on the reference values in Table 1. For example, if the number of blocks is about 36 million, use the values for 50,000,000 file objects: -Xms32G -Xmx32G -XX:NewSize=2G -XX:MaxNewSize=3G.
2. Restart one NameNode and check that it starts normally.
3. Restart the other NameNode and check that the service status on the page returns to normal.
