Manually Performing Checkpoints When a NameNode Is Faulty for a Long Time
Symptom
If the standby NameNode is faulty for a long time, a large number of editlog files accumulate. If HDFS or the active NameNode is then restarted, the active NameNode must read all of the unmerged editlog files. As a result, the restart takes a long time and may even fail.
Cause Analysis
The standby NameNode periodically merges editlog files and generates an fsimage file. This process is called a checkpoint. After the fsimage file is generated, the standby NameNode transfers it to the active NameNode.
Because checkpoints are performed by the standby NameNode, no merging takes place while the standby NameNode is abnormal. As a result, the active NameNode needs to load many editlog files during its next startup, which consumes a large amount of memory and takes a long time.
The checkpoint interval is determined by the following parameters. With the default values, a checkpoint is triggered when 30 minutes have passed since the last checkpoint or when one million operations have been performed on HDFS, whichever comes first.
- dfs.namenode.checkpoint.period: specifies the checkpoint period. The default value is 1800s.
- dfs.namenode.checkpoint.txns: specifies the number of operations that triggers a checkpoint. The default value is 1000000.
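The trigger condition described by these two parameters can be sketched as a small shell function. This is only an illustration of the "either threshold" rule, not NameNode code: the function name checkpoint_due and its arguments are hypothetical, and the defaults are the values listed on this page.

```shell
# Hypothetical helper (not part of HDFS): prints "yes" if a checkpoint
# would be due under the two thresholds, "no" otherwise.
#   $1 = seconds since the last checkpoint
#   $2 = operations recorded since the last checkpoint
#   $3 = dfs.namenode.checkpoint.period (optional, default 1800)
#   $4 = dfs.namenode.checkpoint.txns   (optional, default 1000000)
checkpoint_due() {
  cp_elapsed=$1
  cp_txns=$2
  cp_period=${3:-1800}
  cp_limit=${4:-1000000}
  # Either threshold alone is enough to trigger a checkpoint.
  if [ "$cp_elapsed" -ge "$cp_period" ] || [ "$cp_txns" -ge "$cp_limit" ]; then
    echo yes
  else
    echo no
  fi
}

checkpoint_due 600 1200000   # operation threshold reached
checkpoint_due 600 5000      # neither threshold reached
```

On a client where the environment has been loaded, the values actually in effect can usually be read with hdfs getconf -confKey followed by the parameter name.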
Solution
Before restarting HDFS or the active NameNode, perform a checkpoint manually to merge the metadata of the active NameNode.
- Stop workloads.
- Obtain the hostname of the active NameNode.
- Run the following commands on the client:
source /opt/client/bigdata_env
kinit Component user
Note: Replace /opt/client with the actual installation path of the client, and replace Component user with the name of the component user.
- Run the following command to put the active NameNode into safe mode (replace linux22 with the hostname of the active NameNode):
hdfs dfsadmin -fs linux22:25000 -safemode enter
- Run the following command to merge the editlog files on the active NameNode:
hdfs dfsadmin -fs linux22:25000 -saveNamespace
- Run the following command to make the active NameNode exit the safe mode:
hdfs dfsadmin -fs linux22:25000 -safemode leave
- Check whether the merge is complete.
cd /srv/BigData/namenode/current
Check whether the timestamp of the most recently generated fsimage file is close to the current time. If yes, the merge is complete.
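The safe mode, merge, and verification steps above can be sketched as one script. This is a minimal sketch under stated assumptions: the variables NN_HOST, NN_PORT, and DRY_RUN and the functions run, manual_checkpoint, and newest_fsimage are hypothetical names introduced here, and the client environment is assumed to have been loaded and authenticated already. With DRY_RUN=1 (the default in this sketch) the commands are only printed.

```shell
# Placeholders: replace with the hostname and port of the active NameNode.
NN_HOST=${NN_HOST:-linux22}
NN_PORT=${NN_PORT:-25000}
# DRY_RUN=1 prints each command instead of executing it.
DRY_RUN=${DRY_RUN:-1}

# Run a command, or only print it in dry-run mode.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Enter safe mode, save the namespace, then leave safe mode.
manual_checkpoint() {
  run hdfs dfsadmin -fs "$NN_HOST:$NN_PORT" -safemode enter || return 1
  run hdfs dfsadmin -fs "$NN_HOST:$NN_PORT" -saveNamespace
  cp_status=$?
  # Design choice of this sketch: leave safe mode even if the merge failed.
  run hdfs dfsadmin -fs "$NN_HOST:$NN_PORT" -safemode leave || return 1
  return "$cp_status"
}

# Print the most recently modified fsimage file in a directory,
# skipping the accompanying .md5 files.
newest_fsimage() {
  ls -t "$1"/fsimage_* 2>/dev/null | grep -v '\.md5$' | head -n 1
}

manual_checkpoint
```

After a real run (DRY_RUN=0), calling newest_fsimage /srv/BigData/namenode/current and checking that file's modification time with ls -l gives the same confirmation as the manual check. The script deliberately attempts to leave safe mode even when the merge step fails, so that the NameNode is not left read-only. Adjust this if you prefer to inspect the failure first.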