DataNode Restarts Unexpectedly
Symptom
A DataNode restarts unexpectedly even though no manual restart operation was performed on it.
Cause Analysis
Possible causes:
- The Java process is killed because of OOM.
In general, the OOM Killer is configured for Java processes to detect out-of-memory conditions and kill the process, and the OOM log is printed in the out log. In this case, view the run log (for example, the DataNode's log path is /var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-hostname.log) to check whether OutOfMemory is printed, as shown in the sketch after this analysis.
- The DataNode is killed manually or by another process.
Check the DataNode run log file /var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-hostname.log. It shows that the health check fails after "RECEIVED SIGNAL 15" is received. In the following example, the DataNode is killed at 11:04:48 and started again at 11:06:52, about two minutes later.
2018-12-06 11:04:48,433 | ERROR | SIGTERM handler | RECEIVED SIGNAL 15: SIGTERM | LogAdapter.java:69
2018-12-06 11:04:48,436 | INFO | Thread-1 | SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at 192-168-235-85/192.168.235.85
************************************************************/ | LogAdapter.java:45
2018-12-06 11:06:52,744 | INFO | main | STARTUP_MSG:
According to the logs, the DataNode was shut down and the health check then reported the exception. Two minutes later, NodeAgent started the DataNode process again.
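To confirm or rule out the OOM cause, you can search the run log for OutOfMemory directly. A minimal sketch, assuming the example log path above (the hostname part of the file name varies by node):
# Search the DataNode run log for OOM messages; the wildcard matches the
# node-specific hostname suffix in the example path.
grep -n "OutOfMemory" /var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-*.log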
Procedure
Add a rule for recording the kill command to the audit log of the operating system. The process that delivers the kill command is then recorded in the audit log.
Operation impact
- Printing audit logs affects operating system performance. However, analysis shows that the impact is less than 1%.
- Printing audit logs occupies some disk space. The logs to be printed are within megabytes. By default, log aging and remaining disk space checks are configured, so the disk space will not be used up. The sketch below shows how to check the current usage.
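If you want to verify how much space the audit logs occupy and how they are rotated, you can inspect the auditd defaults (a sketch, assuming the standard /var/log/audit directory and /etc/audit/auditd.conf file):
# Space currently used by audit logs (default location, an assumption):
du -sh /var/log/audit
# Rotation settings: max_log_file is the per-file size limit in MB, num_logs
# is how many rotated files are kept, and max_log_file_action is what happens
# when the limit is reached (for example, ROTATE).
grep -E "^(max_log_file|num_logs|max_log_file_action)" /etc/audit/auditd.conf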
Locating Method
Perform the following operations on the nodes where the DataNode process may restart:
- Log in to the node as the root user and run the service auditd status command to check the service status.
Checking for service auditd running
If the service is not started, run the service auditd restart command to restart the service. The command execution takes less than 1 second and has no impact on the system.
Shutting down auditd done
Starting auditd done
- Temporarily add an audit rule for the kill command (see the annotated sketch after this list).
Add an audit rule:
auditctl -a exit,always -F arch=b64 -S kill -S tkill -S tgkill -F a1!=0 -k process_killed
View the rule:
auditctl -l
- If a process is killed due to an exception, you can run the ausearch -k process_killed command to query the kill history.
In the output, a0 is the PID (in hexadecimal) of the killed process, and a1 is the signal number passed to the kill command.
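For reference, the following sketch reads the rule above field by field and shows how to decode the hexadecimal a0/a1 values (the sample values 0x4d2 and 0xf are illustrative, not from a real record):
# Field-by-field reading of the audit rule added above:
#   -a exit,always              log each matching syscall when it exits
#   -F arch=b64                 match the 64-bit syscall table; on hosts that
#                               also run 32-bit binaries (an assumption about
#                               your environment), add an equivalent arch=b32 rule
#   -S kill -S tkill -S tgkill  the three signal-delivery syscalls
#   -F a1!=0                    skip signal 0, which only probes whether a PID exists
#   -k process_killed           key to search with: ausearch -k process_killed
#
# Audit records print a0/a1 as bare hex digits; prepend 0x before converting:
printf '%d\n' 0x4d2   # prints 1234, the decimal PID of the killed process (a0)
printf '%d\n' 0xf     # prints 15, that is, SIGTERM (a1)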
Verification
- On MRS Manager, restart an instance on the node, for example, the DataNode.
- Run the ausearch -k process_killed command to check whether logs are printed.
The following is an example of the ausearch -k process_killed |grep ".sh" command. The command output indicates that the hdfs-daemon-ada* script closed the DataNode process.
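If the raw records are hard to read, ausearch can translate numeric fields (syscall numbers, signal values, UIDs) into names. A minimal sketch using its interpret flag:
# -i interprets numeric values, so the syscall and signal appear by name
ausearch -k process_killed -i | grep ".sh"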
Stop auditing the kill command.
- Run the service auditd restart command. The temporarily added audit rule for the kill command is cleared automatically.
- Run the auditctl -l command. If no rule about killing a process is returned, the rule has been cleared successfully.