Processes Are Terminated Unexpectedly

A DataNode restarts unexpectedly even though no manual restart operation was performed on it.

Possible causes:

The Java process is terminated because of an out-of-memory (OOM) error.
In general, the OOM Killer is configured for Java processes to detect OOM conditions and kill the affected process, and the OOM information is printed in the out log. In this case, view the run log (for example, the DataNode run log is /var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-hostname.log) to check whether OutOfMemory is printed.
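The check for OutOfMemory in the run log can be sketched as a small shell helper. This is only an illustration: the function name count_oom is hypothetical, and the log path in the usage comment is the example path from the text, where hostname stands for the actual host name.

```shell
# Hypothetical helper: print how many lines of a log file mention OutOfMemory.
count_oom() {
    # grep -c prints the number of matching lines; it exits with status 1
    # when there are no matches, so "|| true" keeps the helper from failing.
    grep -c 'OutOfMemory' "$1" || true
}

# Usage (path taken from the text; replace "hostname" with the real host name):
#   count_oom /var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-hostname.log
```

A result greater than 0 suggests that the process ran out of memory. A result of 0 points toward the other cause described in this section.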

The process is terminated by another process or manually terminated.
Check the DataNode run log file /var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-hostname.log. The log shows that "RECEIVED SIGNAL 15" was received and that the health check failed afterward.
In the following example, the DataNode is terminated at 11:04:48 and started again at 11:06:52, about two minutes later.
```
2018-12-06 11:04:48,433 | ERROR | SIGTERM handler | RECEIVED SIGNAL 15: SIGTERM | LogAdapter.java:69
2018-12-06 11:04:48,436 | INFO  | Thread-1 | SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at 192-168-235-85/192.168.235.85
************************************************************/ | LogAdapter.java:45
2018-12-06 11:06:52,744 | INFO  | main | STARTUP_MSG:
```
According to the logs, the DataNode was shut down first and the health check then reported the exception. About 2 minutes later, NodeAgent started the DataNode process.
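To spot this pattern in a long run log, the SIGTERM and startup lines can be pulled out with awk. The following is a minimal sketch that assumes the log layout shown in the example above (a 19-character timestamp at the start of each record); the function name restart_events is hypothetical.

```shell
# Hypothetical helper: list the time of each SIGTERM and each startup banner.
restart_events() {
    awk '
        # Record written when the process receives signal 15.
        /RECEIVED SIGNAL 15/   { print substr($0, 1, 19), "SIGTERM" }
        # First line of the startup banner; it begins with a timestamp,
        # which excludes the untimestamped banner lines that follow it.
        /^[0-9].*STARTUP_MSG:/ { print substr($0, 1, 19), "STARTUP" }
    ' "$1"
}

# Usage (path taken from the text; replace "hostname" with the real host name):
#   restart_events /var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-hostname.log
```

A SIGTERM entry followed shortly by a STARTUP entry matches the kill-then-restart sequence described here.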

To find out which process sent the signal, add a rule for recording the kill command to the audit log of the OS. The process that delivers the command is then recorded in the audit log.

Operation impact

Printing audit logs affects operating system performance, but analysis shows that the impact is less than 1%.
Printing audit logs occupies some disk space, but the logs amount to only a few megabytes. By default, log aging and a check of the remaining disk space are configured, so the disk space will not be used up.

Locating Method

Perform the following operations on nodes that may restart the DataNode process:

Log in to the node as the root user and run the service auditd status command to check the service status.
```
Checking for service auditd  running
```
If the service is not started, run the service auditd restart command to restart the service. The command execution takes less than 1 second and has no impact on the system.
```
Shutting down auditd done
Starting auditd done
```
Temporarily add an audit rule for the kill command.

Add an audit rule:

auditctl -a exit,always -F arch=b64 -S kill -S tkill -S tgkill -F a1!=0 -k process_killed

View the rule:

auditctl -l
If a process is terminated unexpectedly, run the ausearch -k process_killed command to query the termination history.

In the output, a0 is the PID (in hexadecimal) of the process that is terminated, and a1 is the signal number passed to the kill command.
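Because audit records print these arguments in hexadecimal without a 0x prefix, it can help to convert them before comparing them with ps output or signal names. The helper below is a small sketch; the name hex_to_dec and the sample values 1a2b and f are made up for illustration.

```shell
# Hypothetical helper: convert a hexadecimal audit argument to decimal.
hex_to_dec() {
    # Strip an optional leading 0x, then let printf parse the value as hex.
    printf '%d\n' "0x${1#0x}"
}

# Usage with made-up values:
#   hex_to_dec 1a2b   # a0: prints the PID in decimal (6699)
#   hex_to_dec f      # a1: prints the signal number (15, that is, SIGTERM)
```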

Verification

Restart an instance of the node on MRS Manager, for example, DataNode.
Run the ausearch -k process_killed command to check whether logs are printed.

For example, running the ausearch -k process_killed |grep ".sh" command produces output indicating that the hdfs-daemon-ada* script closed the DataNode process.

Stop auditing the kill command.

Run the service auditd restart command. The temporarily added audit rule for the kill command is cleared automatically.
Run the auditctl -l command. If no related information is displayed, the rule is cleared successfully.
