Updated on 2022-12-14 GMT+08:00

DataNode Restarts Unexpectedly

Symptom

A DataNode is restarted unexpectedly, but no manual restart operation is performed for the DataNode.

Cause Analysis

Possible causes:

  • The Java process is killed because of an out-of-memory (OOM) error.

    In general, the OOM Killer is configured for Java processes to detect OOM errors and kill the affected process. The OOM log is printed in the out log. In this case, view the run log (for example, the DataNode run log is /var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-hostname.log) to check whether OutOfMemory is printed.

  • The DataNode is manually killed or killed by another process.
    Check the DataNode run log file /var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-hostname.log. The log shows that "RECEIVED SIGNAL 15" is received and that the health check then fails. In the following example, the DataNode is killed at 11:04:48 and started again at 11:06:52, about two minutes later.
    2018-12-06 11:04:48,433 | ERROR | SIGTERM handler | RECEIVED SIGNAL 15: SIGTERM | LogAdapter.java:69
    2018-12-06 11:04:48,436 | INFO  | Thread-1 | SHUTDOWN_MSG:
    /************************************************************
    SHUTDOWN_MSG: Shutting down DataNode at 192-168-235-85/192.168.235.85
    ************************************************************/ | LogAdapter.java:45
    2018-12-06 11:06:52,744 | INFO  | main | STARTUP_MSG:

    According to the logs, the DataNode was shut down and then the health check reported an exception. About 2 minutes later, NodeAgent started the DataNode process.
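
The two log checks described above can be sketched as a small shell helper. This is a minimal sketch, not part of the product: the function name classify_dn_stop is hypothetical, and it only searches for the two strings mentioned in this section (OutOfMemory and RECEIVED SIGNAL 15).

```shell
#!/bin/sh
# Hypothetical helper: guess why a DataNode stopped by searching its run log
# for the two strings named in this section.
classify_dn_stop() {
    log_file="$1"
    if grep -q "OutOfMemory" "$log_file"; then
        echo "oom"        # the Java process ran out of memory
    elif grep -q "RECEIVED SIGNAL 15" "$log_file"; then
        echo "sigterm"    # the process was killed manually or by another process
    else
        echo "unknown"    # neither string was found in the log
    fi
}

# Usage (replace "hostname" in the path with the actual host name):
# classify_dn_stop /var/log/Bigdata/hdfs/dn/hadoop-omm-datanode-hostname.log
```

If the helper prints sigterm, continue with the audit procedure below to find out which process sent the signal.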

Procedure

Add a rule to the operating system audit configuration so that kill commands are recorded in the audit log. The process that issues a kill command is then recorded in the audit log.

Operation impact

  • Printing audit logs affects operating system performance. However, analysis shows that the impact is less than 1%.
  • Printing audit logs occupies some disk space. The logs generated are only a few megabytes in size. By default, a log aging mechanism and a check of the remaining disk space are configured, so the disk space will not be used up.

Locating Method

Perform the following operations on nodes that may restart the DataNode process:

  1. Log in to the node as the root user and run the service auditd status command to check the service status.

    Checking for service auditd  running

    If the service is not started, run the service auditd restart command to start it. The command execution takes less than 1 second and has no impact on the system.

    Shutting down auditd done
    Starting auditd done

  2. Temporarily add an audit rule for the kill command.

    Add an audit rule:

    auditctl -a exit,always -F arch=b64 -S kill -S tkill -S tgkill -F a1!=0 -k process_killed

    View the rule:

    auditctl -l

  3. If a process is killed due to an exception, you can run the ausearch -k process_killed command to query the kill history.

    In the query result, a0 is the PID (in hexadecimal) of the process that is killed, and a1 is the signal number (also in hexadecimal) passed to the kill command.
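
Because a0 and a1 are printed in hexadecimal, it helps to convert them to decimal before comparing them with the output of ps or with signal numbers. The following is a minimal sketch: the function name decode_kill_args and the sample values are hypothetical, and the interpretation of a1 as the signal number assumes that the record comes from the kill system call.

```shell
#!/bin/sh
# Hypothetical helper: convert the hexadecimal a0 (PID) and a1 (signal)
# values of an audit record into decimal.
decode_kill_args() {
    a0_hex="$1"
    a1_hex="$2"
    # printf interprets the 0x prefix as a hexadecimal number.
    printf 'pid=%d signal=%d\n' "0x$a0_hex" "0x$a1_hex"
}

# Illustrative values only: a0=1a2b and a1=f
decode_kill_args 1a2b f    # prints "pid=6699 signal=15"
```

Signal 15 is SIGTERM, which matches the "RECEIVED SIGNAL 15" entry in the DataNode run log.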

Verification

  1. On MRS Manager, restart an instance on the node, for example, DataNode.
  2. Run the ausearch -k process_killed command to check whether logs are printed.

    For example, run the ausearch -k process_killed |grep ".sh" command. In this example, the command output indicates that the hdfs-daemon-ada* script stopped the DataNode process.

Stop auditing the kill command.

  1. Run the service auditd restart command. The temporarily added audit rule for the kill command is cleared automatically.
  2. Run the auditctl -l command. If no rule for killing a process is returned, the rule is cleared successfully.
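
The final check can also be scripted. The following is a minimal sketch that assumes the rule was added with the key process_killed as shown earlier; the function name kill_rule_present is hypothetical. It reads the output of auditctl -l from standard input and succeeds only if the key is still listed.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the rule list read from standard input
# still contains the process_killed key.
kill_rule_present() {
    grep -q "process_killed"
}

# Usage (run as the root user on the node):
# if auditctl -l | kill_rule_present; then
#     echo "The kill audit rule is still active."
# else
#     echo "The kill audit rule is cleared."
# fi
```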