Updated on 2024-12-18 GMT+08:00

A DataNode of HDFS Is Always in the Decommissioning State

Issue

A DataNode of HDFS is in the Decommissioning state for a long period of time.

Symptom

A DataNode of HDFS fails to be decommissioned (or a Core node fails to be scaled in), and the DataNode remains in the Decommissioning state.

Cause Analysis

If the Master node is restarted or the NodeAgent process exits unexpectedly while a DataNode is being decommissioned (or a Core node is being scaled in), the decommissioning or scale-in task fails and the blacklist is not cleared. As a result, the DataNode remains in the Decommissioning state, and the blacklist must be cleared manually.
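
To confirm that a leftover blacklist entry is the cause, you can check whether the IP address of the DataNode still has a line of its own in the excludeHosts file. The following is a minimal sketch, not part of the product tooling: is_blacklisted is a hypothetical helper name, and the demonstration runs on a scratch file instead of the real excludeHosts file.

```shell
#!/bin/sh
# is_blacklisted IP FILE  (hypothetical helper)
# Succeeds only if FILE contains a line that is exactly IP.
# -F: treat IP as a fixed string, -x: match the whole line, -q: print nothing.
is_blacklisted() {
    grep -Fxq -- "$1" "$2"
}

# Demonstration on a scratch file only.
scratch=$(mktemp)
printf '%s\n' 192.168.0.10 192.168.0.11 > "$scratch"

for ip in 192.168.0.10 192.168.0.1; do
    if is_blacklisted "$ip" "$scratch"; then
        echo "$ip is still in the blacklist"
    else
        echo "$ip is not in the blacklist"
    fi
done

rm -f "$scratch"
```

The whole-line match is used so that 192.168.0.1 is not mistaken for 192.168.0.10 or 192.168.0.11.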

Procedure

  1. Go to the service instance page.

    MRS Manager:

    Log in to MRS Manager and choose Services > HDFS > Instance.

    FusionInsight Manager:

    MRS 3.x or later: Log in to FusionInsight Manager and choose Cluster > Service > HDFS > Instance.

    Log in to the MRS console and choose Components > HDFS > Instances.

  2. Check the HDFS service instance status, locate the DataNode that is in the Decommissioning state, and copy its IP address.
  3. Log in to the Master1 node and run the cd ${BIGDATA_HOME}/MRS_*/1_*_NameNode/etc/ command to go to the blacklist directory.
  4. Run the sed -i "/^IP$/d" excludeHosts command to remove the faulty DataNode from the blacklist. Replace IP in the command with the IP address of the faulty DataNode obtained in step 2. The IP address cannot contain spaces.
  5. If there are two Master nodes, perform steps 3 and 4 on Master2 as well.
  6. Run the following command on the Master1 node to initialize environment variables:

    source Client installation directory/bigdata_env

  7. If Kerberos authentication is enabled for the current cluster, run the following command to authenticate the user (skip this step if Kerberos authentication is disabled):

    kinit Service user who has the HDFS operation permission

  8. Run the following command on the Master1 node to update the HDFS blacklist:

    hdfs dfsadmin -refreshNodes

  9. Run the hdfs dfsadmin -report command to check the status of each DataNode. Ensure that the DataNode with the IP address obtained in step 2 has been restored to the Normal state.

    Figure 1 DataNode status

  10. Go to the service instance page.

    MRS Manager:

    Log in to MRS Manager and choose Services > HDFS > Instances.

    FusionInsight Manager:

    MRS 3.x or later: Log in to FusionInsight Manager and choose Cluster > Service > HDFS > Instance.

    Log in to the MRS console and choose Components > HDFS > Instances.

  11. Select the DataNode instance that is in the Decommissioning state and choose More > Restart Instance.
  12. Wait until the restart is complete and check whether the DataNode is restored.
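
The blacklist cleanup in step 4 and the status check in step 9 can be sketched as two small shell helpers. This is a minimal sketch under stated assumptions, not the product's own tooling: remove_host and datanode_status are hypothetical names, the sample report text is illustrative and assumes the usual "Name:" and "Decommission Status :" lines printed by hdfs dfsadmin -report, and the demonstration runs on scratch data instead of the real excludeHosts file.

```shell
#!/bin/sh
# remove_host IP FILE  (hypothetical helper)
# Deletes every line of FILE that consists of exactly IP, as in step 4.
remove_host() {
    # Escape dots so that "." matches a literal dot, not any character.
    pattern=$(printf '%s' "$1" | sed 's/\./\\./g')
    sed -i "/^${pattern}\$/d" "$2"
}

# datanode_status IP  (hypothetical helper)
# Reads report text on stdin and prints the decommission status of IP.
datanode_status() {
    awk -v ip="$1" '
        # Remember the address of the DataNode section being read.
        $1 == "Name:" { split($2, addr, ":"); current = addr[1] }
        # Print the status text of the requested DataNode and stop.
        /^Decommission Status/ && current == ip {
            sub(/^Decommission Status *: */, "")
            print
            exit
        }
    '
}

# Demonstration on a scratch file only.
scratch=$(mktemp)
printf '%s\n' 192.168.0.1 192.168.0.10 > "$scratch"
remove_host 192.168.0.1 "$scratch"
cat "$scratch"    # only 192.168.0.10 is left
rm -f "$scratch"

# Illustrative report text (assumed layout, not real cluster output).
datanode_status 192.168.0.10 <<'EOF'
Name: 192.168.0.10:9866 (node-a)
Decommission Status : Normal
Name: 192.168.0.11:9866 (node-b)
Decommission Status : Decommission in progress
EOF
```

On a cluster, the second helper would read the real report instead, for example hdfs dfsadmin -report | datanode_status followed by the IP address obtained in step 2. Anchoring the pattern with ^ and $ keeps the removal of 192.168.0.1 from also deleting 192.168.0.10.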

Summary and Suggestions

Do not perform high-risk operations, such as restarting nodes, during decommissioning (or scale-in).

Related Information

None