DataNode Fails to Be Started When the Number of Disks Defined in dfs.datanode.data.dir Equals the Value of dfs.datanode.failed.volumes.tolerated
Symptom
DataNode fails to be started when the number of disks defined in dfs.datanode.data.dir (location of DataNode storage blocks in the local file system) equals the value of dfs.datanode.failed.volumes.tolerated (number of volumes allowed to fail before the DataNode stops providing services).
Solution
By default, if a single disk is faulty, the HDFS DataNode process stops. As a result, the NameNode schedules additional replicas for every block stored on that DataNode, causing replication traffic to the remaining healthy disks.
To prevent this problem, configure the number of dfs.datanode.data.dir volumes whose failure a DataNode tolerates.
- Log in to FusionInsight Manager and choose Cluster > Services > HDFS. Click Configurations and then All Configurations.
- Search for the dfs.datanode.failed.volumes.tolerated parameter.
This parameter specifies the number of volumes that are allowed to fail before the DataNode stops providing services. The value ranges from -1 to one less than the number of disk volumes configured on the DataNode. The default value is -1, which means that at least one volume must remain valid: up to n-1 volume failures are tolerated, where n is the number of configured volumes. A value greater than or equal to 0 specifies the exact number of volumes that are allowed to fail.
For example, if this parameter is set to 3, the DataNode fails to start only when four or more directories are faulty. The parameter value therefore directly determines whether the DataNode can start.
To prevent DataNode startup failures, the value of dfs.datanode.failed.volumes.tolerated must be less than the number of configured volumes. Alternatively, set it to -1, which is equivalent to n-1; the DataNode then starts normally as long as at least one volume is valid. The sketch after these steps illustrates this validity check.
- Save the change and restart the service or instance whose configuration has expired.
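The following is a minimal standalone sketch of the validity check described above. It is not the actual HDFS DataNode source code; the class and method names are hypothetical, but the rule mirrors the documented behavior: the configured value must lie in the range -1 to n-1, and -1 is treated as n-1.

```java
/**
 * Hypothetical sketch of the startup validity rule described above;
 * this is not the actual HDFS DataNode source code.
 *
 * Rule: dfs.datanode.failed.volumes.tolerated must lie in [-1, n-1],
 * where n is the number of directories in dfs.datanode.data.dir, and
 * -1 is shorthand for n-1 (at least one volume must remain valid).
 */
public class VolumeToleranceCheck {

    /**
     * Returns the effective number of tolerated volume failures, or
     * throws if the configured value would prevent DataNode startup.
     */
    static int validate(int configuredVolumes, int failedVolumesTolerated) {
        if (failedVolumesTolerated == -1) {
            // -1: tolerate up to n-1 failures.
            return configuredVolumes - 1;
        }
        if (failedVolumesTolerated < -1
                || failedVolumesTolerated >= configuredVolumes) {
            // The failure mode in this section: the value equals (or
            // exceeds) the number of configured volumes.
            throw new IllegalArgumentException(
                    "Invalid value " + failedVolumesTolerated
                    + " for dfs.datanode.failed.volumes.tolerated:"
                    + " must be in the range [-1, "
                    + (configuredVolumes - 1) + "]");
        }
        return failedVolumesTolerated;
    }

    public static void main(String[] args) {
        int volumes = 4; // e.g. four directories in dfs.datanode.data.dir

        System.out.println(validate(volumes, 3));  // 3: starts normally
        System.out.println(validate(volumes, -1)); // 3: effective value n-1

        try {
            validate(volumes, 4); // the scenario in this section's title
        } catch (IllegalArgumentException e) {
            System.out.println("Startup rejected: " + e.getMessage());
        }
    }
}
```

Run with four configured volumes, the sketch prints 3 twice and then the rejection message, matching the behavior described in the Symptom section.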