HDFS Auto Recovery from Cluster Power-off

Scenario

HDFS data is written to the OS cache before it is written to disk. HDFS considers a write complete once the data is in the OS cache, and the OS flushes the cached data to disk afterwards. If the cluster is powered off before the flush, the cached data is lost and HDFS loses blocks. In that case, HDFS enters safe mode during startup and cannot recover automatically.
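The difference between "written to the OS cache" and "written to disk" can be illustrated outside HDFS with a small Python sketch. This is not HDFS code; the function name write_block and its sync_on_close flag are hypothetical and only mimic the idea of syncing a file before it is closed.

```python
import os


def write_block(path: str, data: bytes, sync_on_close: bool) -> None:
    """Write data to a file, optionally forcing it to disk before closing.

    Hypothetical helper for illustration only; it is not part of HDFS.
    """
    with open(path, "wb") as f:
        f.write(data)   # data is in the process buffer
        f.flush()       # data is handed to the OS cache, not necessarily on disk yet
        if sync_on_close:
            # Ask the OS to flush its cached data for this file to the storage device.
            # This is slower, but the data survives a power-off after the call returns.
            os.fsync(f.fileno())
```

Calling write_block(path, data, False) returns as soon as the OS has accepted the data, while write_block(path, data, True) also waits for the device, which is the same trade-off between durability and write performance described in this section.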

To solve this problem, HDFS provides the following configuration parameters for recovering from a cluster power-off. Adjust them based on your needs.

If dfs.datanode.synconclose is set to true, a write operation is considered complete only after the OS cache data has been written to disk. This prevents data loss caused by cluster power-off, but it degrades HDFS write performance.
dfs.namenode.safemode.threshold-pct specifies the percentage of blocks that DataNodes must report before the NameNode leaves safe mode. When this threshold is reached, the NameNode exits safe mode automatically. If the threshold is too low, a large number of replicas may be created during cluster startup.
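The threshold behavior can be sketched as a simplified model. The function below is hypothetical and is not the NameNode's actual implementation; it only encodes the rules stated in this document, including the boundary cases given for this parameter in the Procedure section (a value of 0 or less exits immediately, a value greater than 1 never exits).

```python
def can_leave_safe_mode(reported_blocks: int,
                        total_blocks: int,
                        threshold_pct: float = 0.999999) -> bool:
    """Simplified model of the safe mode exit check (illustration only)."""
    if threshold_pct <= 0:
        # Do not wait for any blocks.
        return True
    if threshold_pct > 1:
        # The threshold can never be reached, so safe mode is permanent.
        return False
    if total_blocks == 0:
        # Assumption of this sketch: an empty namespace has nothing to wait for.
        return True
    return reported_blocks / total_blocks >= threshold_pct
```

For example, with the default value of 0.999999, can_leave_safe_mode(990, 1000) is False because only 99% of the blocks have been reported, which is why blocks lost in a power-off can keep the cluster in safe mode.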

This function is available in MRS 3.5.0 and later versions.

Procedure

Log in to FusionInsight Manager.
Choose Cluster > Services > HDFS, click the Configurations tab, and then click All Configurations.

Search for and set the following parameters as required.

Parameter	Description	Default Value
dfs.datanode.synconclose	If this parameter is set to false, block data is not written to disk immediately while files are being stored, so a power outage or system restart may cause data loss. If this parameter is set to true, data loss is avoided in the event of a power outage or system restart, but write performance deteriorates. Set this parameter based on the application scenario.	false
dfs.namenode.safemode.threshold-pct	Percentage of blocks that must meet the minimum replication requirement defined by dfs.namenode.replication.min. Value range: 0 to 1.0. When the value is less than or equal to 0, the NameNode exits safe mode without waiting for any blocks. When the value is greater than 1, the NameNode remains in safe mode permanently.	0.999999

Save the configuration.
On the HDFS Instances tab, select all NameNode and DataNode instances, choose More > Instance Rolling Restart, verify the password, confirm the operation impact, and click OK. Wait until the rolling restart is complete.
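After the restart, the result can be checked from a client node. The following is a minimal sketch, assuming that the hdfs client command is on PATH, that the user is already authenticated, and that "hdfs dfsadmin -safemode get" prints one "Safe mode is ON" or "Safe mode is OFF" line per NameNode. The helper names are hypothetical. Note that "hdfs getconf" reads the configuration visible to the client, which may differ from what a DataNode has loaded.

```python
import subprocess


def is_safe_mode_on(dfsadmin_output: str) -> bool:
    """Return True if any line of the output reports safe mode as ON."""
    return any("Safe mode is ON" in line
               for line in dfsadmin_output.splitlines())


def query_safe_mode() -> str:
    """Run 'hdfs dfsadmin -safemode get' and return its standard output."""
    result = subprocess.run(["hdfs", "dfsadmin", "-safemode", "get"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def query_conf(key: str) -> str:
    """Run 'hdfs getconf -confKey <key>' and return the configured value."""
    result = subprocess.run(["hdfs", "getconf", "-confKey", key],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

For example, query_conf("dfs.datanode.synconclose") should return the value set in the procedure above, and is_safe_mode_on(query_safe_mode()) should be False once the NameNodes have left safe mode.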