Configuring HDFS Disk Balancing

Scenario

DiskBalancer is an online disk balancer that balances disk data on running DataNodes based on various indicators.

It works similarly to the HDFS Balancer. The differences are as follows:

The HDFS Balancer tool is used to balance data between DataNodes.
HDFS DiskBalancer is used to balance data between disks on a single DataNode.

Data may become unevenly distributed across disks if a large number of files have been deleted from a long-running cluster, or if disk capacity has been expanded on a node in the cluster. Uneven data distribution can degrade the concurrent read/write performance of HDFS or cause service failures due to inappropriate HDFS write policies. In this case, balance the data density among the disks on a node to prevent heterogeneous small disks from becoming the performance bottleneck of the node.

Notes and Constraints

This section applies to MRS 3.x or later.
DiskBalancer supports data migration only between disks of the same type, for example, from SSD to SSD and from DISK to DISK.
Enabling this function consumes disk I/O and network bandwidth on the involved nodes. Enable this function during off-peak hours.
To troubleshoot performance issues, check the cluster event information for HDFS disk balancing events. If such events occurred, check whether DiskBalancer is enabled in the cluster.
After the automatic DiskBalancer function is enabled, the ongoing task stops only after the current data balancing is complete. The task cannot be canceled during the balancing.
You can manually specify certain nodes for data balancing on the client.

Configuring Automatic Disk Balancing

Log in to FusionInsight Manager.

For details about how to log in to FusionInsight Manager, see Accessing MRS Manager.
Choose Cluster > Services > HDFS > Configurations > All Configurations.

Search for the following parameters and change their values as required.

**Table 1** Parameters
Parameter	Description	Default Value
dfs.disk.balancer.auto.enabled	Indicates whether to enable the HDFS DiskBalancer function. The default value is false, indicating that this function is disabled.	false
dfs.disk.balancer.auto.cron.expression	Cron expression that controls the start time of the HDFS disk balancing operation. This parameter is valid only when dfs.disk.balancer.auto.enabled is set to true. The default value is 0 1 * * 6, indicating that tasks are executed at 01:00 every Saturday. For details about the expression, see Table 2.	0 1 * * 6
dfs.disk.balancer.max.disk.throughputInMBperSec	Specifies the maximum disk bandwidth that can be used for disk data balancing. The unit is MB/s, and the default value is 10. Set this parameter based on the actual disk conditions of the cluster.	10
dfs.disk.balancer.max.disk.errors	Specifies the maximum number of errors that are allowed in a specified movement process. If the number of errors exceeds this threshold, the movement fails.	5
dfs.disk.balancer.block.tolerance.percent	Specifies the tolerated difference, as a percentage, between the amount of data stored on each disk and its ideal amount during data balancing among disks. For example, if the ideal amount of data on each disk is 1 TB and this parameter is set to 10, the target disk is considered balanced once it stores 900 GB. Value range: 1 to 100.	10
dfs.disk.balancer.plan.threshold.percent	Specifies the data density difference that is allowed between two disks during disk data balancing. If the absolute value of the data density difference between any two disks exceeds the threshold, data balancing is required. Value range: 1 to 100.	10
dfs.disk.balancer.top.nodes.number	Specifies the top N nodes whose disk data needs to be balanced in the cluster. The returned DataNode list is refreshed continuously. Therefore, you do not need to set this parameter to a large value.	5
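The arithmetic behind dfs.disk.balancer.block.tolerance.percent and dfs.disk.balancer.plan.threshold.percent can be illustrated with a minimal shell sketch. It is not the DataNode's internal logic. The function names are hypothetical, 1 TB is treated as 1000 GB to match the example in Table 1, and each disk's used-space percentage is assumed to stand in for its data density.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the arithmetic described in Table 1,
# not the actual DiskBalancer implementation.

# Amount of data (GB) at which a disk counts as balanced, given the ideal
# amount per disk and dfs.disk.balancer.block.tolerance.percent.
tolerance_target_gb() {
  local ideal_gb="$1" tolerance_percent="$2"
  echo $(( ideal_gb * (100 - tolerance_percent) / 100 ))
}

# Succeeds (exit status 0) if the gap between the fullest and the emptiest
# disk exceeds dfs.disk.balancer.plan.threshold.percent.
# Assumption: every argument after the threshold is one disk's used-space
# percentage as an integer, used here as a stand-in for data density.
needs_balancing() {
  local threshold="$1"; shift
  local min="$1" max="$1" used
  for used in "$@"; do
    (( used < min )) && min="$used"
    (( used > max )) && max="$used"
  done
  (( max - min > threshold ))
}
```

For example, `tolerance_target_gb 1000 10` prints 900, and `needs_balancing 10 80 65 72 && echo "balancing required"` reports that balancing is required because the gap between 80% and 65% exceeds the threshold of 10.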

Table 2 lists the CRON expressions used for HDFS disk balancing. To use this function, set dfs.disk.balancer.auto.enabled to true. Set other parameters based on the cluster status.

**Table 2** CRON expressions
Column	Description
1	Minute. The value ranges from 0 to 59.
2	Hour. The value ranges from 0 to 23.
3	Day of the month. The value ranges from 1 to 31.
4	Month. The value ranges from 1 to 12.
5	Day of the week. The value ranges from 0 to 6. 0 indicates Sunday.
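To check how an expression maps onto the columns in Table 2 before saving it, a small helper such as the following can be used. This is a sketch with a hypothetical function name (describe_cron). It only splits the five fields and labels them, and it does not validate the value ranges.

```shell
#!/usr/bin/env bash
# Sketch: splits a five-field cron expression into the columns of Table 2.
# The expression must be passed as one quoted argument so that the shell
# does not expand "*" into file names.
describe_cron() {
  local minute hour day month weekday
  read -r minute hour day month weekday <<< "$1"
  echo "minute=${minute} hour=${hour} day=${day} month=${month} weekday=${weekday}"
}
```

Running `describe_cron '0 1 * * 6'` prints `minute=0 hour=1 day=* month=* weekday=6`, which corresponds to 01:00 every Saturday. By the same reading, `0 2 * * 0` would start the task at 02:00 every Sunday.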

Click Save to make configurations take effect. You do not need to restart the HDFS service.

Manually Performing Disk Balancing

Install the client. If the client has been installed, skip this step.

The following steps use /opt/client as an example installation directory. Replace it with the actual installation directory.

For details about how to download and install the cluster client, see Installing an MRS Cluster Client.
Log in to the node where the client is installed as the client installation user.
Go to the client installation directory, for example, /opt/client.
```
cd /opt/client
```
Run the following command to configure environment variables:
```
source bigdata_env
```
If the cluster is in security mode, run the following command to authenticate the user. If the cluster is in normal mode, skip this step. The user must have the supergroup permission.
```
kinit Component service user
```

Run the following commands as required to use the DiskBalancer function:

**Table 3** DiskBalancer commands
Syntax	Description
hdfs diskbalancer -report -top <N>	Queries the top N nodes whose disk data needs to be balanced in a cluster. Set N to an integer greater than 0.
hdfs diskbalancer -plan <Hostname | IP Address>	Generates a data balancing plan for a specified node to optimize its data distribution. The command generates a JSON file for the DataNode that contains information about the source disk, target disk, and blocks to be moved. Hostname indicates the hostname of the node, and IP Address indicates the IP address of the node. To obtain the IP address and hostname of the target DataNode, log in to Manager and choose Cluster > Services > HDFS > Instances. You can also add the following optional parameters to this command: -threshold <value>: Sets the imbalance threshold. Only nodes that exceed the threshold are processed. -bandwidth <value>: Sets the bandwidth limit for data migration to ensure cluster stability.
hdfs diskbalancer -query <Hostname:port>	Queries the running status of the DiskBalancer task running on a specified node. Hostname: Specifies the hostname of the node. To obtain the hostname of the target DataNode, log in to Manager, choose Cluster > Services > HDFS > Instances. On the displayed page, view and record the hostname of the target DataNode. port: Specifies the port number of the DataNode IPC server. To obtain the port number, log in to Manager, choose Cluster > Services > HDFS > Configurations > All Configurations, search for dfs.datanode.ipc.port, and record its value. The default value is 9867.
hdfs diskbalancer -execute <planfile>	Executes a specified disk balancing plan. planfile indicates the JSON file generated by running the hdfs diskbalancer -plan <Hostname | IP Address> command. Use an absolute path.
hdfs diskbalancer -cancel <planfile>	Cancels a running disk balancing plan. planfile indicates the JSON file generated by running the hdfs diskbalancer -plan <Hostname | IP Address> command. Use an absolute path.
hdfs diskbalancer -help <command>	Provides more help information about disk balancing commands.
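The commands in Table 3 are typically combined into a report, plan, execute, and query sequence for one DataNode. The following dry-run sketch prints that sequence without contacting the cluster. The function name, the value 5 for -top, and the example arguments are assumptions for illustration. The plan file path must be the one actually produced by the -plan step.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the commands from Table 3 in a typical
# order for one DataNode. It only echoes text and never runs hdfs itself.
# $1: DataNode hostname
# $2: absolute path of the plan file generated by the -plan step
# $3: DataNode IPC port (dfs.datanode.ipc.port); defaults to 9867
print_diskbalancer_workflow() {
  local host="$1" planfile="$2" port="${3:-9867}"
  # 1. Find the nodes that most need balancing (5 is an example value).
  echo "hdfs diskbalancer -report -top 5"
  # 2. Generate a balancing plan for the chosen node.
  echo "hdfs diskbalancer -plan ${host}"
  # 3. Execute the generated plan.
  echo "hdfs diskbalancer -execute ${planfile}"
  # 4. Check the status of the running task.
  echo "hdfs diskbalancer -query ${host}:${port}"
}
```

After reviewing the printed commands, run them one by one on the client. The optional -threshold and -bandwidth parameters described in Table 3 can be appended to the -plan command, and hdfs diskbalancer -cancel <planfile> stops a plan that is already running.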