Configuring the Label Policy (NodeLabel) for HDFS File Directories
Scenario
You may need to choose the nodes that store HDFS data blocks based on data characteristics. You can specify which DataNodes store the data blocks of a file by assigning one or more labels to each DataNode and setting a label expression on an HDFS directory or file.
With label-based block placement, HDFS first selects the DataNodes whose labels match the file's label expression and then chooses a suitable node among them.
When adjusting the HDFS data block replication policy, you must:
- Ensure data reliability and integrity.
- Minimize cross-rack data transmission to improve transmission efficiency.
- Balance the load of nodes.
- Perform sufficient tests to ensure that the custom policy can work properly.
Proper configuration of data block replication policies enables HDFS to better adapt to different application scenarios, improving the performance and reliability of the entire cluster.
- Scenario 1: DataNode partitioning
When different application data must run on different nodes for separate management, label expressions can be used to separate services, storing each service's data on the corresponding nodes.
By configuring the NodeLabel feature, you can perform the following operations:
- Store data in /HBase to DN1, DN2, DN3, and DN4.
- Store data in /Spark to DN5, DN6, DN7, and DN8.
- Run the hdfs nodelabel -setLabelExpression -expression 'LabelA[fallback=NONE]' -path /HBase command to set an expression for the HBase directory. As shown in Figure 1, the data block replicas of files in the /HBase directory are placed on the nodes labeled with LabelA, that is, DN1, DN2, DN3, and DN4.
Similarly, run the hdfs nodelabel -setLabelExpression -expression 'LabelB[fallback=NONE]' -path /Spark command to set an expression for the Spark directory. Data block replicas of files in the /Spark directory can be placed only on nodes labeled with LabelB, that is, DN5, DN6, DN7, and DN8.
- For details about how to set labels for a data node, see Configuring the Data Block Replication Policy for DataNode Nodes.
- If a cluster has multiple racks, each label can contain DataNodes of multiple racks to ensure reliability of data block placement.
- Scenario 2: Specifying replica location when there are multiple racks
In a heterogeneous cluster, allocate nodes with high reliability for storing important business data. Specify the replica locations using label expressions, and store one replica of file data blocks on a high-reliability node.
Data blocks in the /data directory have three replicas by default. In this case, at least one replica is stored on a node of RACK1 or RACK2 (the nodes of RACK1 and RACK2 are highly reliable), and the other two are stored separately on the nodes of RACK3 and RACK4.
Figure 2 Scenario example
- Run the hdfs nodelabel -setLabelExpression -expression 'LabelA||LabelB[fallback=NONE],LabelC,LabelD' -path /data command to set an expression for the /data directory.
- When data is written to the /data directory, at least one data block replica is stored on a node labeled with LabelA or LabelB, and the other two data block replicas are stored separately on nodes labeled with LabelC and LabelD.
Notes and Constraints
- This section applies to MRS 3.x or later.
- In configuration files, the key and value are separated by an equals sign (=), a colon (:), or a space. Therefore, the hostname in the key cannot contain these characters.
Configuring the Data Block Replication Policy for DataNode Nodes
- Log in to FusionInsight Manager.
For details about how to log in to FusionInsight Manager, see Accessing MRS Manager.
- Choose Cluster > Services > HDFS > Configurations > All Configurations.
- Search for the following parameters and change their values as required.
- Click Save. Go to the Instances page and check whether there are instances whose configurations have expired. If yes, select the instances and choose More > Restart Instance. The configurations take effect after the restart.
- Then log in to the HDFS client by referring to Using the HDFS Client and run the following command to view the label information of each DataNode:
hdfs nodelabel -listNodeLabels [-all] [-node <node_name>]
- -all: displays all label groups, including labels that are not associated with any node. By default, only labels associated with nodes are displayed.
- -node <name>: views the label groups allocated to a specified node (hostname or IP address).
Setting Label Expressions for HDFS Directories and Files
- Configuring Labels on Manager
- Log in to FusionInsight Manager.
For details about how to log in to FusionInsight Manager, see Accessing MRS Manager.
- Choose Cluster > Services > HDFS > Configurations > All Configurations.
- Search for the following parameters and change their values as required.
Parameter: path2expression
Description: Configures the mapping between HDFS directories and labels.
- If the configured HDFS directory does not exist, the configuration can still succeed. When a directory with the same name is created later, the configured label mapping is applied to that directory within 30 minutes.
- After a labeled directory is deleted, a new directory with the same name inherits the deleted directory's mapping within 30 minutes.
- Click Save for the configuration to take effect. You do not need to restart the HDFS service.
- Run the following command to check whether the directory label takes effect on the HDFS Client by referring to Configuring Labels on the Cluster Client:
hdfs nodelabel -listLabelExpression -path <path>
In the preceding command, <path> indicates the HDFS directory to be checked.
- Configuring Labels on the Cluster Client
- Install the client. If the client has been installed, skip this step.
For example, the installation directory is /opt/client. You need to change it to the actual installation directory.
For details about how to download and install the cluster client, see Installing an MRS Cluster Client.
- Log in to the node where the client is installed as the client installation user.
- Go to the client installation directory, for example, /opt/client.
cd /opt/client
- Run the following command to configure environment variables:
source bigdata_env
- If the cluster is in security mode, run the following command to authenticate the user. If the cluster is in normal mode, skip this step.
kinit <component service user>
- Run the following command to set the label expression of the directory or file:
hdfs nodelabel -setLabelExpression <expression> -add <label1,label2,...>
hdfs nodelabel -setLabelExpression <expression> -remove <label1,label2,...>
- <expression>: expression of a node, which supports the following syntax:
- Hostname/IP address: For example, host1.test.com or 192.168.1.100.
- Wildcard: An asterisk (*) matches any number of characters, and a question mark (?) matches a single character. For example, *.test.com.
- Regular expression: Starts with tilde (~), for example, ~.*\.test\.com.
- -add: Adds the specified labels to the matched nodes.
- -remove: Removes the specified labels from the matched nodes.
For example, add the hot label to data-node-[1-5].
hdfs nodelabel -setLabelExpression "data-node-[1-5]" -add hot
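The three node-expression forms listed above (exact hostname/IP, glob wildcards, and tilde-prefixed regular expressions) can be sketched as a small matcher. The following Python sketch is illustrative only; the function name and structure are assumptions, not the actual HDFS implementation:

```python
import fnmatch
import re

def node_matches(expression, hostname):
    """Illustrative matcher for the three node-expression forms:
    a regex prefixed with '~', a glob with * or ?, or an exact name."""
    if expression.startswith('~'):
        # Regular expression form, e.g. ~.*\.test\.com
        return re.fullmatch(expression[1:], hostname) is not None
    if '*' in expression or '?' in expression:
        # Wildcard form, e.g. *.test.com
        return fnmatch.fnmatch(hostname, expression)
    # Exact hostname or IP address form
    return expression == hostname
```

For example, both `*.test.com` and `~.*\.test\.com` match `host1.test.com`, while only the hosts in the named domain match; other domains are rejected.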
- Configuring Labels Through Java APIs
To set label expressions through the Java API, create an instance of the NodeLabelFileSystem class and invoke its setLabelExpression(String src, String labelExpression) method, where src indicates a directory or file path on HDFS and labelExpression indicates the label expression.
Block Replica Location Selection
NodeLabel supports a different label policy for each replica. The expression label-1,label-2,label-3 indicates that the three replicas are placed on DataNodes labeled with label-1, label-2, and label-3, respectively. Replica policies are separated by commas (,).
To place two replicas on DataNodes with label-1, set the expression to label-1[replica=2],label-2,label-3. If the default number of replicas is 3, two nodes with label-1 and one node with label-2 are selected. If the default number of replicas is 4, two nodes with label-1, one node with label-2, and one node with label-3 are selected. Replicas are assigned to the policies from left to right, each policy taking the number of replicas it specifies (default 1). If the number of replicas exceeds the total specified by the expression, for example a default of 5, the extra replicas are placed on nodes matching the last policy, that is, nodes labeled with label-3.
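The left-to-right assignment described above can be illustrated with a short simulation. The Python sketch below is a hypothetical re-implementation of the documented rule, not HDFS code: each policy takes its [replica=N] count (default 1), and any surplus goes to the last policy.

```python
import re

def allocate_replicas(expression, total_replicas):
    """Simulate how replica counts are assigned to label policies,
    left to right, with the last policy absorbing any surplus."""
    policies = []
    for part in expression.split(','):
        m = re.match(r'(.+?)\[.*?replica=(\d+).*?\]$', part)
        if m:
            policies.append((m.group(1), int(m.group(2))))
        else:
            # No explicit count: strip any other options, default to 1.
            policies.append((re.sub(r'\[.*\]$', '', part), 1))
    allocation = {}
    remaining = total_replicas
    for i, (label, count) in enumerate(policies):
        if remaining <= 0:
            break
        take = min(count, remaining)
        if i == len(policies) - 1:  # last policy absorbs the surplus
            take = remaining
        allocation[label] = take
        remaining -= take
    return allocation
```

With the expression label-1[replica=2],label-2,label-3 this yields two label-1 replicas and one label-2 replica for a total of 3, adds a label-3 replica for a total of 4, and places the extra fifth replica on label-3 for a total of 5, matching the behavior described above.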
When the ACLs function is enabled and the user does not have the permission to access the labels used in the expression, the DataNode with the label is not selected for the replica.
Redundant Block Replica Deletion
If the number of block replicas exceeds the value of dfs.replication, HDFS deletes the redundant replicas to improve resource utilization. dfs.replication specifies the number of replicas kept for each file. To view its value, go to the HDFS service configuration page by referring to Modifying Cluster Service Configuration Parameters and search for the parameter.
The deletion rules are as follows:
- Preferentially delete replicas that do not meet any expression.
For example: The default number of file replicas is 3.
The label expression of /test is LA[replica=1],LB[replica=1],LC[replica=1];
The file replicas of /test are distributed on four nodes (D1 to D4), corresponding to labels (LA to LD).
D1:LA D2:LB D3:LC D4:LD
Then, block replicas on node D4 will be deleted.
- If all replicas meet the expressions, delete the redundant replicas which are beyond the number specified by the expression.
For example: The default number of file replicas is 3.
The label expression of /test is LA[replica=1],LB[replica=1],LC[replica=1];
The file replicas of /test are distributed on the following four nodes, corresponding to the following labels.
D1:LA D2:LA D3:LB D4:LC
Then, block replicas on node D1 or D2 will be deleted.
- If the file owner or the file owner's group cannot access a label, the replica on the DataNode mapped to that label is deleted first.
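The first two deletion rules can be illustrated with a short simulation. The Python sketch below is illustrative only: the function and parameter names are assumptions, and the permission-based third rule is not modeled.

```python
def choose_deletions(replica_labels, policy_counts, target):
    """Pick redundant replicas to delete, following the documented order:
    first replicas whose label matches no policy, then replicas beyond
    the count a policy specifies.

    replica_labels: dict node -> label, e.g. {"D1": "LA", ...}
    policy_counts:  dict label -> allowed replicas, e.g. {"LA": 1, ...}
    target:         desired total number of replicas (dfs.replication)
    """
    to_delete = []
    kept_per_label = {}
    # Rule 1: replicas that do not meet any expression go first.
    for node, label in replica_labels.items():
        if label not in policy_counts:
            to_delete.append(node)
    # Rule 2: replicas beyond a policy's specified count go next.
    for node, label in replica_labels.items():
        if label in policy_counts:
            kept = kept_per_label.get(label, 0)
            if kept < policy_counts[label]:
                kept_per_label[label] = kept + 1
            else:
                to_delete.append(node)
    surplus = len(replica_labels) - target
    return to_delete[:max(surplus, 0)]
```

Applied to the two examples above (expression LA[replica=1],LB[replica=1],LC[replica=1], three replicas per file): with replicas on D1:LA D2:LB D3:LC D4:LD, the replica on D4 is deleted; with replicas on D1:LA D2:LA D3:LB D4:LC, one of the LA replicas is deleted.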
Example of a Label-Based Block Placement Policy
Assume that there are six DataNodes, namely, dn-1, dn-2, dn-3, dn-4, dn-5, and dn-6 in a cluster and the corresponding IP address range is 10.1.120.[1-6]. Six directories must be configured with label expressions. The default number of block replicas is 3.
- The following shows three forms of DataNode label expressions in the host2labels file. The three forms are equivalent.
- Regular expression of the host name
/dn-[1456]/ = label-1,label-2
/dn-[26]/ = label-1,label-3
/dn-[3456]/ = label-1,label-4
/dn-5/ = label-5
- IP address range expression
10.1.120.[1-6] = label-1
10.1.120.1 = label-2
10.1.120.2 = label-3
10.1.120.[3-6] = label-4
10.1.120.[4-6] = label-2
10.1.120.5 = label-5
10.1.120.6 = label-3
- Common host name expression
/dn-1/ = label-1, label-2
/dn-2/ = label-1, label-3
/dn-3/ = label-1, label-4
/dn-4/ = label-1, label-2, label-4
/dn-5/ = label-1, label-2, label-4, label-5
/dn-6/ = label-1, label-2, label-3, label-4
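As an illustration of how the IP address range form resolves, the following Python sketch (a hypothetical helper, not the HDFS parser) merges the labels of every entry that matches a given IP address:

```python
import re

def labels_for_ip(ip, mappings):
    """Merge the labels of all host2labels entries matching an IP.
    An entry like "10.1.120.[1-6]" matches last octets 1 through 6."""
    labels = set()
    for pattern, names in mappings:
        m = re.match(r'^(\d+\.\d+\.\d+)\.\[(\d+)-(\d+)\]$', pattern)
        if m:
            prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
            base, last = ip.rsplit('.', 1)
            if base == prefix and lo <= int(last) <= hi:
                labels.update(names)
        elif pattern == ip:  # exact IP entry
            labels.update(names)
    return labels

# The IP address range expressions from the example above.
HOST2LABELS = [
    ("10.1.120.[1-6]", {"label-1"}),
    ("10.1.120.1", {"label-2"}),
    ("10.1.120.2", {"label-3"}),
    ("10.1.120.[3-6]", {"label-4"}),
    ("10.1.120.[4-6]", {"label-2"}),
    ("10.1.120.5", {"label-5"}),
    ("10.1.120.6", {"label-3"}),
]
```

For example, 10.1.120.4 (dn-4) resolves to label-1, label-2, and label-4, which agrees with the common host name form of the same mapping.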
- The label expressions of the directories are set as follows:
/dir1 = label-1
/dir2 = label-1 && label-3
/dir3 = label-2 || label-4[replica=2]
/dir4 = (label-2 || label-3) && label-4
/dir5 = !label-1
/sdir2.txt = label-1 && label-3[replica=3,fallback=NONE]
/dir6 = label-4[replica=2],label-2
For details about how to set label expressions, see Configuring Labels on the Cluster Client.
The file data block storage locations are as follows:
- Data blocks of files in the /dir1 directory can be stored on any of the following nodes: dn-1, dn-2, dn-3, dn-4, dn-5, and dn-6.
- Data blocks of files in the /dir2 directory can be stored on the dn-2 and dn-6 nodes. The default number of block replicas is 3. The expression matches only two DataNodes. The third replica will be stored on one of the remaining nodes in the cluster.
- Data blocks of files in the /dir3 directory can be stored on any three of the following nodes: dn-1, dn-3, dn-4, dn-5, and dn-6.
- Data blocks of files in the /dir4 directory can be stored on the dn-4, dn-5, and dn-6 nodes.
- Data blocks of files in the /dir5 directory do not match any DataNode and will be stored on any three nodes in the cluster, which is the same as the default block selection policy.
- For the data blocks of the /sdir2.txt file, two replicas are stored on the dn-2 and dn-6 nodes. The remaining replica is not stored on any node because fallback=NONE is set.
- Data blocks of files in the /dir6 directory are stored on two nodes with label-4 (selected from dn-3, dn-4, dn-5, and dn-6) and one node with label-2. If more than three replicas are specified for files in the /dir6 directory, the extra replicas are stored on nodes with label-2.
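The boolean label expressions used above (&&, ||, !, and parentheses) can be evaluated against the node labels with a toy evaluator. The Python sketch below is illustrative only, not the HDFS implementation; options such as [replica=2] are stripped before evaluation.

```python
import re

def matching_nodes(expression, node_labels):
    """Return the nodes whose label set satisfies a boolean expression."""
    expr = re.sub(r'\[[^\]]*\]', '', expression)  # drop [replica=..] etc.
    expr = expr.replace('&&', ' and ').replace('||', ' or ').replace('!', ' not ')
    expr = re.sub(r'(label-\d+)', r'("\1" in labels)', expr)
    return sorted(node for node, labels in node_labels.items()
                  if eval(expr, {"__builtins__": {}}, {"labels": labels}))

# DataNode labels from the host2labels example above.
NODES = {
    "dn-1": {"label-1", "label-2"},
    "dn-2": {"label-1", "label-3"},
    "dn-3": {"label-1", "label-4"},
    "dn-4": {"label-1", "label-2", "label-4"},
    "dn-5": {"label-1", "label-2", "label-4", "label-5"},
    "dn-6": {"label-1", "label-2", "label-3", "label-4"},
}
```

Evaluating the /dir2 expression label-1 && label-3 selects dn-2 and dn-6, the /dir4 expression (label-2 || label-3) && label-4 selects dn-4, dn-5, and dn-6, and !label-1 (/dir5) matches no node, all consistent with the placement results listed above.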
Helpful Links
- In the resource pool scenario, the cluster is divided into two resource pools using the NodeLabel feature. All nodes in the YARN resource pool are blacklisted, causing tasks to remain in the running state. To rectify the fault, see Why Does Yarn Not Release the Blacklist Even All Nodes Are Added to the Blacklist?.
- For more information about NodeLabel, see YARN Node Labels.