Updated on 2024-10-08 GMT+08:00

What Should I Do If the Partition-based Task Blacklist Is Abnormal?

Question

A MapReduce task fails, and the ratio of faulty nodes to the total number of nodes in the cluster is lower than the blocklist threshold specified by yarn.resourcemanager.am-scheduling.node-blacklisting-disable-threshold. Why are the faulty nodes not added to the blocklist?

Answer

If the proportion of blocklisted nodes exceeds the threshold, all blocklisted nodes are released. Nominally, the threshold is compared with the ratio of faulty nodes to all nodes in the cluster. However, each node now has a label expression, and the ratio is calculated against the number of nodes that match the effective node label expression. In other words, the value compared with the blocklist threshold is the ratio of faulty nodes to the nodes that match the effective node label expression.

Assume that the cluster has 100 nodes, 10 of which (labelA) match the valid node label expression. Assume also that all 10 of these nodes are faulty and that the blocklist threshold is the default value, 0.33. Calculated against all nodes in the cluster, the ratio is 10/100 = 0.1, which is far below the threshold. In that case, the 10 nodes would never be released from the blocklist, so MapReduce tasks could never obtain nodes and applications could not run properly. For this reason, the ratio is calculated against the number of nodes that match the valid node label expression: 10/10 = 1, which is greater than the blocklist release threshold, so all nodes are released.
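The arithmetic in this example can be checked with a short sketch. It only illustrates the comparison described above and is not the YARN or MapReduce implementation. The function names blocklist_ratio and blocklist_is_released are hypothetical, and the strict "greater than" comparison is an assumption based on the wording "exceeds the threshold".

```python
# Illustrative sketch only: these helpers are hypothetical and are not part
# of any Hadoop API. They reproduce the arithmetic of the example above.

def blocklist_ratio(faulty_nodes: int, candidate_nodes: int) -> float:
    """Return the share of candidate nodes that are faulty."""
    if candidate_nodes <= 0:
        raise ValueError("candidate_nodes must be positive")
    return faulty_nodes / candidate_nodes


def blocklist_is_released(faulty_nodes: int, candidate_nodes: int,
                          threshold: float = 0.33) -> bool:
    """Assumed rule: the blocklist is released when the ratio exceeds the threshold."""
    return blocklist_ratio(faulty_nodes, candidate_nodes) > threshold


TOTAL_NODES = 100    # all nodes in the cluster
LABEL_A_NODES = 10   # nodes that match the valid label expression (labelA)
FAULTY_NODES = 10    # every labelA node is faulty

# Ratio against the whole cluster: 10/100 = 0.1, below 0.33, no release.
cluster_wide = blocklist_is_released(FAULTY_NODES, TOTAL_NODES)

# Ratio against the labelA nodes only: 10/10 = 1.0, above 0.33, release.
label_scoped = blocklist_is_released(FAULTY_NODES, LABEL_A_NODES)

print(f"cluster-wide ratio released: {cluster_wide}")
print(f"label-scoped ratio released: {label_scoped}")
```

With the numbers from the example, only the label-scoped calculation releases the blocklist, which matches the behavior described in the answer.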

Therefore, even if the ratio of faulty nodes to all nodes in the cluster is below the threshold, all nodes in the blocklist are released.