
Insufficient Number of Replicas Is Reported During High Concurrent HDFS Writes

Symptom

File writes to HDFS fail occasionally.

The operation log is as follows:

105 | INFO  | IPC Server handler 23 on 25000 | IPC Server handler 23 on 25000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 192.168.1.96:47728 Call#1461167 Retry#0 | Server.java:2278 
java.io.IOException: File /hive/warehouse/000000_0.835bf64f-4103 could only be replicated to 0 nodes instead of minReplication (=1).  There are 3 datanode(s) running and 3 node(s) are excluded in this operation.

Cause Analysis

  • HDFS reserves space for files being written: a full 128 MB block is reserved for each block under write, no matter whether the file is 10 MB or 1 GB. If a 10 MB file is written, it occupies 10 MB of the first block and the remaining space of about 118 MB is released once the write completes. If a 1 GB file is written, HDFS writes it block by block and releases any unused space after the file is written.
  • If a large number of files are written concurrently, the disk space reserved for the blocks being written can be exhausted. As a result, file writes fail, as illustrated by the sketch after this list.
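
The following sketch gives a rough sense of the scale of this reservation. The 128 MB block size, the number of concurrent writes, and the replication factor used here are illustrative assumptions, not values taken from the log above.

BLOCK_SIZE_MB = 128        # a full block is reserved for each file being written
CONCURRENT_WRITES = 1000   # assumed number of files being written at the same time
REPLICATION = 3            # each block is reserved on every DataNode holding a replica
DATANODES = 3              # number of DataNodes, as in the log above

# Every in-flight file reserves a full block on each replica DataNode,
# even if the file itself is only a few MB.
reserved_per_datanode_gb = CONCURRENT_WRITES * BLOCK_SIZE_MB * REPLICATION / DATANODES / 1024
print(f"Space reserved per DataNode: {reserved_per_datanode_gb:.0f} GB")
# -> Space reserved per DataNode: 125 GB
# If a DataNode has less free space than this, it is excluded from the write,
# which produces the "could only be replicated to 0 nodes" error shown above.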

Solution

  1. Log in to the HDFS WebUI and go to the JMX page of the DataNode.

    1. On the native HDFS page, choose Datanodes.
    2. Locate the target DataNode and click the HTTP address to go to the DataNode details page.
    3. Change datanode.html in the URL to jmx.

  2. Search for the XceiverCount indicator. If the value of this indicator multiplied by the block size exceeds the DataNode disk capacity, the disk space reserved for block writes is insufficient. (A scripted version of this check is shown after these steps.)
  3. You can use either of the following methods to solve the problem:

    Method 1: Reduce the service concurrency.

    Method 2: Combine multiple files into one file to reduce the number of files to be written.
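
The check in step 2 can also be scripted. The following Python sketch queries the DataNode /jmx page reached in step 1, reads the XceiverCount indicator, and multiplies it by the block size. The DataNode address and the 128 MB block size are assumptions; replace them with the values of your cluster.

import json
import urllib.request

# Assumption: replace with the HTTP address of the target DataNode from step 1.
DATANODE_JMX_URL = "http://192.168.1.10:9864/jmx"
# Assumption: the cluster uses the default HDFS block size of 128 MB.
BLOCK_SIZE_BYTES = 128 * 1024 * 1024

with urllib.request.urlopen(DATANODE_JMX_URL) as resp:
    beans = json.load(resp)["beans"]

# Search every MBean on the JMX page for the XceiverCount indicator (step 2).
xceiver_count = next((b["XceiverCount"] for b in beans if "XceiverCount" in b), None)
if xceiver_count is None:
    raise SystemExit("XceiverCount not found in the JMX output")

reserved_gb = xceiver_count * BLOCK_SIZE_BYTES / 1024 ** 3
print(f"XceiverCount: {xceiver_count}")
print(f"Space reserved for blocks being written: about {reserved_gb:.1f} GB")
# If this value exceeds the remaining disk capacity of the DataNode, the space
# reserved for block writes is insufficient; reduce the concurrency or combine
# small files as described in step 3.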